Commit Graph

33 Commits

Author SHA1 Message Date
Kenny Van de Maele
1cd0aa2b8c feat(provider): add GitHub Copilot provider with device-flow auth (#1480)
* feat(provider): add GitHub Copilot provider with device-flow auth

Adds GitHub Copilot as a model provider, so Copilot models (gpt-4o/4.1/5,
Claude, Gemini, …) work through the normal chat + agent loop, incl. native
tool calling and vision.

Auth is one-click via the GitHub OAuth device flow; the access token is stored
as the endpoint's (encrypted) api_key and sent directly as `Authorization:
Bearer` (no Copilot-token exchange, no refresh — matching how editors talk to
the Copilot API). Copilot is a normal ModelEndpoint detected by host; the only
provider-specific behaviour is a small set of required request headers,
injected centrally.

Sign-in is available from Settings → model endpoints ("Connect GitHub
Copilot") and from chat via `/setup copilot`.

- src/copilot.py (new), routes/copilot_routes.py (new): constants, header
  builders, device-flow start/poll, model discovery, owner-scoped endpoint
  provisioning.
- src/llm_core.py, src/endpoint_resolver.py: detect `copilot`, inject headers,
  per-request x-initiator/vision.
- src/agent_loop.py: allowlist api.githubcopilot.com for native tool schemas.
- src/model_context.py: known context windows for Copilot (no unauthenticated
  /models probe).
- static/, README, tests/test_copilot*.py.

* Tidy copilot_routes: clarify supports_tools, note _PENDING is per-process
2026-06-04 21:13:14 +02:00
Giuseppe
6d511f6e66 fix(llm): auto-detect <think> in content stream for unregistered thinking models (#2588)
* fix(llm): auto-detect <think> in content stream for unregistered thinking models

_THINKING_MODEL_PATTERNS only covers known model families by name. Qwen3-derived
models with non-standard names (e.g. Qwopus, custom QwQ forks) are not matched,
so their <think>...</think> content streams through as visible chat text instead
of being routed to the thinking display.

When the first content delta opens with <think> and the model was not already
identified as a thinking model, dynamically flag the stream as a thinking model
for the remainder of the response. This enables the existing </think> repair path
(line below) and ensures the frontend receives the full <think>...</think> wrapper
it needs to split thinking from the final answer.

The check is restricted to the very first content delta (_first_content_sent is
False) to avoid misidentifying models that happen to write "<think>" mid-answer.

Fixes #2225
Related: #2420 (covered by separate PR from @AmmarS-Analyst), #2224 (@RaresKeY)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(llm): replace inert _thinking_model flag with _in_think_tag state machine

The original auto-detect set _thinking_model=True on the first <think> chunk
but still emitted it as a regular delta and set _first_content_sent=True
immediately, so no subsequent chunk could enter the repair path.

Replace with _in_think_tag bool: enter thinking mode when first content starts
with <think>, route all chunks to the thinking channel until </think> is found,
then the tail becomes the first regular delta. Adds three regression tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(llm): replace _first_content_sent guard with _think_open_stripped

Opening-tag stripping used `not _first_content_sent` as the guard, but
_first_content_sent stays False throughout the entire think block (it only
flips when regular content is emitted). So `find(">")` ran on every
reasoning chunk — not just the first — and silently truncated everything
before the first ">" in any reasoning text containing comparisons, arrows,
or code.

Fix: add `_think_open_stripped = False` alongside `_in_think_tag`. Use it
as the strip guard in both the "still inside <think>" path and the
"</think> found in same chunk" split path. Set it True once the opening
tag is consumed so all subsequent chunks reach the thinking channel
unmolested.

Add regression test: 3-chunk stream where the middle chunk contains
"c > d" — confirms "more c " is not dropped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 20:18:19 +02:00
Giuseppe
531f426557 fix: KeyError on missing 'content' key in system messages (#2362)
A system message that arrives without a 'content' key — possible via
malformed tool results — raised a KeyError in the hot path of llm_call,
llm_call_async, and stream_llm. Replace m["content"] with
m.get("content") or "" in all three functions so a missing key degrades
to an empty string instead of crashing.

Also removes a redundant .rstrip() after .strip() in _model_activity_key.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 19:38:45 +02:00
Giuseppe
ff8f9f2188 fix: llm_call_async does not retry on HTTP 429/502/503/504 (#2364)
The retry loop raised immediately for any non-success HTTP response
regardless of attempt count. For transient upstream errors (rate limit,
bad gateway, gateway timeout) the function should back off and retry
within the existing attempt budget.

Also lets ConnectError / ConnectTimeout retry when the host has not been
cooled and attempts remain, instead of always raising on the first
connect failure.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 19:35:55 +02:00
Giuseppe
bc9104efe2 fix: SSE stream parser crashes with NoneType on providers sending null choice/usage/tc entries (#2389)
* fix: SSE parser crashes with NoneType on MiniMax-M3 (and any provider sending null choice/usage/tc)

Three guards added in stream_llm:

1. choices[0] null check — MiniMax (and some other providers) send a
   choices entry as None. `_choices[0].get("delta")` raised
   AttributeError. Now checks `_choices[0] is not None` before calling
   .get().

2. usage null guard — j["usage"] can arrive as None (not a dict) on
   some providers. Added `or {}` so subsequent .get() calls don't crash.

3. tool_calls null entry skip — individual entries in the tool_calls
   array can be None. Added `if tc is None: continue` before
   tc.get("function").

All three match the `or {}` / null-guard pattern used elsewhere in the
same block. Safe for all OpenAI-compatible providers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: guard null choice in elif-choices SSE branch

The usage-chunk path already guarded _choices[0] is not None, but the
elif "choices" branch that processes content/tool-call deltas did not.
A chunk like {"choices": [null]} or {"choices": [null], "usage": null}
reaches j["choices"][0].get("delta") and crashes with:

    'NoneType' object has no attribute 'get'

Fix: extract choices[0] into _c0 and continue to the next chunk when
it is None, matching the guard already applied in the usage path.

Adds three focused regressions covering the paths the maintainer flagged:
- {"choices": [null]}
- {"choices": [null], "usage": null}
- tool_calls array containing a null entry alongside a valid call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 13:53:10 +01:00
tanmayraut45
f59edee611 Support extra CA bundle for private-CA LLM providers (#769)
Adding GigaChat (Sber) or an on-premise enterprise LLM gateway as a
model endpoint fails on first probe with

    CERTIFICATE_VERIFY_FAILED: self-signed certificate in certificate
    chain (_ssl.c:1000)

because their TLS chain is signed by a private root CA (Russian Trusted
Root CA for GigaChat; corporate CA for on-prem) that isn't part of the
default system / certifi trust store. The endpoint shows offline in
the picker even though the URL and API key are correct (issue #722).

The right fix is to extend the trust store, not to weaken verification.
This change:

- src/tls_overrides.py: new module that resolves an opt-in env var
  LLM_CA_BUNDLE at import time, builds a shared SSLContext via
  ssl.create_default_context() (so the system / certifi bundle is
  loaded first) and layers the operator's PEM on top with
  load_verify_locations(). Exposes llm_verify() returning a value
  suitable for httpx `verify=`. Defaults to True (httpx built-in
  trust) when the env var is unset, when the file is missing, or
  when the PEM fails to load — verification is never silently
  disabled, the warning is logged and we fall back to the safe path.

- src/llm_core.py: thread llm_verify() into the shared AsyncClient
  used by stream_llm / streaming completions.

- routes/model_routes.py: thread llm_verify() into the five httpx.get
  call sites in _probe_endpoint / _ping_endpoint so adding a
  private-CA endpoint goes green on the very first probe and the
  picker stops showing it offline.

- .env.example: document LLM_CA_BUNDLE with the GigaChat case as the
  concrete example.

Deliberately NOT included: a verify=False knob (global or per-host).
Disabling verification exposes the affected endpoint to MITM, and the
operator-supplied bundle is the correct fix for legitimate private-CA
providers — so the only switch in this PR is the safe one.

Closes #722.
2026-06-04 13:18:50 +01:00
Yuri
a2e691da2b fix(models): stabilize proxy endpoint refresh behavior
* fix: support large proxy model endpoint refresh

Large OpenAI-compatible proxy endpoints can expose hundreds of models and make /v1/models slow. Treating those endpoints like local model servers caused model picker opens and background probes to repeatedly hit /models, producing timeouts and making otherwise usable endpoints appear offline.

Make model endpoint discovery cached-first for normal UI usage, add explicit proxy/API classification and refresh policy fields, exclude proxy/API endpoints from aggressive local probing, and preserve cached models when refresh fails.

Manual Test/Add/Refresh actions still fetch the full model list with longer timeouts so users can intentionally import large proxy model lists without blocking normal model picker usage.

* fix: preserve endpoint ping status semantics
2026-06-04 04:56:11 +01:00
danielroytel
39848a168b fix: recognize Gemma 4 as a thinking model and add context entry (#1642)
Gemma 4 returns reasoning_content in streaming responses via
llama-server, but the model wasn't listed in _THINKING_MODEL_PATTERNS,
causing reasoning tokens to be mishandled. Add "gemma" to the pattern
list and register Gemma 4's 128K context window in KNOWN_CONTEXT_WINDOWS
so the agent loop budgets context correctly.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 14:23:18 +09:00
Afonso Coutinho
19e62208d2 fix: streaming drops providers that emit SSE data lines with no space (#1701) 2026-06-03 13:37:14 +09:00
Afonso Coutinho
3da4edb442 fix: token usage dropped when it rides on a non-empty finish delta (#1703) 2026-06-03 13:36:57 +09:00
lekt8
126e91e8b9 Don't attempt the same (url, model) route twice in the fallback chains (#1733)
The fallback helpers (llm_call_with_fallback, llm_call_async_with_fallback,
stream_llm_with_fallback) build their candidate list as the primary target
followed by the configured fallbacks. Callers prepend the session's live
(url, model) to default_model_fallbacks, so if the user also lists their current
model among the fallbacks — a common misconfiguration — the chain re-attempts
the very route that just failed: a wasted round-trip (and, for the streaming
path, a spurious 'fallback' notice for a switch that didn't actually happen).

Add a small _dedupe_candidates() helper that filters malformed entries and drops
a later repeat of an already-seen (url, model), preserving order (first wins,
keeping its headers). Apply it in all three fallback chains.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 13:33:50 +09:00
Ethan
b9c382006e Clamp Anthropic temperature to [0.0, 1.0] in _build_anthropic_payload (#1737)
Anthropic's Messages API rejects temperature > 1.0 with HTTP 400, but
_build_anthropic_payload forwarded it verbatim. The shipped "Nietzsche" preset
uses temperature 1.2 and the UI slider allows up to 2.0, so every Claude request
under such a preset hard-broke. Clamp into [0.0, 1.0] in the Anthropic builder
only (OpenAI keeps its wider 0.0-2.0 range). Covers all three Anthropic call
paths, which build through this one function. None is passed through unchanged.

Fixes #1615

Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
2026-06-03 13:29:36 +09:00
Shreyas S Joshi
7504fedb17 fix: surface reasoning_content when content is empty (thinking models) (#1233)
Thinking models served via llama.cpp without --reasoning-format none
(e.g. Qwen3, DeepSeek-R1) route all tokens into reasoning_content and
return content="". Two call paths were silently broken:

- llm_call / llm_call_async (non-streaming): hard-keyed
  data["choices"][0]["message"]["content"] raises KeyError or returns
  empty string, discarding the entire response.

- stream_agent_loop end-of-round fallback: when full_response is empty
  but round_reasoning has content, the existing code replaced the
  response with the generic empty-response error message, discarding
  all reasoning tokens that were correctly accumulated during streaming.

Fix: in both non-streaming paths use msg.get("content") or
msg.get("reasoning_content") or "". In the streaming fallback, surface
round_reasoning as the answer before falling through to the error path.
2026-06-03 01:41:24 +09:00
Afonso Coutinho
65751186bd fix: merging consecutive user messages corrupts multimodal (image) content (#1277)
* fix: preserve multimodal content blocks when merging consecutive user messages

* test: consecutive user-message merge keeps multimodal image blocks
2026-06-03 01:21:57 +09:00
Afonso Coutinho
a04553013d fix: Anthropic responses with multiple text blocks lose all but the first (#1255)
* fix: concatenate all Anthropic text blocks, not just the first

* test: Anthropic response parsing concatenates text blocks
2026-06-03 00:57:20 +09:00
pewdiepie-archdaemon
ff93a6c63b Polish email and cookbook flows 2026-06-02 22:42:07 +09:00
SurprisedDuck
934bca9e48 Providers: omit temperature for OpenAI reasoning models
* fix: omit temperature for OpenAI reasoning models (o1/o3/o4/gpt-5)

These models only accept the default temperature; sending any explicit
value (even 0.0) returns HTTP 400 "Only the default (1) value is
supported". This broke two paths:

- Endpoint probing in _probe_single_model hardcodes temperature: 0.0, so
  a perfectly valid o3/gpt-5 endpoint is reported as failing in the
  Model Endpoints health check.
- Chat/stream payloads send temperature unconditionally, so a non-default
  temperature preset 400s on these models.

The code already special-cases the same model family for
max_completion_tokens, so this adds a sibling _restricts_temperature()
helper and omits the field for those models, letting the API use its
required default. gpt-4.5 is intentionally excluded (not a reasoning
model; accepts temperature normally).

Adds tests/test_llm_core_temperature.py covering the predicate and the
synchronous payload builder.

* fix: also omit temperature for reasoning models on the direct-POST paths

The first commit only covered llm_call/llm_call_async/stream_llm and the
endpoint probe. Email auto-summary, urgency-less spam classification, the
email reply-summary endpoint, and gallery vision tagging build their
OpenAI payloads inline and POST them directly (requests/httpx), bypassing
llm_core — so a reasoning model configured there would still 400 on the
temperature field. These sites already branch on _uses_max_completion_tokens,
so they're the same class; added the matching _restricts_temperature guard.

gallery_routes also gains the max_completion_tokens branch it was missing,
so gpt-5 vision tagging works end to end.

Note: email_pollers urgency scoring goes through llm_call_async and was
already covered.
2026-06-02 20:58:33 +09:00
Leo
6c15dc7d33 Chat metrics: surface backend generation speed
* Chat metrics: show backend's true generation t/s, not tokens÷wall-clock

The per-message tokens/sec read low and felt wrong because it was computed as
output_tokens / total_duration, where total_duration is wall-clock including
prefill, tool calls, and network — not pure decode time. llama.cpp already
reports the correct gen speed in its stream (timings.predicted_per_second), but
it was being dropped.

- llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the
  sibling `timings` block llama.cpp includes — pass predicted_per_second through
  as gen_tps and prompt_per_second as prefill_tps on the usage event.
- agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events;
  in _compute_final_metrics prefer backend_gen_tps over the wall-clock division
  when present (fall back to computed for cloud APIs that omit timings). Tag the
  result with tps_source ("backend" vs "computed") and surface prefill_tps.

Result: the displayed t/s now matches the model's real decode speed and is
stable regardless of prompt length (a long prefill no longer deflates it).

Checks: py_compile passes; verified extraction against a real llama.cpp final
chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Chat metrics: surface true t/s on the direct-chat path too

Follow-up to the gen-tps work: the non-agent direct-chat stream path in
chat_routes turned the raw `usage` event straight into a metrics event but only
copied token counts — it never set tokens_per_second or response_time. So simple
(non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell
back to a bare token count ("27 tok") instead of t/s.

Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in
the prior commit) into tokens_per_second here too, tag tps_source=backend, and
set response_time from wall-clock for the stats popup.

Checks: py_compile passes; verified llama.cpp emits usage+timings on the final
stream chunk (gen ~90 t/s) that this path consumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Tests: backend gen/prefill t/s passthrough and preference

Cover the two pieces of the true-t/s metric so it can be reviewed on its own:
- stream_llm surfaces llama.cpp's timings.predicted_per_second /
  prompt_per_second as gen_tps / prefill_tps on the usage event (captured
  llama.cpp final-chunk fixture), and omits them when the backend reports no
  timings.
- _compute_final_metrics prefers backend_gen_tps over output/wall-clock,
  tags tps_source ("backend" vs "computed"), and surfaces prefill_tps.

Reuses the fake-client stream harness from test_llm_core_streaming.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 20:52:08 +09:00
Tatlatat
e084dc993e Chat: merge consecutive user messages for strict providers
After a non-native tool round, the agent appends tool results as a {role:
'user'} message next to the user's original 'user' prompt, producing two
consecutive 'user' messages. Strict provider APIs (Anthropic/Claude) reject
consecutive same-role messages, so the follow-up generation request fails
silently — search returns sources, then nothing is generated.

_sanitize_llm_messages now merges consecutive 'user' messages (joining their
content). Only user/user is merged; normal chat and agent/tool turns already
alternate and are untouched.

Scoped down per maintainer review: the agent_loop 'output' source-extraction
change is already on main (#898/#901) and the broad-mocking web-sources test
was dropped. Added a focused test that runs consecutive-user messages through
the real _build_anthropic_payload and asserts the payload alternates correctly.
2026-06-02 20:44:13 +09:00
Ernest Hysa
a8a34bd22a Ollama: pass discovered num_ctx in chat requests
_build_ollama_payload sends options.temperature and options.num_predict
to /api/chat, but never options.num_ctx. Ollama defaults num_ctx to 2048
when the option is omitted, so prompts going to any Ollama backend are
silently truncated there regardless of the model's actual capability.

Thread the discovered context length through the three call sites
(llm_call, llm_call_async, stream_llm) and emit options.num_ctx when it
is known and positive. The builder filters out the DEFAULT_CONTEXT
fallback (128000) so we don't lie to Ollama about models whose window
we couldn't actually discover. The issue's literal 'when > 2048'
heuristic is dropped: a model with a real context smaller than 2048
would OOM if Ollama used its default, so we pass the real value
regardless of size. Matches how src/context_compactor.py uses the
same helper.

Sister fix to PR #753 — that PR teaches the compactor the right budget,
this one tells Ollama to actually use that budget on the way in.
2026-06-02 20:27:24 +09:00
nsgds
5645cce6d0 Support vLLM 0.20.2 / NIM reasoning-parser output end-to-end (surface + agent context + render) (#602)
* fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM

vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning` delta field; older builds use `reasoning_content`. stream_llm() read only the latter, so reasoning from models like Nemotron-3-Nano (--reasoning-parser) was silently dropped and never rendered. Accept either field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): keep reasoning_content only on the latest assistant turn

The agent loop echoed each round's reasoning back as `reasoning_content` on every assistant turn, assuming vendors ignore it. Nemotron's chat template re-injects ALL prior reasoning_content as <think> blocks, and the loop is trimmed only once (before it starts) — so reasoning accumulated unbounded across rounds, bloating context and feeding the model its own prior reasoning, which reinforced repetition/looping. Strip reasoning_content from earlier assistant turns so only the most recent round carries it (still satisfies DeepSeek's thinking-mode follow-up requirement).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent-ui): wrap each round's reasoning in its own <think> block

The streamed think-tag wrapper gated on whole-message substring checks (accumulated.includes('<think>')), which only ever wrapped ONE reasoning block per message. A multi-round agent response has a reasoning phase per round, so once round 1 closed its <think>...</think>, rounds 2+ reasoning was emitted unwrapped and leaked into the visible answer. Replace the substring checks with a stateful open/close flag that toggles per think/answer cycle, so each round's reasoning gets its own collapsible block. Single-turn chat is unchanged (one open, one close).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(stream): reasoning/reasoning_content delta surfaces as thinking chunk

Covers @pewdiepie-archdaemon's requested regression: a streamed {reasoning: ...} delta emits a thinking chunk while {content: ...} streams as normal content; plus the older reasoning_content field for backward compat. Mirrors the #591 scenario.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:48:17 +09:00
James Arslan
a327df6936 Fix native tool-calling follow-up round on Gemini and Ollama (#867)
The agent's multi-round (tool-result) follow-up request was rejected with
HTTP 400 on two providers, so tools ran but the agent never produced an answer:

- OpenAI-compatible streaming (Gemini 3) dropped the per-call thought_signature
  and collided parallel tool calls, which arrive with index=None: they all
  landed in slot 0, overwriting the first call's name and corrupting its
  arguments by concatenation, so the follow-up request 400'd. Capture and replay
  each call's extra_content (thought_signature), and give every parallel call
  its own accumulator slot (allocated above the max key, so sparse or mixed
  indices can't collide).
- Native Ollama /api/chat expects object tool-call arguments, but Odysseus
  carries them as a JSON string, which Ollama rejected ("Value looks like
  object, but can't find closing '}' symbol"). Convert them to objects in the
  Ollama payload builder.

Both compose with the no-prose null-content sanitize fix from #862.

Tested: python -m pytest tests/test_llm_core_streaming.py
tests/test_llm_core_ollama.py tests/test_agent_loop.py (53 pass), and
python -m py_compile src/llm_core.py src/agent_loop.py.
2026-06-02 11:39:40 +09:00
James Arslan
6776c7d691 Surface silent model fallback instead of masking it (#868)
When the selected model fails before producing output, stream_llm_with_fallback
quietly switches to the next candidate and the reply is shown under the
originally selected model's name, so a misconfigured provider looks like it
works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request
appears fine because another model silently answers under the Claude label.)

Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first
time a non-primary candidate produces output, forward it through the agent loop
and both chat-route paths, stamp the response metrics with the model that
actually answered, and show a notice + relabel the reply in the UI.

Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass);
python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py;
node --check static/js/chat.js.
2026-06-02 11:37:25 +09:00
mist
1007703223 Keep no-prose assistant tool-call messages through _sanitize_llm_messages (#862)
cb13d09 made _append_tool_results emit content=None (JSON null) for a follow-up
assistant message that carries only tool_calls and no prose, because Gemini's
OpenAI-compatible endpoint and Ollama reject tool_calls alongside an
empty-string content with HTTP 400.

But _sanitize_llm_messages strips None values and then required "content" on
every message, so it dropped that assistant message entirely — leaving the
role:"tool" result dangling with no parent tool_calls, which breaks the
follow-up round for every provider (and regresses ones that accepted "" before,
since the message is now removed rather than sent). cb13d09's tests covered
_append_tool_results in isolation, so the sanitizer interaction was uncaught.

Make the sanitizer role-aware: assistant messages survive with content OR
tool_calls, and a tool-calls-only assistant message gets an explicit
content=None re-added so the provider receives spec-correct `content: null`.
tool messages still require content + tool_call_id; user/system still require
content.

Adds tests/test_llm_core_sanitize_tool_calls.py, which drives the real producer
(_append_tool_results) into the sanitizer and asserts the assistant tool-call
message survives with its tool result paired. Red before this change, green
after.
2026-06-02 11:17:22 +09:00
Ethan
fd04ad353d Add Anthropic prompt caching to the agent loop (#812)
Send `system` as a structured text block with an ephemeral cache_control
breakpoint and cache the last tool schema, so multi-round agent runs read
the stable system+tools prefix from cache instead of re-billing it. Gate
the system breakpoint so tiny tool-less prompts skip the cache-write
premium. Log cache_read/creation tokens at message_start.

Fixes #791

Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
2026-06-02 11:14:31 +09:00
LittleLlama
54ecfa39cf Provider detection: match by hostname instead of substring (re #768) (#815)
* Dedupe URL routing helpers and tighten adjacent hostname checks

* Match providers by hostname, not substring, in _detect_provider

_detect_provider used `"anthropic.com" in url`-style substring checks, so a URL
that merely contained a provider's domain in its path or query — or a look-alike
host like `anthropic.com.example` — was misclassified and picked the wrong
auth-header/payload shape. Switch it to the existing `_host_match` helper
(hostname exact/subdomain match), the same way the human-readable labels and
curated model lists already work, finishing that migration. Also harden
`_host_match` against trailing-dot FQDNs.

Not a credential-leak fix: _detect_provider only classifies a URL the admin
already configured next to its key, and the URL — not this function — decides
where the request goes. This is a correctness/consistency cleanup.

Adds tests that import the real helpers (test_endpoint_resolver.py tests local
copies, so it can't catch this) covering the substring false-positives.

Refs #768.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Import build_headers under its real name in model_routes

It was imported as `build_headers as _provider_headers`, which collides with
the unrelated llm_core._provider_headers(provider, headers) — same name,
different signature. Use the real name to remove the confusion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Use hostname matching in URL builders, not raw suffix checks

PR review flagged that _detect_provider() was hardened to match on
hostname, but several helpers still used raw host.endswith("anthropic.com")
/ host.endswith("ollama.com"), which match adjacent hosts like
notanthropic.com / notollama.com.

Route the remaining checks through _host_match(): _is_ollama_native_url
and _ollama_api_root in llm_core, and _anthropic_api_root / _ollama_api_root
in endpoint_resolver. With _detect_provider already hostname-correct, the
trailing "or host.endswith(...)" clauses in build_chat_url / build_models_url
are redundant, so drop them rather than fix the substring match in place.

Add builder-level tests asserting look-alike and domain-in-path hosts route
to the OpenAI-compatible default. They import the real builders and fail on
the pre-fix code.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 11:11:17 +09:00
SurprisedDuck
7268c49992 Make LLM host health maps thread-safe
The synchronous llm_call() runs in FastAPI's threadpool (sync route
handlers such as POST /sessions/auto-sort), while llm_call_async() runs
on the event loop. Both mutate the module-level _response_cache,
_host_fails and _dead_hosts dicts, so these are touched from multiple OS
threads concurrently. Two races result:

- _set_cached_response() snapshots 64 keys then deletes them with
  `del _response_cache[key]`; if another thread evicts the same key
  first, the del raises KeyError mid-eviction. Switched to
  pop(key, None).
- _mark_host_dead() does get()+1+set() on _host_fails with no lock, so
  concurrent connect failures lose increments and a genuinely dead host
  can stay under its cooldown threshold. Guarded the host-health maps
  with a threading.Lock (also applied to _is_host_dead / _clear_host_dead
  for consistent reads).

Adds tests/test_llm_core_concurrency.py with deterministic regression
tests (phantom snapshot key for the eviction race; a slow-read dict that
forces the lost-update window for the counter). Both fail on the
unpatched code and pass with the fix.
2026-06-02 05:54:23 +09:00
Areon Lundkvist
f853a3fc67 Harden streaming deltas against null payloads 2026-06-01 23:09:17 +09:00
Alexander Kenley
2c4b8b57dd feat(ai): add OpenRouter and Ollama Cloud providers (#231)
Co-authored-by: Alex Kenley <Alex.Kenley@threatvectorsecurity.com>
2026-06-01 14:26:10 +09:00
pewdiepie-archdaemon
d9d95b4855 Improve OpenRouter and Groq provider requests 2026-06-01 10:32:14 +09:00
pewdiepie-archdaemon
d026e13a5a Fix provider setup and strip message metadata 2026-06-01 10:20:18 +09:00
pewdiepie-archdaemon
fc7f107b22 Improve Ollama setup and model endpoint handling 2026-06-01 10:00:15 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00