_build_ollama_payload sends options.temperature and options.num_predict
to /api/chat, but never options.num_ctx. Ollama defaults num_ctx to 2048
when the option is omitted, so prompts going to any Ollama backend are
silently truncated there regardless of the model's actual capability.
Thread the discovered context length through the three call sites
(llm_call, llm_call_async, stream_llm) and emit options.num_ctx when it
is known and positive. The builder filters out the DEFAULT_CONTEXT
fallback (128000) so we don't lie to Ollama about models whose window
we couldn't actually discover. The issue's literal 'when > 2048'
heuristic is dropped: a model with a real context smaller than 2048
would OOM if Ollama used its default, so we pass the real value
regardless of size. Matches how src/context_compactor.py uses the
same helper.
Sister fix to PR #753 — that PR teaches the compactor the right budget,
this one tells Ollama to actually use that budget on the way in.
* fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM
vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning` delta field; older builds use `reasoning_content`. stream_llm() read only the latter, so reasoning from models like Nemotron-3-Nano (--reasoning-parser) was silently dropped and never rendered. Accept either field.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agent): keep reasoning_content only on the latest assistant turn
The agent loop echoed each round's reasoning back as `reasoning_content` on every assistant turn, assuming vendors ignore it. Nemotron's chat template re-injects ALL prior reasoning_content as <think> blocks, and the loop is trimmed only once (before it starts) — so reasoning accumulated unbounded across rounds, bloating context and feeding the model its own prior reasoning, which reinforced repetition/looping. Strip reasoning_content from earlier assistant turns so only the most recent round carries it (still satisfies DeepSeek's thinking-mode follow-up requirement).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agent-ui): wrap each round's reasoning in its own <think> block
The streamed think-tag wrapper gated on whole-message substring checks (accumulated.includes('<think>')), which only ever wrapped ONE reasoning block per message. A multi-round agent response has a reasoning phase per round, so once round 1 closed its <think>...</think>, rounds 2+ reasoning was emitted unwrapped and leaked into the visible answer. Replace the substring checks with a stateful open/close flag that toggles per think/answer cycle, so each round's reasoning gets its own collapsible block. Single-turn chat is unchanged (one open, one close).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(stream): reasoning/reasoning_content delta surfaces as thinking chunk
Covers @pewdiepie-archdaemon's requested regression: a streamed {reasoning: ...} delta emits a thinking chunk while {content: ...} streams as normal content; plus the older reasoning_content field for backward compat. Mirrors the #591 scenario.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent's multi-round (tool-result) follow-up request was rejected with
HTTP 400 on two providers, so tools ran but the agent never produced an answer:
- OpenAI-compatible streaming (Gemini 3) dropped the per-call thought_signature
and collided parallel tool calls, which arrive with index=None: they all
landed in slot 0, overwriting the first call's name and corrupting its
arguments by concatenation, so the follow-up request 400'd. Capture and replay
each call's extra_content (thought_signature), and give every parallel call
its own accumulator slot (allocated above the max key, so sparse or mixed
indices can't collide).
- Native Ollama /api/chat expects object tool-call arguments, but Odysseus
carries them as a JSON string, which Ollama rejected ("Value looks like
object, but can't find closing '}' symbol"). Convert them to objects in the
Ollama payload builder.
Both compose with the no-prose null-content sanitize fix from #862.
Tested: python -m pytest tests/test_llm_core_streaming.py
tests/test_llm_core_ollama.py tests/test_agent_loop.py (53 pass), and
python -m py_compile src/llm_core.py src/agent_loop.py.
When the selected model fails before producing output, stream_llm_with_fallback
quietly switches to the next candidate and the reply is shown under the
originally selected model's name, so a misconfigured provider looks like it
works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request
appears fine because another model silently answers under the Claude label.)
Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first
time a non-primary candidate produces output, forward it through the agent loop
and both chat-route paths, stamp the response metrics with the model that
actually answered, and show a notice + relabel the reply in the UI.
Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass);
python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py;
node --check static/js/chat.js.
cb13d09 made _append_tool_results emit content=None (JSON null) for a follow-up
assistant message that carries only tool_calls and no prose, because Gemini's
OpenAI-compatible endpoint and Ollama reject tool_calls alongside an
empty-string content with HTTP 400.
But _sanitize_llm_messages strips None values and then required "content" on
every message, so it dropped that assistant message entirely — leaving the
role:"tool" result dangling with no parent tool_calls, which breaks the
follow-up round for every provider (and regresses ones that accepted "" before,
since the message is now removed rather than sent). cb13d09's tests covered
_append_tool_results in isolation, so the sanitizer interaction was uncaught.
Make the sanitizer role-aware: assistant messages survive with content OR
tool_calls, and a tool-calls-only assistant message gets an explicit
content=None re-added so the provider receives spec-correct `content: null`.
tool messages still require content + tool_call_id; user/system still require
content.
Adds tests/test_llm_core_sanitize_tool_calls.py, which drives the real producer
(_append_tool_results) into the sanitizer and asserts the assistant tool-call
message survives with its tool result paired. Red before this change, green
after.
Send `system` as a structured text block with an ephemeral cache_control
breakpoint and cache the last tool schema, so multi-round agent runs read
the stable system+tools prefix from cache instead of re-billing it. Gate
the system breakpoint so tiny tool-less prompts skip the cache-write
premium. Log cache_read/creation tokens at message_start.
Fixes#791
Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
* Dedupe URL routing helpers and tighten adjacent hostname checks
* Match providers by hostname, not substring, in _detect_provider
_detect_provider used `"anthropic.com" in url`-style substring checks, so a URL
that merely contained a provider's domain in its path or query — or a look-alike
host like `anthropic.com.example` — was misclassified and picked the wrong
auth-header/payload shape. Switch it to the existing `_host_match` helper
(hostname exact/subdomain match), the same way the human-readable labels and
curated model lists already work, finishing that migration. Also harden
`_host_match` against trailing-dot FQDNs.
Not a credential-leak fix: _detect_provider only classifies a URL the admin
already configured next to its key, and the URL — not this function — decides
where the request goes. This is a correctness/consistency cleanup.
Adds tests that import the real helpers (test_endpoint_resolver.py tests local
copies, so it can't catch this) covering the substring false-positives.
Refs #768.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Import build_headers under its real name in model_routes
It was imported as `build_headers as _provider_headers`, which collides with
the unrelated llm_core._provider_headers(provider, headers) — same name,
different signature. Use the real name to remove the confusion.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Use hostname matching in URL builders, not raw suffix checks
PR review flagged that _detect_provider() was hardened to match on
hostname, but several helpers still used raw host.endswith("anthropic.com")
/ host.endswith("ollama.com"), which match adjacent hosts like
notanthropic.com / notollama.com.
Route the remaining checks through _host_match(): _is_ollama_native_url
and _ollama_api_root in llm_core, and _anthropic_api_root / _ollama_api_root
in endpoint_resolver. With _detect_provider already hostname-correct, the
trailing "or host.endswith(...)" clauses in build_chat_url / build_models_url
are redundant, so drop them rather than fix the substring match in place.
Add builder-level tests asserting look-alike and domain-in-path hosts route
to the OpenAI-compatible default. They import the real builders and fail on
the pre-fix code.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The synchronous llm_call() runs in FastAPI's threadpool (sync route
handlers such as POST /sessions/auto-sort), while llm_call_async() runs
on the event loop. Both mutate the module-level _response_cache,
_host_fails and _dead_hosts dicts, so these are touched from multiple OS
threads concurrently. Two races result:
- _set_cached_response() snapshots 64 keys then deletes them with
`del _response_cache[key]`; if another thread evicts the same key
first, the del raises KeyError mid-eviction. Switched to
pop(key, None).
- _mark_host_dead() does get()+1+set() on _host_fails with no lock, so
concurrent connect failures lose increments and a genuinely dead host
can stay under its cooldown threshold. Guarded the host-health maps
with a threading.Lock (also applied to _is_host_dead / _clear_host_dead
for consistent reads).
Adds tests/test_llm_core_concurrency.py with deterministic regression
tests (phantom snapshot key for the eviction race; a slow-read dict that
forces the lost-update window for the counter). Both fail on the
unpatched code and pass with the fix.