* fix(llm): auto-detect <think> in content stream for unregistered thinking models
_THINKING_MODEL_PATTERNS only covers known model families by name. Qwen3-derived
models with non-standard names (e.g. Qwopus, custom QwQ forks) are not matched,
so their <think>...</think> content streams through as visible chat text instead
of being routed to the thinking display.
When the first content delta opens with <think> and the model was not already
identified as a thinking model, dynamically flag the stream as a thinking model
for the remainder of the response. This enables the existing </think> repair path
(the code directly below the new check) and ensures the frontend receives the
full <think>...</think> wrapper it needs to split thinking from the final answer.
The check is restricted to the very first content delta (_first_content_sent is
False) to avoid misidentifying models that happen to write "<think>" mid-answer.
Fixes #2225
Related: #2420 (covered by separate PR from @AmmarS-Analyst), #2224 (@RaresKeY)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(llm): replace inert _thinking_model flag with _in_think_tag state machine
The original auto-detect set _thinking_model=True on the first <think> chunk
but still emitted it as a regular delta and set _first_content_sent=True
immediately, so no subsequent chunk could enter the repair path.
Replace with _in_think_tag bool: enter thinking mode when first content starts
with <think>, route all chunks to the thinking channel until </think> is found,
then the tail becomes the first regular delta. Adds three regression tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(llm): replace _first_content_sent guard with _think_open_stripped
Opening-tag stripping used `not _first_content_sent` as the guard, but
_first_content_sent stays False throughout the entire think block (it only
flips when regular content is emitted). So `find(">")` ran on every
reasoning chunk — not just the first — and silently truncated everything
before the first ">" in any reasoning text containing comparisons, arrows,
or code.
Fix: add `_think_open_stripped = False` alongside `_in_think_tag`. Use it
as the strip guard in both the "still inside <think>" path and the
"</think> found in same chunk" split path. Set it True once the opening
tag is consumed so all subsequent chunks reach the thinking channel
unmolested.
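For reviewers, the routing described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in stream_llm(): the function name `route_chunks` and the (channel, text) tuple shape are hypothetical, the opening tag is stripped by length rather than with `find(">")`, and tags split across chunk boundaries are not handled.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def route_chunks(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (channel, text) pairs; channel is "thinking" or "content".

    Hypothetical helper mirroring the flags named in this commit.
    """
    in_think_tag = False
    think_open_stripped = False
    first_content_sent = False

    for chunk in chunks:
        # Only the very first content delta may open a think block.
        if (
            not first_content_sent
            and not think_open_stripped
            and chunk.startswith(OPEN_TAG)
        ):
            in_think_tag = True

        if not in_think_tag:
            if chunk:
                first_content_sent = True
                yield ("content", chunk)
            continue

        # Strip the opening tag exactly once. Guarding on first_content_sent
        # here would re-run the strip on every reasoning chunk.
        if not think_open_stripped:
            chunk = chunk[len(OPEN_TAG):]
            think_open_stripped = True

        if CLOSE_TAG in chunk:
            thinking, tail = chunk.split(CLOSE_TAG, 1)
            in_think_tag = False
            if thinking:
                yield ("thinking", thinking)
            if tail:
                # The tail becomes the first regular delta.
                first_content_sent = True
                yield ("content", tail)
        elif chunk:
            yield ("thinking", chunk)
```

With this shape, a reasoning chunk that contains ">" passes through whole, because the strip guard no longer depends on `first_content_sent`.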
Add regression test: 3-chunk stream where the middle chunk contains
"c > d" — confirms "more c " is not dropped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM
vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning` delta field; older builds use `reasoning_content`. stream_llm() read only the latter, so reasoning from models like Nemotron-3-Nano (--reasoning-parser) was silently dropped and never rendered. Accept either field.
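A minimal sketch of the "accept either field" lookup, assuming the parsed SSE delta is available as a dict. The helper name `extract_reasoning` is hypothetical, and preferring `reasoning` over `reasoning_content` when both are present is an assumption rather than something this commit specifies.

```python
from typing import Any, Mapping, Optional


def extract_reasoning(delta: Mapping[str, Any]) -> Optional[str]:
    """Return reasoning text from a streamed delta, or None if absent.

    Hypothetical helper: checks the newer `reasoning` key first, then the
    older `reasoning_content` key.
    """
    for key in ("reasoning", "reasoning_content"):
        value = delta.get(key)
        if value:
            return value
    return None
```

A delta that carries only `content` returns None here, so normal content handling is unaffected.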
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agent): keep reasoning_content only on the latest assistant turn
The agent loop echoed each round's reasoning back as `reasoning_content` on every assistant turn, assuming vendors ignore it. Nemotron's chat template, however, re-injects ALL prior reasoning_content as <think> blocks, and trimming happens only once, before the loop starts. Reasoning therefore accumulated without bound across rounds, bloating the context and feeding the model its own prior reasoning, which reinforced repetition and looping. Strip reasoning_content from earlier assistant turns so that only the most recent round carries it; this still satisfies DeepSeek's thinking-mode follow-up requirement.
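The stripping step might look roughly like this, assuming OpenAI-style message dicts with a `role` key. The function name `keep_latest_reasoning` is hypothetical, and returning a new list instead of mutating in place is a choice made for the sketch, not necessarily what the agent loop does.

```python
from typing import Any, Dict, List


def keep_latest_reasoning(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop `reasoning_content` from every assistant turn except the last.

    Hypothetical helper; returns new dicts for the turns it changes and
    leaves the input list untouched.
    """
    last_assistant = max(
        (i for i, m in enumerate(messages) if m.get("role") == "assistant"),
        default=-1,
    )
    cleaned: List[Dict[str, Any]] = []
    for i, message in enumerate(messages):
        if message.get("role") == "assistant" and i != last_assistant:
            message = {k: v for k, v in message.items() if k != "reasoning_content"}
        cleaned.append(message)
    return cleaned
```

Running this before each round keeps the reasoning payload bounded at one round's worth, whatever the chat template does with it.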
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agent-ui): wrap each round's reasoning in its own <think> block
The streamed think-tag wrapper gated on whole-message substring checks (accumulated.includes('<think>')), which only ever wrapped ONE reasoning block per message. A multi-round agent response has a reasoning phase per round, so once round 1 closed its <think>...</think>, reasoning from rounds 2+ was emitted unwrapped and leaked into the visible answer. Replace the substring checks with a stateful open/close flag that toggles per think/answer cycle, so each round's reasoning gets its own collapsible block. Single-turn chat is unchanged (one open, one close).
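The toggle logic can be illustrated with a short sketch. The real wrapper lives in the frontend JavaScript; this is a Python rendition for illustration only. The `wrap_reasoning` name, the (kind, text) event shape, and closing a still-open block at end of stream are all assumptions.

```python
from typing import Iterable, List, Tuple


def wrap_reasoning(events: Iterable[Tuple[str, str]]) -> str:
    """Wrap each run of "thinking" events in its own <think>...</think>.

    Hypothetical helper: a per-stream flag replaces the whole-message
    substring check, so later rounds reopen the tag.
    """
    out: List[str] = []
    think_open = False
    for kind, text in events:
        if kind == "thinking":
            if not think_open:
                out.append("<think>")
                think_open = True
        elif think_open:
            # First non-thinking event after reasoning closes the block.
            out.append("</think>")
            think_open = False
        out.append(text)
    if think_open:
        # Assumption: close a dangling block when the stream ends.
        out.append("</think>")
    return "".join(out)
```

A two-round stream of thinking, content, thinking, content yields two separate wrapped blocks, while a single-turn stream still produces one open and one close.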
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(stream): reasoning/reasoning_content delta surfaces as thinking chunk
Covers the regression test requested by @pewdiepie-archdaemon: a streamed {reasoning: ...} delta emits a thinking chunk while {content: ...} streams as normal content. Also covers the older reasoning_content field for backward compatibility. Mirrors the #591 scenario.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>