Support vLLM 0.20.2 / NIM reasoning-parser output end-to-end (surface + agent context + render) (#602)
* fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM

  vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning`
  delta field; older builds use `reasoning_content`. stream_llm() read only
  the latter, so reasoning from models like Nemotron-3-Nano
  (--reasoning-parser) was silently dropped and never rendered. Accept
  either field.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): keep reasoning_content only on the latest assistant turn

  The agent loop echoed each round's reasoning back as `reasoning_content`
  on every assistant turn, assuming vendors ignore it. Nemotron's chat
  template re-injects ALL prior reasoning_content as <think> blocks, and the
  loop is trimmed only once (before it starts), so reasoning accumulated
  unbounded across rounds, bloating context and feeding the model its own
  prior reasoning, which reinforced repetition/looping. Strip
  reasoning_content from earlier assistant turns so only the most recent
  round carries it (still satisfies DeepSeek's thinking-mode follow-up
  requirement).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent-ui): wrap each round's reasoning in its own <think> block

  The streamed think-tag wrapper gated on whole-message substring checks
  (accumulated.includes('<think>')), which only ever wrapped ONE reasoning
  block per message. A multi-round agent response has a reasoning phase per
  round, so once round 1 closed its <think>...</think>, the reasoning from
  rounds 2+ was emitted unwrapped and leaked into the visible answer.
  Replace the substring checks with a stateful open/close flag that toggles
  per think/answer cycle, so each round's reasoning gets its own collapsible
  block. Single-turn chat is unchanged (one open, one close).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(stream): reasoning/reasoning_content delta surfaces as thinking chunk

  Covers @pewdiepie-archdaemon's requested regression: a streamed
  {reasoning: ...} delta emits a thinking chunk while {content: ...} streams
  as normal content; plus the older reasoning_content field for backward
  compat. Mirrors the #591 scenario.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
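The stream and agent-ui fixes described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's code: the helper names `extract_reasoning` and `wrap_think` and the `(kind, text)` chunk shape are hypothetical, and the real wrapper lives in the UI layer rather than in Python.

```python
from typing import Iterable, Iterator, Optional, Tuple


def extract_reasoning(delta: dict) -> Optional[str]:
    """Return reasoning text from a streamed delta, whichever field carries it.

    Hypothetical helper: newer servers put it in `reasoning`, older builds
    in `reasoning_content`. Returns None when neither field is present.
    """
    return delta.get("reasoning") or delta.get("reasoning_content")


def wrap_think(chunks: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Wrap each run of thinking chunks in its own <think>...</think> block.

    `chunks` yields (kind, text) pairs, where kind is "thinking" or
    "content" (names assumed for this sketch).
    """
    think_open = False  # stateful flag, toggled once per think/answer cycle
    for kind, text in chunks:
        if kind == "thinking":
            if not think_open:
                yield "<think>"
                think_open = True
            yield text
        else:
            if think_open:
                yield "</think>"
                think_open = False
            yield text
    if think_open:
        # Close a block left open when the stream ends mid-reasoning.
        yield "</think>"
```

Because the flag tracks the current cycle instead of scanning the whole accumulated message, a second reasoning phase opens a fresh block even though an earlier `<think>` already appears in the output.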
@@ -1101,8 +1101,20 @@ def _append_tool_results(
     `round_reasoning` (DeepSeek / vLLM reasoning-parser deltas) is echoed
     back via `reasoning_content` on the assistant message — DeepSeek's API
     rejects follow-up requests in thinking mode that don't include the
-    prior reasoning. Other vendors ignore the extra field.
+    prior reasoning.
+
+    NOTE: it is NOT universally ignored. Nemotron's chat template re-injects
+    EVERY prior `reasoning_content` as a <think> block, and this agent loop is
+    trimmed only once (before the loop), so across rounds the reasoning piles
+    up unbounded — bloating context and feeding the model its own prior
+    reasoning, which reinforces repetition/looping. So keep reasoning_content
+    on the MOST RECENT assistant turn only: enough for DeepSeek continuity,
+    without the per-round accumulation.
     """
+    # Strip reasoning_content from earlier assistant turns; only the newest keeps it.
+    for _m in messages:
+        if _m.get("role") == "assistant":
+            _m.pop("reasoning_content", None)
     if used_native and native_tool_calls:
         assistant_msg = {"role": "assistant"}
         # When the model emitted ONLY tool calls (no prose), content must be
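
The effect of the stripping loop in this hunk can be checked in isolation. The sketch below is an assumption-laden stand-in, not the body of `_append_tool_results`: `append_assistant_turn` and its parameters are hypothetical, and only the strip-then-append order is taken from the diff.

```python
from typing import Dict, List, Optional


def append_assistant_turn(
    messages: List[Dict],
    content: str,
    round_reasoning: Optional[str],
) -> None:
    """Append an assistant turn so only the newest turn carries reasoning.

    Hypothetical helper mirroring the hunk: strip `reasoning_content` from
    every earlier assistant message, then attach it to the new one.
    """
    for m in messages:
        if m.get("role") == "assistant":
            m.pop("reasoning_content", None)

    assistant_msg = {"role": "assistant", "content": content}
    if round_reasoning:
        assistant_msg["reasoning_content"] = round_reasoning
    messages.append(assistant_msg)


# Simulate three agent rounds with a tool result between each.
history: List[Dict] = [{"role": "user", "content": "question"}]
for i in range(3):
    append_assistant_turn(history, f"step {i}", f"thoughts {i}")
    history.append({"role": "tool", "content": f"result {i}"})

carrying = [m for m in history if "reasoning_content" in m]
```

After the three rounds, `carrying` holds a single message, the latest assistant turn, so a chat template that re-injects every `reasoning_content` sees one reasoning block instead of three.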