Support vLLM 0.20.2 / NIM reasoning-parser output end-to-end (surface + agent context + render) (#602)

* fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM

vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning` delta field; older builds use `reasoning_content`. stream_llm() read only the latter, so reasoning from models like Nemotron-3-Nano (--reasoning-parser) was silently dropped and never rendered. Accept either field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
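
A minimal JavaScript sketch of the "accept either field" rule described above. The real fix lives in stream_llm(), whose code is not shown here, so `extractReasoning`, `toChunks`, and the chunk shape `{ thinking, delta }` are hypothetical names chosen to mirror the `json.thinking` / `json.delta` fields that appear in the UI diff.

```javascript
// Hypothetical helper (not the project's actual code): pull the reasoning text
// out of one streamed delta. `reasoning` (vLLM 0.20.2 / NIM) is tried first,
// `reasoning_content` (older builds) is the fallback. Returns '' if neither
// carries a non-empty string.
function extractReasoning(delta) {
  if (!delta) return '';
  const text = delta.reasoning || delta.reasoning_content;
  return typeof text === 'string' ? text : '';
}

// Hypothetical classifier: turn one delta into zero or more UI chunks, with
// reasoning flagged as `thinking` and ordinary content passed through.
function toChunks(delta) {
  const chunks = [];
  const reasoning = extractReasoning(delta);
  if (reasoning) chunks.push({ thinking: true, delta: reasoning });
  if (delta && typeof delta.content === 'string' && delta.content) {
    chunks.push({ thinking: false, delta: delta.content });
  }
  return chunks;
}
```

Using `||` rather than `??` means an empty `reasoning` string still falls through to `reasoning_content`; whether any server actually sends both fields at once is not something the commit message states.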

* fix(agent): keep reasoning_content only on the latest assistant turn

The agent loop echoed each round's reasoning back as `reasoning_content` on every assistant turn, assuming vendors ignore it. Nemotron's chat template re-injects ALL prior reasoning_content as <think> blocks, and the message history is trimmed only once, before the loop starts. Reasoning therefore accumulated without bound across rounds, bloating context and feeding the model its own prior reasoning, which reinforced repetition/looping. Strip reasoning_content from earlier assistant turns so only the most recent round carries it (this still satisfies DeepSeek's thinking-mode follow-up requirement).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
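
The stripping rule above can be sketched in JavaScript as follows. This is an illustration only: the agent loop's real code is not part of the diff shown here, `keepLatestReasoning` is a hypothetical name, and the OpenAI-style message shape (`role`, `content`, `reasoning_content`) is an assumption.

```javascript
// Hypothetical helper (not the project's actual code): return a copy of
// `messages` in which only the LAST assistant message keeps its
// `reasoning_content`. The input array and its objects are not mutated.
function keepLatestReasoning(messages) {
  // Find the index of the most recent assistant turn, if any.
  let lastAssistant = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === 'assistant') { lastAssistant = i; break; }
  }
  return messages.map((msg, i) => {
    // Non-assistant turns and the latest assistant turn pass through as-is.
    if (msg.role !== 'assistant' || i === lastAssistant) return msg;
    if (!('reasoning_content' in msg)) return msg;
    // Drop the field from earlier assistant turns, keep everything else.
    const { reasoning_content, ...rest } = msg;
    return rest;
  });
}
```

Applied once per round before the request is built, a helper like this would keep at most one round of reasoning in the context regardless of how many rounds the loop runs.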

* fix(agent-ui): wrap each round's reasoning in its own <think> block

The streamed think-tag wrapper gated on whole-message substring checks (accumulated.includes('<think>')), so it only ever wrapped ONE reasoning block per message. A multi-round agent response has a reasoning phase per round, so once round 1 closed its <think>...</think>, reasoning from rounds 2+ was emitted unwrapped and leaked into the visible answer. Replace the substring checks with a stateful open/close flag that toggles per think/answer cycle, so each round's reasoning gets its own collapsible block. Single-turn chat is unchanged (one open, one close).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(stream): reasoning/reasoning_content delta surfaces as thinking chunk

Covers @pewdiepie-archdaemon's requested regression: a streamed {reasoning: ...} delta emits a thinking chunk while {content: ...} streams as normal content; plus the older reasoning_content field for backward compat. Mirrors the #591 scenario.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nsgds
2026-06-02 10:48:17 +08:00
committed by GitHub
parent a857d2016d
commit 5645cce6d0
4 changed files with 124 additions and 7 deletions


@@ -512,6 +512,10 @@ import createResearchSynapse from './researchSynapse.js';
 // Declare accumulated outside try block so it's accessible in catch
 let accumulated = '';
+// Are we currently inside an unclosed <think> block? Toggled per think/answer
+// cycle so a multi-round agent response (one reasoning phase PER round) wraps each
+// round's reasoning in its own <think>…</think> instead of leaking rounds 2+ as text.
+let _thinkOpen = false;
 let holder = null;
 let finalMeta = null;
 let finalModelName = null;
@@ -1357,12 +1361,15 @@ import createResearchSynapse from './researchSynapse.js';
 if (_threadAbove && _threadAbove.classList.contains('agent-thread') && !_threadAbove.classList.contains('has-bottom')) {
 _threadAbove.classList.add('has-bottom');
 }
-// VLLM reasoning tokens: wrap in <think> tags for the thinking UI
+// VLLM reasoning tokens: wrap in <think> tags for the thinking UI.
+// Stateful open/close (not a whole-message substring check) so each round
+// of a multi-round agent response gets its own <think>…</think> — otherwise
+// only round 1 is wrapped and rounds 2+ reasoning leaks into the answer.
 let _delta = json.delta;
 if (json.thinking) {
-if (!accumulated.includes('<think>')) _delta = '<think>' + _delta;
-} else if (accumulated.includes('<think>') && !accumulated.includes('</think>')) {
-_delta = '</think>' + _delta;
+if (!_thinkOpen) { _delta = '<think>' + _delta; _thinkOpen = true; }
+} else if (_thinkOpen) {
+_delta = '</think>' + _delta; _thinkOpen = false;
 }
 const wasEmpty = !accumulated;
 accumulated += _delta;
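
To see the stateful wrapping in isolation, the sketch below lifts the same open/close logic into a standalone function and feeds it a two-round stream. `createThinkWrapper` and the sample chunks are hypothetical; only the `thinking` / `delta` field names and the toggle logic come from the hunk.

```javascript
// Hypothetical factory (not the project's actual code): returns a function
// that wraps streamed chunks, holding the "inside an unclosed <think>" flag
// in a closure instead of re-scanning the accumulated message.
function createThinkWrapper() {
  let thinkOpen = false;
  return function wrap(chunk) {
    let delta = chunk.delta;
    if (chunk.thinking) {
      // First reasoning chunk of a round opens a new block.
      if (!thinkOpen) { delta = '<think>' + delta; thinkOpen = true; }
    } else if (thinkOpen) {
      // First answer chunk after reasoning closes the open block.
      delta = '</think>' + delta; thinkOpen = false;
    }
    return delta;
  };
}

// Two agent rounds, each with a reasoning phase followed by answer text.
const wrap = createThinkWrapper();
const chunks = [
  { thinking: true, delta: 'plan A' },
  { thinking: false, delta: 'answer 1 ' },
  { thinking: true, delta: 'plan B' },
  { thinking: false, delta: 'answer 2' },
];
const accumulated = chunks.map(wrap).join('');
console.log(accumulated);
// prints "<think>plan A</think>answer 1 <think>plan B</think>answer 2"
```

With the earlier substring checks, the third chunk would have been appended without an opening tag because the accumulated text already contained `<think>`. Note that this sketch, like the hunk as shown, leaves the block unclosed if a stream ends while still in a reasoning phase.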