cookbook agent debug loop: persistent log files, auto-adopt orphan tmux, Codex/Claude skill parity

Three converging fixes so the chat agent + external Codex/Claude skills can actually debug a crashed serve instead of staring at a post-crash neofetch banner:

* Serves now `tee` to /tmp/odysseus-tmux/SESSION.log on the host running them. Runner saves fds 3/4 before the tee and restores them right before `exec ${SHELL}`, so the post-crash interactive zsh banner does NOT pollute the log file.
* `tail_serve_output` (chat agent) and `/api/codex/cookbook/output/{sid}` (Codex+Claude skills) both prefer the persistent log file over the tmux pane. Pane is fallback for sessions predating the tee runner. Default tail bumped 150 -> 400.
* `list_served_models` "recent log" snippet seeks to the Traceback line instead of showing the last 6 lines (which was always the bash prompt).

Cookbook auto-adoption sweep on `/api/cookbook/tasks/status`: every 20s (rate-limited) the cookbook SSHes each configured server, finds `serve-*` / `cookbook-*` tmux sessions running an actual model process (vllm/python/llama-server/etc., filtered via `pane_current_command`), and writes them into state.tasks. So when the agent falls back to raw ssh+tmux, the session appears in the Cookbook UI on the next poll.

`serve_model` error path now reads `data["detail"]` in addition to `data["error"]` so the FastAPI HTTPException message ("Invalid characters in cmd") actually reaches the agent instead of being swallowed as a generic "Serve failed". Tool description updated to warn against `cd …`/`source …`/`&&` prefixes.

Intent-without-action supervisor in agent_loop: when the model writes "Let me tail the output" / "I'll check the logs" / "Let me investigate" and ends the turn without emitting a tool call, the loop injects a sharp system nudge ("You said you would X — DO IT NOW") and continues. Capped at 2 nudges per chat so a model that genuinely cannot use the tool does not pin the loop.

Codex/Claude skill parity: adds `/cookbook/cached`, `/cookbook/presets`, `/cookbook/preset/{name}`, `/cookbook/adopt` so external agents have the same surface as the chat agent. SKILL.md docs + odysseus_api.py wrapper updated for both bundles.

`adopt_served_model` promoted to the always-on tool set so the agent has a documented fallback when serve_model rejects a cmd.

Also various cookbook UI tweaks accumulated alongside the above (cookbook.js, cookbookRunning.js, cookbookServe.js, cookbook-diagnosis.js, settings.js, style.css).
This commit is contained in:
pewdiepie-archdaemon
2026-06-04 23:27:18 +09:00
parent 041c03bf11
commit 9112861d8e
19 changed files with 1529 additions and 151 deletions

View File

@@ -330,6 +330,7 @@ If the user asks for a reminder/alarm before the event, pass `reminder_minutes`
"ui_control": "- ```ui_control``` — Control the UI: toggle tools on/off, OPEN PANELS, open email reply drafts, switch models, change themes. Commands: `toggle <name> on/off` (names: bash/shell, web/search, research, incognito, document_editor/documents), `open_panel <name>` (panels: documents, gallery, email, sessions, notes, memories/brain, skills, settings, cookbook), `open_email_reply <uid> <folder> <reply|reply-all|ai-reply>` (opens an email compose document, does NOT send), `set_mode agent/chat`, `switch_model <name>`, `set_theme <preset>`, `create_theme <name> <bg> <fg> <panel> <border> <accent>` (optional key=val for advanced colors AND background effects: bgPattern=<none|dots|synapse|rain|constellations|perlin-flow|petals|sparkles|embers>, bgEffectColor=#RRGGBB, bgEffectIntensity=<num>, bgEffectSize=<num>, frosted=true|false). \"open documents\" / \"open library\" / \"show gallery\" / \"open inbox\" / \"open notes\" / \"open cookbook\" all map to `open_panel <name>`. Theme presets: dark, light, midnight, paper, cyberpunk, retrowave, forest, ocean, ume, copper, terminal, organs, lavender, gpt, claude, cute.",
"list_served_models": "- ```list_served_models``` — Show what the Cookbook (LLM-serving subsystem) is currently running. NO args. Use this for ANY 'what's running' / 'what's serving' / 'show my cookbook' / 'is anything up' query. DO NOT shell out (`ps aux`, `docker ps`, etc.) — this tool is the source of truth. Failed serve tasks include recent logs plus diagnosis/retry suggestions; use those suggestions to call `serve_model` again with an adjusted command when appropriate.",
"stop_served_model": "- ```stop_served_model``` — Stop a running model server. Args (JSON): {\"session_id\": \"<from list_served_models>\"}. Use for 'kill my cookbook' / 'stop the model' / 'shut down vLLM'.",
"tail_serve_output": "- ```tail_serve_output``` — Read the actual tmux stderr/traceback of a CURRENTLY failing cookbook task. Args (JSON): {\"session_id\": \"<from list_served_models>\", \"tail\": 150?}. **Use ONLY after** you just launched something via `serve_model` AND `list_served_models` reports YOUR new task as `crashed`/`error`. DO NOT use it on old stopped/completed download tasks (they're historical noise — won't predict whether a new launch succeeds). DO NOT call it before launching a fresh attempt. When you do call it, bump `tail` to 400+ only if the visible error references 'see root cause above'.",
"download_model": "- ```download_model``` — Download a HuggingFace model. Args (JSON): {\"repo_id\": \"Qwen/Qwen3-8B\", \"host\": \"user@gpu-box\"?, \"include\": \"*Q4_K_M*\"?}.",
"serve_model": "- ```serve_model``` — Start serving a model with vLLM / SGLang / llama.cpp / Ollama / Diffusers. Args (JSON): {\"repo_id\": \"...\", \"cmd\": \"vllm serve ... --port 8000\" or \"python3 -m sglang.launch_server ... --port 30000\" or \"python3 scripts/diffusion_server.py --model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 --port 8100\", \"host\": \"user@gpu-box\"?}. For image/inpaint/diffusion models, use the `scripts/diffusion_server.py` command exactly. After launch, call `list_served_models`; if it returns a diagnosis with an adjusted command, retry with that command.",
"list_downloads": "- ```list_downloads``` — Show in-progress HuggingFace model downloads (filters Cookbook tasks/status to downloads only). NO args. Use for 'what's downloading' / 'show my downloads' / 'check download progress'.",
@@ -1646,6 +1647,28 @@ async def stream_agent_loop(
_tool_type_counts: collections.Counter = collections.Counter()
_THINK_RE = re.compile(r'<think>.*?</think>', re.DOTALL | re.IGNORECASE)
_force_answer = False # set by loop-breaker → next round runs with NO tools
# Supervisor: how many times we've nudged the model after it announced
# an action without emitting the tool call. Capped to prevent a model
# that *can't* call the tool from looping forever.
_intent_nudge_count = 0
_MAX_INTENT_NUDGES = 2
# "I said I would, then didn't" detector. The pattern that breaks debug
# loops on weak models (deepseek-v4-flash mid-2026): the model writes
# "Let me tail the output to see the error" and then ends the turn with
# no tool_calls. The intent is sincere but the function call gets dropped.
# Match the common phrasings + an action verb that maps to an available
# tool, so we don't nudge on harmless transitional text like "let me
# know what you think".
_INTENT_RE = re.compile(
r"(?:^|\n)\s*(?:let me|i'?ll|i will|going to|let's)\s+"
r"(?:tail|check|investigate|look at|see|tail|read|fetch|inspect|"
r"verify|diagnose|examine|debug|capture|grab|pull|view|run|call|"
r"trigger|launch|start|kick off|stop|kill|restart|adopt|serve|"
r"register|adopt|list|search|find|query|hit|ping|test)"
r"\b[^.\n]{0,140}",
re.IGNORECASE,
)
# Document streaming state (persists across rounds)
_doc_acc = "" # accumulated tool-call JSON arguments
@@ -1992,6 +2015,46 @@ async def stream_agent_loop(
# never re-verify an unchanged state in a loop.
_effectful_used = False
continue
# ── Intent-without-action supervisor ─────────────────────
# Catch "Let me tail the output" / "I'll check the logs" /
# "Let me investigate" patterns where the model announces an
# action but emits no tool_call. The bug shows up most on
# smaller models trained to verbalize plans before acting.
# We inject one sharp nudge ("you said you would X — call the
# actual tool now") and loop again. Capped at
# _MAX_INTENT_NUDGES so a model that genuinely cannot use the
# tool doesn't pin us in a forever loop.
_intent_text = _THINK_RE.sub("", cleaned_round).strip()
_intent_match = _INTENT_RE.search(_intent_text) if _intent_text else None
# Only nudge when the round REALLY looks like an unfinished
# promise: short response (<400 chars), no fenced code/answer,
# and an action-intent phrase was matched. Long answers that
# happen to contain "let me know" are not stalls.
_looks_like_promise = (
_intent_match is not None
and len(_intent_text) < 400
and "```" not in _intent_text
and _intent_nudge_count < _MAX_INTENT_NUDGES
)
if _looks_like_promise:
_intent_nudge_count += 1
_matched_phrase = _intent_match.group(0).strip()
logger.info(f"[agent] intent-without-action nudge #{_intent_nudge_count} on round {round_num}: {_matched_phrase!r}")
messages.append({
"role": "system",
"content": (
f"You just wrote: \"{_matched_phrase}\" — but ended the "
"turn without making the actual tool call. The user can "
"see you announced the action but didn't run it, which "
"is the most frustrating thing you can do. "
"DO IT NOW: emit the actual function call this turn. "
"If you decided not to do it after all, say so plainly in "
"one sentence instead of restating the plan."
),
})
# Visible signal in the stream so the user knows we caught it.
yield f'data: {json.dumps({"type": "agent_step", "round": round_num + 1})}\n\n'
continue
break # no tools — done
# ── Loop-breaker (Terminus-style stall detector) ──────────────