fix(cookbook): surface backend diagnosis when serve fails in background (#1636)

* refactor(cookbook): move _diagnose_serve_output to module level in cookbook_helpers Extracts the nested _diagnose_serve_output function from inside setup_cookbook_routes() and moves it to module level in cookbook_helpers.py, alongside the other helper functions it logically belongs with. No behaviour change — the function is now importable directly for testing and by other callers without going through the route factory closure. * fix(cookbook): surface backend diagnosis when serve fails in background The background poll (_pollBackgroundStatus) already received `diagnosis` and `cmd` from /api/cookbook/tasks/status but discarded both. When a serve job died while the Cookbook modal was closed, reopening it showed only a red error badge with no context. - Persist live.diagnosis into task._backendDiagnosis in localStorage so it survives modal close/reopen and page refresh - Persist live.cmd into task.payload._cmd for agent-spawned tasks so the crash report includes the actual command - After _renderRunningTab(), walk rendered cards and call _showDiagnosis() for any that have a stored _backendDiagnosis but no panel yet - In _renderTaskCard(), use _backendDiagnosis as a fallback when the client-side _terminalServeDiagnosis() finds nothing * test(cookbook): add coverage for _diagnose_serve_output error patterns 10 tests verifying the 16 serve-failure patterns: - CUDA OOM, port-in-use, vLLM missing, gated model - Traceback fallback fires without startup success marker - Traceback suppressed when server actually started - Clean/empty output returns None - trust-remote-code and no-GGUF patterns
2026-06-05 05:52:07 -03:00
parent 367858a587
commit f5d834b0c5
4 changed files with 210 additions and 122 deletions
--- a/tests/test_cookbook_error_feedback.py
+++ b/tests/test_cookbook_error_feedback.py
@@ -0,0 +1,72 @@
+from routes.cookbook_helpers import _diagnose_serve_output
+
+
+def test_cuda_oom_returns_diagnosis():
+    out = "torch.cuda.OutOfMemoryError: CUDA out of memory."
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "memory" in result["message"].lower()
+    assert any(s["op"] == "replace" for s in result["suggestions"])
+
+
+def test_port_in_use_returns_diagnosis():
+    out = "OSError: [Errno 98] Address already in use"
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "port" in result["message"].lower()
+    assert result["suggestions"][0]["flag"] == "--port"
+
+
+def test_vllm_not_installed_returns_diagnosis():
+    out = "No module named vllm"
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "vLLM" in result["message"]
+    assert result["suggestions"][0]["package"] == "vllm"
+
+
+def test_gated_model_returns_diagnosis():
+    out = "403 Forbidden\nAccess to model is restricted"
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "gated" in result["message"].lower() or "unauthorized" in result["message"].lower()
+
+
+def test_traceback_fallback_fires_without_startup_success():
+    out = "Traceback (most recent call last):\n  File 'serve.py', line 1\nRuntimeError: bad config"
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "traceback" in result["message"].lower()
+
+
+def test_traceback_suppressed_when_server_started():
+    out = (
+        "Traceback (most recent call last):\n  File 'x.py'\nValueError: ...\n"
+        "Application startup complete."
+    )
+    result = _diagnose_serve_output(out)
+    assert result is None
+
+
+def test_clean_output_returns_none():
+    out = "INFO: Application startup complete.\nINFO: Uvicorn running on http://0.0.0.0:8000"
+    assert _diagnose_serve_output(out) is None
+
+
+def test_empty_input_returns_none():
+    assert _diagnose_serve_output("") is None
+    assert _diagnose_serve_output(None) is None
+
+
+def test_trust_remote_code_pattern():
+    out = "Please pass trust_remote_code=True when loading this model."
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "--trust-remote-code" in result["suggestions"][0]["arg"]
+
+
+def test_no_gguf_found_pattern():
+    out = "No GGUF found on this host for model qwen/qwen2-7b"
+    result = _diagnose_serve_output(out)
+    assert result is not None
+    assert "GGUF" in result["message"]