Surface silent model fallback instead of masking it (#868)

When the selected model fails before producing output, stream_llm_with_fallback quietly switches to the next candidate and the reply is shown under the originally selected model's name, so a misconfigured provider looks like it works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request appears fine because another model silently answers under the Claude label.) Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first time a non-primary candidate produces output, forward it through the agent loop and both chat-route paths, stamp the response metrics with the model that actually answered, and show a notice + relabel the reply in the UI. Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass); python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py; node --check static/js/chat.js.
2026-06-02 04:37:25 +02:00
parent 2d6b777799
commit 6776c7d691
5 changed files with 135 additions and 2 deletions
--- a/src/agent_loop.py
+++ b/src/agent_loop.py
@@ -1638,6 +1638,12 @@ async def stream_agent_loop(
                        real_output_tokens += u.get("output_tokens", 0)
                        last_round_input_tokens = round_input
                        has_real_usage = True
+                    elif data.get("type") == "fallback":
+                        # The selected model failed and another answered; surface
+                        # the notice so a misconfigured provider isn't masked.
+                        logger.warning(f"[agent] round {round_num} fell back: "
+                                       f"{data.get('selected_model')} -> {data.get('answered_by')}")
+                        yield chunk
                    elif "delta" in data:
                        if not first_token_received:
                            time_to_first_token = time.time() - total_start