* feat(mcp): add Streamable HTTP transport with OAuth 2.0
Odysseus could only reach MCP servers over stdio and SSE, so modern
remote servers like https://mcp.higgsfield.ai/mcp (Streamable HTTP,
gated behind OAuth) could not be connected.
Add an `http` transport that connects via the SDK's
streamablehttp_client and authenticates with the SDK's
OAuthClientProvider: RFC 9728 protected-resource discovery, RFC 8414
authorization-server metadata, Dynamic Client Registration,
authorization-code + PKCE, and token refresh. A small bridge
(src/mcp_oauth.py) connects the SDK's blocking callback to the existing
web callback route via an asyncio.Future keyed by the OAuth `state`,
and the dynamic client registration plus tokens persist per-server in a
new encrypted `oauth_tokens` column.
The connect runs as a bounded background task so the "Add server"
request returns immediately; redirect_handler publishes needs_auth +
auth_url to connection state as soon as discovery/DCR completes (which
can exceed the bounded wait), and the UI polls until connected. Remote
users finish via the existing paste-back flow. The Google OAuth path is
left unchanged.
- core/database.py: encrypted oauth_tokens column + migration
- src/mcp_oauth.py: OAuth provider, DB-backed TokenStorage, state registry
- src/mcp_manager.py: http dispatch, background connect, _connect_http
- routes/mcp_routes.py: http validation, needs_auth/auth_url, callback bridge
- static/js/settings.js: Streamable HTTP option + OAuth flow with polling
- tests: 5 new unit tests (transport dispatch, registry, token storage)
Verified against the live Higgsfield server: discovery, DCR (client_id
issued), loopback redirect accepted, and a PKCE authorization URL with
needs_auth status. No regressions (full suite delta is only the 5 added
passing tests).
* fix(mcp): address PR #1033 review feedback
- mcp_oauth: derive redirect URI from OAUTH_REDIRECT_BASE_URL/APP_PUBLIC_URL
(default http://localhost:7000) instead of hardcoding the port
- mcp_oauth: leave OAuth scope unset so the SDK derives it from the server's
WWW-Authenticate/protected-resource metadata; hardcoding an OIDC scope broke
non-OpenID MCP servers (verified: Higgsfield still gets its server-derived
scope)
- mcp_oauth: prune abandoned OAuth flows (_prune_stale + _pending_ts) so the
module-level registries can't grow unbounded
- mcp_oauth: persist tokens/client-info in a single DB session/commit
(_update) instead of a load+save double round-trip
- mcp_manager: cancel and drop the background connect task in
disconnect_server so a deleted server stops publishing status
- database: document why the oauth_tokens migration uses TEXT while the model
declares EncryptedText (encryption is applied at the Python layer)
- settings.js: surface persistent OAuth-poll failures and an explicit timeout
message instead of silently swallowing errors
- tests: cover the stale-flow pruning
* static/js/settings.js now shows an in-flight loading state on the buttons that fire requests:
Chat models often emit GitHub/Slack-style :shortcode: text (e.g. 😊,
🎤) instead of the actual emoji. The renderer only converted real
Unicode emoji to the monochrome line icons, so shortcodes rendered as literal
text.
Add a pure, browser-free shortcode->Unicode map (emojiShortcodes.js) and run it
inside svgifyEmoji ahead of the existing Unicode->SVG pass, skipping <code>/<pre>
so code stays literal. Covers ~430 common shortcodes plus common aliases
(+1/thumbsup, etc.).
Keep the conversion from touching anything it shouldn't:
* Scope it to chat. mdToHtml/svgifyEmoji take a { shortcodes } option (default
on); document and email body rendering (compose, export, preview) pass it as
false so author-typed :shortcode: text stays literal. The Unicode->SVG pass
still runs there exactly as before.
* Only convert a :shortcode: that stands on its own. A word-boundary guard
leaves embedded colon runs alone, so "1:100:2", "10:30:45", "16:9" and
host:fire:port are never rewritten.
Tests: extend the node-driven unit test with the boundary/false-positive cases,
and fix the markdown-rendering test loader to resolve the new emojiShortcodes
import.
Makes the cookbook venv fallback-chain test deterministic by simulating the inside-venv shell state directly instead of depending on the GitHub runner Python environment. Final focused #2580 CI-baseline cleanup.
Calls execute_tool_block through the live src.tool_execution module in the edit-file admin-gate test so the monkeypatched _owner_is_admin seam and the called function belong to the same module object. Fixes the scoped #2580 CI-order edit-file failure. Remaining Python failure is the unrelated cookbook fallback-chain environment test.
* feat: Add workspace: confine agent tools to a folder
Pick a server folder as the agent's workspace so its file/shell tools work
there and don't touch files outside it. File tools are hard-confined; bash/
python run with cwd set to the folder.
Includes a slash command: `/workspace` (alias `/ws`) — show / `set <path>` /
`clear` / `pick` (open the directory browser).
- routes/workspace_routes.py: GET /api/workspace/browse (admin-only).
- src/tool_execution.py: hard path confinement for read_file/write_file;
bash/python cwd. Threaded route → stream_agent_loop → execute_tool_block.
- src/agent_loop.py: workspace note prepended to the system prompt.
- static/: overflow menu item, input-bar pill, directory-browser modal, and
the /workspace slash command.
- tests/test_workspace_confine.py.
* Wire workspace confinement into tools that landed after this PR
edit_file (#1239) and grep/glob/ls (#1670) merged after workspace-confine was
written, so they bypassed the workspace boundary. Thread the workspace through:
- edit_file: _do_edit_file resolves via _resolve_tool_path_in_workspace
- grep/glob/ls: _resolve_search_root confines to the workspace (root + paths)
- bash/python/bg cwd: workspace or _AGENT_WORKDIR (keep the #2586 data-dir
default when no workspace is set)
Tests cover edit_file + grep/ls confinement (inside ok, outside rejected).
* Workspace picker: editable path bar + modal style cohesion + cross-platform hardening
- Make the current-folder strip an editable address bar: type/paste a full
path and press Enter to navigate (also reaches other Windows drives and
hidden dirs the up-only browser cannot).
- Reuse shared modal CSS: drop bespoke .workspace-modal-content/.workspace-btn*
in favour of base .modal-content/.modal-body and the .confirm-btn button
family; separators/hover use var(--border). Net -31 CSS lines.
- Fix the path field overflowing the modal right edge (flex stretch + margin
vs an overflow:auto scrollbar-feedback loop): full-bleed, no h-margin.
- Cross-platform confinement: normcase the workspace commonpath check so
containment holds on case-insensitive filesystems (Windows/macOS).
- Make tests OS-portable: sibling temp dirs instead of /etc, python os.getcwd()
instead of pwd. 5 pass.
Restores src.webhook_manager after a review-regression test imports it against a fake src.database. Fixes one focused #2580 CI-baseline pollution bucket.
* feat: round-limit handling — Continue affordance at the cap + configurable cap
When the agent loop runs out of rounds (per-message step cap, default 20)
while still actively using tools, it stopped silently mid-task. Now:
1. The loop emits a `rounds_exhausted` SSE event at the cap, and the UI shows
a "Continue" pill at the bottom of the chat that resumes the task from where
it left off. Repeated cap-hits each get a fresh Continue (multiple continues
in a row).
2. The cap is configurable in Settings → Agent ("Max steps per message"),
validated on the client, at the save endpoint, and at the read site.
- src/agent_loop.py: track `_exhausted_rounds` (set only when a full
tool-executing round completes on the last allowed round — i.e. the agent
wanted to keep going); emit `{"type":"rounds_exhausted","rounds":N}` (logged).
- routes/chat_routes.py: read `agent_max_rounds` (clamped 1..200), pass as
`max_rounds`; forward the new event through the SSE relay.
- routes/auth_routes.py: validate numeric settings on save (int + clamp;
agent_max_rounds 1..200, agent_max_tool_calls 0..1000; 400 on non-int).
- src/settings.py: default `agent_max_rounds = 20`.
- static/: Settings input + client-side clamp; the Continue pill (reuses the
existing .stopped-indicator / .continue-btn classes and theme vars
--border/--fg/--bg/--accent); appended to the chat container so it survives
the message re-render at stream finalize. chat.js cache version bumped.
* test: cover rounds_exhausted emission (cap-hit vs normal finish)
Drives the real stream_agent_loop with mocked LLM stream / tool exec / settings:
a tool block every round exhausts the cap and must emit rounds_exhausted; a
plain answer hits the done-break and must not. Guards the for/else logic.
Restores both sys.modules and parent src.database package state after the webhook SSRF tests import src.webhook_manager against the real database module. Fixes one focused #2580 CI-baseline pollution bucket.
Keeps src.request_models real and restores both sys.modules and parent routes.session_routes package attributes after temporary test stubs. Restores one focused part of the Python CI baseline tracked in #2580.
Slash-command replies and the echoed /setup command are persisted to session
history so they render in the transcript, but they are UI chatter the user
never meant as conversation. They were sent to the model on the next turn,
which then commented on '/setup ...' and exposed transient values (e.g. the
Copilot device user_code) to the LLM.
- get_context_messages() (the LLM-API view) now skips messages tagged
metadata.source == 'slash'. Display/history-load paths use raw history and
are unaffected.
- slashCommands.js tags the echoed user command with source:'slash' too (the
assistant replies already carried it); the user line was the one untagged
path that still reached context.
Fixes#2634.
truncate_messages deletes db_messages[keep_count:] (a no-op when
keep_count >= the real message total) then unconditionally wrote
db_session.message_count = keep_count. When keep_count exceeds the
number of messages that actually exist — e.g. the manage_session AI
tool defaults keep_count to 10, and the HTTP truncate endpoint passes
any client value — the persisted count is set too high (10 on a
3-message session), diverging from the real row count. That column
gates lazy DB-hydration in get_session (message_count > 0) and is
surfaced to the history UI, so it is correctness-relevant. Clamp to
min(keep_count, len(db_messages)); the in-memory slice already caps
naturally.
* feat(provider): add GitHub Copilot provider with device-flow auth
Adds GitHub Copilot as a model provider, so Copilot models (gpt-4o/4.1/5,
Claude, Gemini, …) work through the normal chat + agent loop, incl. native
tool calling and vision.
Auth is one-click via the GitHub OAuth device flow; the access token is stored
as the endpoint's (encrypted) api_key and sent directly as `Authorization:
Bearer` (no Copilot-token exchange, no refresh — matching how editors talk to
the Copilot API). Copilot is a normal ModelEndpoint detected by host; the only
provider-specific behaviour is a small set of required request headers,
injected centrally.
Sign-in is available from Settings → model endpoints ("Connect GitHub
Copilot") and from chat via `/setup copilot`.
- src/copilot.py (new), routes/copilot_routes.py (new): constants, header
builders, device-flow start/poll, model discovery, owner-scoped endpoint
provisioning.
- src/llm_core.py, src/endpoint_resolver.py: detect `copilot`, inject headers,
per-request x-initiator/vision.
- src/agent_loop.py: allowlist api.githubcopilot.com for native tool schemas.
- src/model_context.py: known context windows for Copilot (no unauthenticated
/models probe).
- static/, README, tests/test_copilot*.py.
* Tidy copilot_routes: clarify supports_tools, note _PENDING is per-process
* fix: renaming a user leaves their API tokens resolving to the old owner
* Drive rename token-cache test through the real auth resolver instead of patching a closure
* fix(llm): auto-detect <think> in content stream for unregistered thinking models
_THINKING_MODEL_PATTERNS only covers known model families by name. Qwen3-derived
models with non-standard names (e.g. Qwopus, custom QwQ forks) are not matched,
so their <think>...</think> content streams through as visible chat text instead
of being routed to the thinking display.
When the first content delta opens with <think> and the model was not already
identified as a thinking model, dynamically flag the stream as a thinking model
for the remainder of the response. This enables the existing </think> repair path
(line below) and ensures the frontend receives the full <think>...</think> wrapper
it needs to split thinking from the final answer.
The check is restricted to the very first content delta (_first_content_sent is
False) to avoid misidentifying models that happen to write "<think>" mid-answer.
Fixes#2225
Related: #2420 (covered by separate PR from @AmmarS-Analyst), #2224 (@RaresKeY)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(llm): replace inert _thinking_model flag with _in_think_tag state machine
The original auto-detect set _thinking_model=True on the first <think> chunk
but still emitted it as a regular delta and set _first_content_sent=True
immediately, so no subsequent chunk could enter the repair path.
Replace with _in_think_tag bool: enter thinking mode when first content starts
with <think>, route all chunks to the thinking channel until </think> is found,
then the tail becomes the first regular delta. Adds three regression tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(llm): replace _first_content_sent guard with _think_open_stripped
Opening-tag stripping used `not _first_content_sent` as the guard, but
_first_content_sent stays False throughout the entire think block (it only
flips when regular content is emitted). So `find(">")` ran on every
reasoning chunk — not just the first — and silently truncated everything
before the first ">" in any reasoning text containing comparisons, arrows,
or code.
Fix: add `_think_open_stripped = False` alongside `_in_think_tag`. Use it
as the strip guard in both the "still inside <think>" path and the
"</think> found in same chunk" split path. Set it True once the opening
tag is consumed so all subsequent chunks reach the thinking channel
unmolested.
Add regression test: 3-chunk stream where the middle chunk contains
"c > d" — confirms "more c " is not dropped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes module-level core.database stubbing from the compare endpoint owner-scope regression test and patches ModelEndpoint per test with monkeypatch. Restores one focused part of the Python CI baseline tracked in #2580.
Odysseus only supports llama.cpp on Windows (vLLM/SGLang are
explicitly blocked). llama.cpp requires GGUF, so AWQ/GPTQ/FP8
safetensors models without a GGUF alternate should not be
recommended in the Cookbook on Windows hosts.
Changes:
- hardware.py: add 'platform': 'windows' to _detect_windows()
so downstream logic can identify Windows hosts.
- fit.py: include is_windows in the existing GGUF-only filter
alongside apple_silicon and consumer_amd.
- tests: add test_hwfit_windows.py with regression tests.
Fixes#122, #614 (root cause: unservable models recommended).
Updates the stale gallery owner-filter null-user test to match current single-user/auth-disabled behavior. Restores one focused part of the Python CI baseline tracked in #2580.
Updates the PDF marker regression test to check corrupted markers at line level instead of using a broad substring assertion. Restores one focused part of the Python CI baseline tracked in #2580.
Updates the split_chunks containment regression test to use deterministic non-repeating records instead of a repeating fixture that could produce accidental substring matches. Restores one focused part of the Python CI baseline tracked in #2580.
Updates endpoint/model-route test HTTP mocks to accept the verify keyword argument passed by endpoint probing code. Restores one focused part of the Python CI baseline tracked in #2580.
Gives the agent first-class code navigation instead of shelling out via bash
(token-heavy, unreliable on weaker models, unstructured). Mirrors the
Grep/Glob/Read primitives that Claude Code / opencode expose.
- grep: regex search over file contents across a tree. Uses ripgrep when
available (with explicit excludes so junk dirs are skipped even without a
.gitignore); falls back to a pure-Python walk+regex when rg is absent.
Returns file:line:match, capped.
- glob: find files by glob pattern (recursive), newest first.
- ls: list a directory (folders first, then files with sizes).
- read_file: optional offset/limit for line-range reads of large files
(plain-path calls stay back-compatible).
All confined by the same path policy as read_file (_resolve_tool_path:
data/tmp allowlist + sensitive-file deny). Junk dirs (.git, node_modules,
venv, __pycache__, dist/build, …) skipped. Output capped (200 hits,
400 chars/line). Admin-gated like the other filesystem tools.
Wiring: schemas + native arg->content serializer (src/tool_schemas.py), tool
tags (src/agent_tools.py), always-available + descriptions (src/tool_index.py),
admin gate (src/tool_security.py), dispatch + impls (src/tool_execution.py).
Tests: tests/test_code_nav_tools.py — match/skip-junk/ignore-case/glob-filter,
allowlist rejection, glob/ls, read-range, and the no-ripgrep Python fallback.
* Add edit_file tool + file-change diffs
edit_file is an exact old_string -> new_string replacement on a file on disk
(fails if old_string is missing or non-unique unless replace_all); write_file
also returns a unified diff. Diffs render collapsed in the tool bubble
(filename + +adds/-dels, theme colors); the raw JSON command box is hidden.
Security: edit_file is a sensitive filesystem-write tool, treated everywhere
write_file is —
- added to NON_ADMIN_BLOCKED_TOOLS (is_public_blocked_tool / blocked_tools_for_owner),
so on auth-enabled deployments a non-admin cannot run it; execute_tool_block
refuses it for non-admin owners.
- confined by the same path policy as read_file/write_file (allowlist +
sensitive-file deny) via _resolve_tool_path.
Disambiguation in tool descriptions + bash prompt: edit_file/write_file are the
only way to write files (they show a diff) — never edit_document (editor panel)
or a bash heredoc/redirect.
Tests (tests/test_edit_file.py): non-admin block (policy + execution gate),
successful edit, not-found old_string, non-unique old_string (+ replace_all),
and path outside the allowed roots.
Files: src/tool_execution.py, src/agent_loop.py, src/tool_schemas.py,
src/agent_tools.py, src/tool_index.py, static/js/chat.js, static/style.css,
tests/test_edit_file.py.
* Drop redundant import os in write_file closure
os is already imported at module top.
* chore: dedupe src/search/cache.py into a re-export shim
src/search/cache.py was a byte-identical copy of services/search/cache.py.
Convert it to a sys.modules alias of the canonical services module (matching
src/search/core.py, providers.py, ranking.py) so the two cannot drift, and add
an identity assertion to test_search_module_consolidation.py.
content.py and query.py are intentionally left as-is: the copies have drifted
and services lacks fixes that src has, so they need services reconciled first
before they can be shimmed safely.
* chore: dedupe src/search content.py and query.py into shims
Convert src/search/content.py and query.py to sys.modules aliases of the
canonical services/search/* (matching cache.py, core.py, providers.py,
ranking.py) so the duplicate copies cannot drift.
Repoint the two tests that were coupled to the src-copy internals onto the
canonical services surface (behaviour is equivalent):
- test_src_search_query_nonstring.py: import services.search.query instead of
loading the src file by path.
- test_security_regressions.py::test_web_fetch_guard_blocks_redirect_into_private:
mock httpx.get (services uses the module-level get, not httpx.Client) and
assert on the canonical 'Blocked' message.
Drop the now-redundant [src_content, service_content] parametrization in
test_search_content_extraction_parity.py and test_search_content_url_guards.py
(after the shim both params are the same object); add content/query identity
assertions to test_search_module_consolidation.py.
comprehensive_web_search now called with (query, max_pages, return_sources)
and returns a tuple (_context, results). The test mock still used the old
async signature with max_results/fetch_content and returned a plain list,
causing TypeError on every run.
Fixes#2331
* fix: SSE parser crashes with NoneType on MiniMax-M3 (and any provider sending null choice/usage/tc)
Three guards added in stream_llm:
1. choices[0] null check — MiniMax (and some other providers) send a
choices entry as None. `_choices[0].get("delta")` raised
AttributeError. Now checks `_choices[0] is not None` before calling
.get().
2. usage null guard — j["usage"] can arrive as None (not a dict) on
some providers. Added `or {}` so subsequent .get() calls don't crash.
3. tool_calls null entry skip — individual entries in the tool_calls
array can be None. Added `if tc is None: continue` before
tc.get("function").
All three match the `or {}` / null-guard pattern used elsewhere in the
same block. Safe for all OpenAI-compatible providers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: guard null choice in elif-choices SSE branch
The usage-chunk path already guarded _choices[0] is not None, but the
elif "choices" branch that processes content/tool-call deltas did not.
A chunk like {"choices": [null]} or {"choices": [null], "usage": null}
reaches j["choices"][0].get("delta") and crashes with:
'NoneType' object has no attribute 'get'
Fix: extract choices[0] into _c0 and continue to the next chunk when
it is None, matching the guard already applied in the usage path.
Adds three focused regressions covering the paths the maintainer flagged:
- {"choices": [null]}
- {"choices": [null], "usage": null}
- tool_calls array containing a null entry alongside a valid call
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
MCP server tools were presented to the agent with only their name and a
truncated description: get_tool_descriptions_for_prompt() emitted
"- name: description" and get_all_tools() dropped input_schema entirely.
On the fenced-block tool path (used by Ollama models), the agent could
not see a tool's declared inputs and guessed argument names from the
description alone, so tool calls failed (issue #2509). MCP inspector
showed the schemas fine, confirming the loss was on our side.
- get_all_tools() now carries each tool's input_schema.
- get_tool_descriptions_for_prompt() renders a compact args hint
(parameter names, coarse types, required-ness) via a new
_format_mcp_params() helper, matching the "Args (JSON): {...}" style
the built-in tool descriptions already use.
Fixes#2509
The visual research report is assembled from LLM output over crawled web
pages (untrusted content) and served under a relaxed `script-src
'unsafe-inline'` CSP. Two values reached that HTML without sanitization:
- `_md_to_html` rendered the report markdown via python-markdown, which
passes raw HTML through verbatim, so `<script>` / `<img onerror>` /
`<svg onload>` / `javascript:` links carried in crawled content ran in
the app origin.
- `category` (from the /api/research/start request body, no enum check) was
interpolated raw into `<body class="category-{category}">`.
Allowlist-sanitize the rendered markdown with nh3, keeping the formatting
the report emits (tables, code, details/summary, toc anchors, codehilite
classes, external-link target/rel) while dropping active content, and
html.escape the category. Adds regression tests.
Adding GigaChat (Sber) or an on-premise enterprise LLM gateway as a
model endpoint fails on first probe with
CERTIFICATE_VERIFY_FAILED: self-signed certificate in certificate
chain (_ssl.c:1000)
because their TLS chain is signed by a private root CA (Russian Trusted
Root CA for GigaChat; corporate CA for on-prem) that isn't part of the
default system / certifi trust store. The endpoint shows offline in
the picker even though the URL and API key are correct (issue #722).
The right fix is to extend the trust store, not to weaken verification.
This change:
- src/tls_overrides.py: new module that resolves an opt-in env var
LLM_CA_BUNDLE at import time, builds a shared SSLContext via
ssl.create_default_context() (so the system / certifi bundle is
loaded first) and layers the operator's PEM on top with
load_verify_locations(). Exposes llm_verify() returning a value
suitable for httpx `verify=`. Defaults to True (httpx built-in
trust) when the env var is unset, when the file is missing, or
when the PEM fails to load — verification is never silently
disabled, the warning is logged and we fall back to the safe path.
- src/llm_core.py: thread llm_verify() into the shared AsyncClient
used by stream_llm / streaming completions.
- routes/model_routes.py: thread llm_verify() into the five httpx.get
call sites in _probe_endpoint / _ping_endpoint so adding a
private-CA endpoint goes green on the very first probe and the
picker stops showing it offline.
- .env.example: document LLM_CA_BUNDLE with the GigaChat case as the
concrete example.
Deliberately NOT included: a verify=False knob (global or per-host).
Disabling verification exposes the affected endpoint to MITM, and the
operator-supplied bundle is the correct fix for legitimate private-CA
providers — so the only switch in this PR is the safe one.
Closes#722.
Pip dependency installs are tracked as download tasks but finish with the
runner's "=== Process exited with code 0 ===" sentinel and pip's
"Successfully installed" line — never the HuggingFace download markers
(DONE / 100% / /snapshots/ / DOWNLOAD_OK) the download heuristics look for.
Once the tmux pane is gone, the backend's only completion check is the HF
cache lookup, which a pip package (e.g. llama-cpp-python[server], no "/")
never matches, so it reports "stopped" — and the frontend maps a stopped
download to "crashed". The reconnect loop's session-gone heuristic had the
same gap. Result: a clean install (exit 0) showed "crashed" in the Running
tab while the Dependencies tab correctly showed it installed.
Add a shared _depInstallSucceeded() helper that keys off the exit-0
sentinel (falling back to pip's success line, rejecting ERROR/Traceback)
and wire it into both the session-gone heuristic and the background status
reconciler, gated on payload._dep so real model downloads are unaffected.
Also fixes the pre-existing test_background_status_poll_reconciles_into_local_tasks
assertion that no longer matched the evolved reconciler, and adds regression
coverage for both paths.