Scan port 1234 and any custom port from LM_STUDIO_URL, add the LM_STUDIO_URL host to the discovery sweep alongside the Ollama env vars, and tag each discovered endpoint with its provider by fingerprinting the native /api/v1/models response (entries carrying key + architecture). Documents LM_STUDIO_URL in .env.example.
Follow-up to the Venice provider PR. Wire api.venice.ai into the three
host allowlists so Venice behaves like the other paid OpenAI-compatible
clouds:
- agent_loop: add api.venice.ai to _API_HOSTS so the agent sends native
OpenAI tool-call schemas (Venice supports function calling) instead of
degrading to fenced-block parsing.
- teacher_escalation: add api.venice.ai to _SOTA_HOSTS so the escalation
loop stays OFF for Venice (it's a paid top-tier API; no need to add
teacher-model latency).
- webhook_routes: add venice to KNOWN_PROVIDERS so the sync chat webhook
can auto-resolve base_url from provider=venice.
Tests: tests/test_venice_hosts.py pins tool-host matching + SOTA
classification for Venice; py_compile on touched modules.
Co-authored-by: Cursor <cursoragent@cursor.com>
Two bugs in the export_ics path:
1. X-WR-CALNAME was written raw: calendar names containing commas,
semicolons or backslashes produced invalid ICS (RFC 5545 §3.3.11
requires those characters to be escaped as \, \; and \\).
Fix: wrap cal.name in the existing _ics_escape() helper, which is
already used for SUMMARY, DESCRIPTION, and LOCATION on the lines
immediately below.
2. DTSTART and DTEND on non-all-day events always emitted the naive
ISO string (e.g. 20260602T100000) regardless of CalendarEvent.is_utc.
Consumers treat a naive datetime as floating/local time, so UTC
events imported into Google Calendar or Apple Calendar shifted by
the user's timezone offset. Fix: append 'Z' when is_utc is True,
matching the pattern already used by the serialise_event() helper
at line 408.
Read a model's capabilities.vision flag from LM Studio's native /api/v1/models so vision finetunes whose names lack a vision keyword still receive images, falling back to the name heuristic when the endpoint doesn't report it. The probe is short-TTL cached and restricted to local/LAN hosts, so remote/cloud endpoints are never contacted.
The Cookbook → Dependencies tab reported llama-cpp-python[server] as "not
installed" even when it was installed and usable for serving. The local check
looked up distribution metadata as pkg["name"].replace("_", "-") — for the
import name `llama_cpp` that yields "llama-cpp", but the module ships in the
`llama-cpp-python` distribution. importlib.metadata.version("llama-cpp") then
raised PackageNotFoundError and the package was marked missing (the import
itself succeeds, which is why serving still worked).
Derive the distribution name from the package's declared pip spec instead
(stripping [extras] and version markers), falling back to the munged import
name only when no pip spec is declared. New _pip_dist_name() helper.
Adds tests/test_cookbook_package_detection.py covering the llama_cpp mapping,
extras/marker stripping, plain names, the no-pip-spec fallback, and that the
route wires the helper in (guarding against the exact regression).
add_document() and add_documents_batch() derive the persistent ChromaDB
document id from Python's built-in hash():
doc_id = f"doc_{hash(text) % 10**16}"
str hashing is randomized per process (PYTHONHASHSEED is on by default), so
the same document text gets a different doc_id on every restart. The dedup
check right after — self._collection.get(ids=[doc_id]) — therefore misses
on restart, and identical documents are re-embedded and re-added as
duplicates each time the app restarts, bloating the vector store and
skewing retrieval.
Derive the id from a stable hashlib.sha256 of the text via a shared
_generate_doc_id() helper, used by both add paths so they agree.
tests/test_rag_vector_id_stability.py runs _generate_doc_id in subprocesses
under PYTHONHASHSEED=0/1/random and asserts the id is identical across all
of them (and differs for different text). Fails before this change.
* Load .env in start-macos.sh for APP_PORT and APP_BIND
Parses .env at startup (consistent with how app.py reads it via
python-dotenv) so APP_PORT and APP_BIND are honoured without having
to retype them on the command line every run.
Resolution order: shell env (ODYSSEUS_PORT / ODYSSEUS_HOST) → .env
(APP_PORT / APP_BIND) → built-in defaults. Existing ODYSSEUS_* shell
overrides are fully preserved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Document .env support for APP_PORT and APP_BIND in macOS section
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Docking a modal to a window edge pushes the chat aside (body padding via
right-dock-active + --right-dock-w). Three problems on close/reopen:
1. Chat stayed offset after closing a docked modal. The close-watcher only
reacted to the `.hidden` class or DOM removal, but the draggable modals
(calendar, plan, workspace, document, …) close via inline `display:none`.
Watch the `style` attribute too and treat `display:none` as closed.
2. Reopening a previously-docked singleton modal floated it off to the side,
overlapping the chat. The reused element kept its docked inline geometry.
Clear the content's inline position/size on close so it reopens at its CSS
default (centered).
3. Undock wasn't animated. The transition lived on `.right/left-dock-active`,
so removing the class dropped the transition with it and padding snapped to
0. Move the transition to the base `body` so the push animates both ways.
Files: static/js/modalSnap.js, static/style.css.
Checks: node --check static/js/modalSnap.js; verified in-browser (dock → close
→ chat animates back; reopen → centered, no overlap).
Multimodal content (list of {type, text/image_url} blocks) couldn't be
stored in the DB Text column, causing silent persist failures. On reload
the frontend fell back to String() on the array, rendering
[object Object],[object Object] in the chat.
- Serialize list content as JSON in _persist_message()
- Deserialize back to list in _db_to_session() via _parse_msg_content()
- Extract text parts from multimodal arrays in sessions.js instead of
String() coercion
* fix: only extract quotes whose closing quote matches the opening one
* fix: same mismatched-quote bug in the services search copy
* test: extract_quotes requires matching open/close quotes
scroll-snap-type: y mandatory (docs/index.html:28) forces the viewport to
always rest on a snap point. The footer is far shorter than a viewport, so
scrolling down past the last min-height:100vh section snaps back to that
section's start and the footer can never settle in view. Switch the snap
type to 'proximity' so sections still snap when the user is near them but
the footer (and any sub-viewport tail) is freely reachable.
Fixes#8
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CJK and other IME users confirm a candidate from the input-method popup by pressing Enter. The chat composer and the in-place message editor each bind a keydown handler that treats Enter (without Shift) as "submit", but they did not exclude the composition state. Pressing Enter to accept an IME candidate therefore sent the half-composed text (e.g. a stray "ce's") instead of just confirming the candidate.
These textareas intentionally hijack Enter to submit (Enter sends, Shift+Enter inserts a newline), which bypasses the browser's native form submission and the IME guard that comes with it, so the guard has to be re-added explicitly.
Add '&& !e.isComposing' to the three Enter-to-submit handlers: static/app.js (the main composer's button-submit path and its send/new-chat path) and static/js/chat.js (the editor for an already-sent message). Normal Enter (isComposing false) still submits; Shift+Enter still inserts a newline.
Tested: node --check on both files; manually verified with a Chinese IME that pressing Enter to pick a candidate no longer sends, and a message is sent only after composition ends.
* Cookbook fit: consumer-AMD GGUF recommendations + accurate estimates (core logic)
Split of #746 — the estimate/ranking MATH only, so it can be reviewed with tests
first (UI changes follow separately). Backend files only: no static/js here.
services/hwfit/fit.py, services/hwfit/hardware.py:
- Recommend GGUF/llama.cpp on consumer AMD (RDNA, gfx10/11/12) instead of
formats that don't run on consumer Radeon — vLLM-only AWQ/GPTQ/FP8 AND
vendor-specific NVFP4 (NVIDIA) / MLX (Apple). Datacenter Instinct (CDNA) and
CUDA are left untouched.
- More accurate speed estimates across more GPUs (adds RDNA bandwidth data).
- Detect AMD/RDNA GPUs (gpu_family from rocminfo) so fit/serve can branch on it.
tests/test_hwfit_amd.py: AMD recommendation path, quant/bit matching, estimate
realism, gfx RDNA-vs-CDNA classification.
Rebased onto current main (analyze_model gained a scoring_use_case param there;
kept it). Vision detection intentionally NOT added here — main already ships a
"Vision" type filter + multimodal use-case handling; duplicating it was dropped.
Checks: py_compile clean; pytest tests/test_hwfit_amd.py + hwfit/serve suites
= 28 passed; full suite 0 new failures vs main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Tests: assert NVFP4/MLX/FP8 formats are filtered on consumer RDNA
Backs the #972 claim with an explicit regression: no NVIDIA NVFP4, Apple MLX,
or vLLM-only FP8/AWQ/GPTQ repos are recommended on a consumer Radeon, and guards
against vacuity by asserting such repos exist in the catalog.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: omit temperature for OpenAI reasoning models (o1/o3/o4/gpt-5)
These models only accept the default temperature; sending any explicit
value (even 0.0) returns HTTP 400 "Only the default (1) value is
supported". This broke two paths:
- Endpoint probing in _probe_single_model hardcodes temperature: 0.0, so
a perfectly valid o3/gpt-5 endpoint is reported as failing in the
Model Endpoints health check.
- Chat/stream payloads send temperature unconditionally, so a non-default
temperature preset 400s on these models.
The code already special-cases the same model family for
max_completion_tokens, so this adds a sibling _restricts_temperature()
helper and omits the field for those models, letting the API use its
required default. gpt-4.5 is intentionally excluded (not a reasoning
model; accepts temperature normally).
Adds tests/test_llm_core_temperature.py covering the predicate and the
synchronous payload builder.
* fix: also omit temperature for reasoning models on the direct-POST paths
The first commit only covered llm_call/llm_call_async/stream_llm and the
endpoint probe. Email auto-summary, urgency-less spam classification, the
email reply-summary endpoint, and gallery vision tagging build their
OpenAI payloads inline and POST them directly (requests/httpx), bypassing
llm_core — so a reasoning model configured there would still 400 on the
temperature field. These sites already branch on _uses_max_completion_tokens,
so they're the same class; added the matching _restricts_temperature guard.
gallery_routes also gains the max_completion_tokens branch it was missing,
so gpt-5 vision tagging works end to end.
Note: email_pollers urgency scoring goes through llm_call_async and was
already covered.
Surfaces the research_run_timeout_seconds setting (added in #783) in
Settings → Research as a "Max Time" field, and lets 0 disable the
wall-clock cap entirely for long deep-research runs.
- settings.py: document that 0 disables the cap; default stays 1800s.
- research_handler.py: resolve 0 (or negative) to no timeout
(asyncio.wait_for timeout=None); other values stay bounded to
[60, 86400] as before.
- index.html / settings.js: "Max Time" input bound to
research_run_timeout_seconds, validated to {0} ∪ [60, 86400], with
copy making explicit that 0 = no limit (unbounded model/API cost).
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Accessibility: ARIA labels and toggle states
Screen readers couldn't name several icon-only controls or tell whether the
tool toggles were on. This adds accessible names and exposes toggle state,
with no behavior or layout change.
- Icon-only buttons get aria-label: web/shell tool toggles, the "more tools"
overflow button (+ aria-haspopup), and the color-reset buttons.
- Unlabeled inputs/selects get aria-label: memory + skills search, model-picker
search, memory sort, theme font/density selects, and the new-memory / skill
(title, when-to-use, how, tags) fields, which only had a visual floating label.
- Toggle state via aria-pressed, kept in sync at the existing .active write
sites: web/shell toggles (setupToggle) and the Agent/Chat mode buttons
(initModeToggle). Static aria-pressed added in the markup so the attribute
exists before JS runs.
Scope: first slice of the ROADMAP accessibility pass. Focus-visible/contrast,
reduced-motion, and modal dialog roles/focus-trap are left for follow-ups.
Checks: node --check static/app.js. No Python touched.
* Accessibility: mark chat log busy while streaming
The chat log is an aria-live="polite" region, so streaming a response
token-by-token made screen readers announce every partial update — noisy and
unreadable. Set aria-busy="true" on #chat-history while a response streams and
back to "false" in the stream's finally block. Assistive tech then waits for
the settled message and announces it once.
Checks: node --check static/js/chat.js.
* Chat metrics: show backend's true generation t/s, not tokens÷wall-clock
The per-message tokens/sec read low and felt wrong because it was computed as
output_tokens / total_duration, where total_duration is wall-clock including
prefill, tool calls, and network — not pure decode time. llama.cpp already
reports the correct gen speed in its stream (timings.predicted_per_second), but
it was being dropped.
- llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the
sibling `timings` block llama.cpp includes — pass predicted_per_second through
as gen_tps and prompt_per_second as prefill_tps on the usage event.
- agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events;
in _compute_final_metrics prefer backend_gen_tps over the wall-clock division
when present (fall back to computed for cloud APIs that omit timings). Tag the
result with tps_source ("backend" vs "computed") and surface prefill_tps.
Result: the displayed t/s now matches the model's real decode speed and is
stable regardless of prompt length (a long prefill no longer deflates it).
Checks: py_compile passes; verified extraction against a real llama.cpp final
chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Chat metrics: surface true t/s on the direct-chat path too
Follow-up to the gen-tps work: the non-agent direct-chat stream path in
chat_routes turned the raw `usage` event straight into a metrics event but only
copied token counts — it never set tokens_per_second or response_time. So simple
(non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell
back to a bare token count ("27 tok") instead of t/s.
Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in
the prior commit) into tokens_per_second here too, tag tps_source=backend, and
set response_time from wall-clock for the stats popup.
Checks: py_compile passes; verified llama.cpp emits usage+timings on the final
stream chunk (gen ~90 t/s) that this path consumes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Tests: backend gen/prefill t/s passthrough and preference
Cover the two pieces of the true-t/s metric so it can be reviewed on its own:
- stream_llm surfaces llama.cpp's timings.predicted_per_second /
prompt_per_second as gen_tps / prefill_tps on the usage event (captured
llama.cpp final-chunk fixture), and omits them when the backend reports no
timings.
- _compute_final_metrics prefers backend_gen_tps over output/wall-clock,
tags tps_source ("backend" vs "computed"), and surfaces prefill_tps.
Reuses the fake-client stream harness from test_llm_core_streaming.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(upload): atomic-rename writes for uploads.json + .bak recovery
UploadHandler.save_upload does a read-modify-write of uploads.json via
two open(..., 'w') + json.dump blocks, with no lock, no temp+rename, and
no recovery. N concurrent inserts lost N-1 entries (last writer wins
after the read snapshot is taken); a SIGKILL/SIGTERM mid-json.dump
truncated the file and the bare 'except Exception: logger.warning(...)'
recovery path returned {}, silently dropping every prior upload.
The handler now serialises the RMW under a per-instance threading.Lock
and writes through _atomic_write_json, which writes to a tempfile in
the same directory, fsyncs, snapshots the previous live to .bak, and
renames the temp onto the target via os.replace. os.replace is atomic
on POSIX, so a reader sees either the old or the new state, never a
half-written file. _load_upload_index tries the live file first, then
falls back to the .bak sibling if the live is corrupt.
Cross-process safety is still on the deployer: gunicorn workers on
the same uploads dir will race the lock, and the atomic-rename is the
kernel-level guarantee that prevents torn reads. If multi-worker
writes are expected, fcntl.flock around the rename is a follow-up;
single-worker and async deployments are correct as-is.
* fix(upload): reload uploads.json inside _index_lock on dedupe path
The duplicate-detection branch in save_upload() was reading uploads.json
*before* taking _index_lock, then writing that stale snapshot under the
lock. A duplicate upload racing with a new-entry insert could clobber
the new entry because the duplicate's snapshot predated the insert.
The new-entry branch already reloaded inside the lock; the duplicate
branch now does the same. It also re-resolves the storage key inside
the lock, because a concurrent insert can have changed the dict's keys.
If the entry has been cleaned up between the outer read and the inner
write, the function falls through to the fresh-insert path instead of
silently writing a stale row.
Boundary note: the _index_lock serialises writers within a single
Python process. Cross-process / multi-worker deployments still need
flock or a database; the inline comment is updated to make this
explicit. The atomic-rename write keeps the on-disk state consistent
but does not serialise writers across processes.
Tests:
- Existing concurrent-insert and partial-write-recovery tests still pass.
- New test_atomic_write_primitives_present_in_production_code asserts
the production module has at least two 'with self._index_lock:' blocks
(regression net for this fix).
- New smoke tests: normal upload, duplicate detection, info lookup
after a backup-recovery scenario.
A session that exists only in the in-memory SessionManager — never persisted,
or whose DB row was removed out-of-band — was listed by GET /api/sessions (the
list is built from the in-memory manager) but 404'd on every per-session
operation, so it could never be deleted.
Two causes, both fixed:
1. _verify_session_owner() only consulted the DB and raised 404 when no row
existed. It now falls back to the in-memory session's owner when (and only
when) a session_manager is supplied and the caller actually owns the ghost.
The DB row stays authoritative when present, and a ghost owned by another
user still 404s, so the ownership/security model is unchanged. The new
parameter defaults to None, preserving behavior for all other callers.
2. SessionManager.delete_session() only removed the in-memory entry when a DB
row was found, so memory-only ghosts survived. It now drops the in-memory
copy regardless and reports success when either the DB row or the in-memory
entry was removed.
Added tests/test_session_ghost_delete.py covering both layers, including the
cross-owner 404, the unauthenticated 403, DB-row-wins precedence, and backward
compatibility when no manager is passed.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The 'add' action runs due_date through parse_due_for_user (natural
language like 'tomorrow at 9am', plus user-tz anchoring for naive ISO),
but 'update' stored the raw value verbatim. A reminder edited with
natural language was saved as an unparseable literal the frontend's
new Date() can't read, so it never fired. Route update's due_date
through the same parser as add.
analyze_topics() iterates session_manager.sessions and reads
session_data.get("history", []) directly. But SessionManager.load_sessions
seeds sessions metadata-only with empty history — messages are loaded
lazily, only when get_session(session_id) is called. So analyze_topics saw
empty history for every session that hadn't been individually opened this
process lifetime and reported total_topics: 0, even when the database held
plenty of matching messages.
Hydrate each candidate session via session_manager.get_session(session_id)
(the existing lazy-load path) before reading its history, after the
owner/archived filters so skipped sessions aren't loaded. Falls back to the
raw cached history when the manager has no get_session (test stubs).
tests/test_topic_analyzer.py: new test_topic_analyzer_hydrates_sessions
seeds a real SQLite DB with a session + message, runs the real
SessionManager (asserting cached history starts empty), then asserts
analyze_topics finds the topic. Fails before this change. The existing
keyword tests now pass an explicit owner to satisfy the owner-required
early return.
asyncio.wait_for wrapped create_subprocess_exec, which returns as soon
as the child is spawned, so the timeout never bounded the actual work.
yt-dlp could hang indefinitely on proc.communicate() and the
except asyncio.TimeoutError branch was unreachable. Bind the wait to
communicate() and kill/reap the child if it overruns.
After a non-native tool round, the agent appends tool results as a {role:
'user'} message next to the user's original 'user' prompt, producing two
consecutive 'user' messages. Strict provider APIs (Anthropic/Claude) reject
consecutive same-role messages, so the follow-up generation request fails
silently — search returns sources, then nothing is generated.
_sanitize_llm_messages now merges consecutive 'user' messages (joining their
content). Only user/user is merged; normal chat and agent/tool turns already
alternate and are untouched.
Scoped down per maintainer review: the agent_loop 'output' source-extraction
change is already on main (#898/#901) and the broad-mocking web-sources test
was dropped. Added a focused test that runs consecutive-user messages through
the real _build_anthropic_payload and asserts the payload alternates correctly.
TTSService._put_cache writes .mp3 for MP3 audio (ID3/MPEG-framed bytes) and
.wav otherwise, and the rest of the class treats both as cache entries
(_get_cache iterates (".mp3", ".wav"); eviction globs "*.*"). But
get_stats() enumerated the cache with `glob("*.wav")` only, so both
cache_entries and cache_size_mb undercounted — reporting 0 whenever the
cache held MP3 files, which is the common case for most TTS providers.
Glob both extensions so the reported stats match what's actually cached.
tests/test_tts_cache_stats.py writes an MP3-headed blob via _put_cache and
asserts get_stats() reports one entry with non-zero size. Fails before this
change.