Adding GigaChat (Sber) or an on-premise enterprise LLM gateway as a
model endpoint fails on first probe with
CERTIFICATE_VERIFY_FAILED: self-signed certificate in certificate
chain (_ssl.c:1000)
because their TLS chain is signed by a private root CA (Russian Trusted
Root CA for GigaChat; corporate CA for on-prem) that isn't part of the
default system / certifi trust store. The endpoint shows offline in
the picker even though the URL and API key are correct (issue #722).
The right fix is to extend the trust store, not to weaken verification.
This change:
- src/tls_overrides.py: new module that resolves an opt-in env var
LLM_CA_BUNDLE at import time, builds a shared SSLContext via
ssl.create_default_context() (so the system / certifi bundle is
loaded first) and layers the operator's PEM on top with
load_verify_locations(). Exposes llm_verify() returning a value
suitable for httpx `verify=`. Defaults to True (httpx built-in
trust) when the env var is unset, when the file is missing, or
when the PEM fails to load — verification is never silently
disabled, the warning is logged and we fall back to the safe path.
- src/llm_core.py: thread llm_verify() into the shared AsyncClient
used by stream_llm / streaming completions.
- routes/model_routes.py: thread llm_verify() into the five httpx.get
call sites in _probe_endpoint / _ping_endpoint so adding a
private-CA endpoint goes green on the very first probe and the
picker stops showing it offline.
- .env.example: document LLM_CA_BUNDLE with the GigaChat case as the
concrete example.
Deliberately NOT included: a verify=False knob (global or per-host).
Disabling verification exposes the affected endpoint to MITM, and the
operator-supplied bundle is the correct fix for legitimate private-CA
providers — so the only switch in this PR is the safe one.
Closes#722.
Scan port 1234 and any custom port from LM_STUDIO_URL, add the LM_STUDIO_URL host to the discovery sweep alongside the Ollama env vars, and tag each discovered endpoint with its provider by fingerprinting the native /api/v1/models response (entries carrying key + architecture). Documents LM_STUDIO_URL in .env.example.
Document safer defaults and deployment guidance for network-accessible
Odysseus installs. The guidance emphasizes keeping auth enabled,
disabling localhost bypass outside development, using secure cookies for
HTTPS/reverse-proxy deployments, and exposing only the authenticated
Odysseus entrypoint through a trusted proxy or private access layer.
Also clarify that bundled services, databases, vector stores,
notification services, and raw model/provider APIs should remain
internal-only.
This is documentation and config-example only. It does not change
runtime behavior.
Two bugs caused GPU inference to silently fall back to CPU inside the
Odysseus Docker container even when the GPU was correctly passed through.
## entrypoint.sh — CUDA_HOME detection only covered CUDA 13.x wheels
The nvcc glob only searched
vidia/cu13, which matches the
vidia-nvcc-cu13 pip wheel layout. CUDA 12.x wheels install nvcc to
vidia/cuda_nvcc/bin/nvcc (nvidia-cuda-nvcc-cu12) or
vidia/cu12
(nvidia-nvcc-cu12) — completely different paths. The glob found nothing,
so CUDA_HOME was never set.
Worse, VLLM_USE_FLASHINFER_SAMPLER=0 was inside the same if-block, so it
was never set either. vLLM then tried to JIT-compile the FlashInfer
sampler at startup, failed with 'Could not find nvcc', and crashed — even
though the GPU was fully visible to the container.
Fix: expand the search to also check nvidia/cu12 and nvidia/cuda_nvcc.
Move VLLM_USE_FLASHINFER_SAMPLER=0 to an unconditional export after the
loop (it is sampler-only, no impact on the attention path, and the correct
setting for any container where CUDA headers may be incomplete).
## cookbook_routes.py — llama.cpp Linux source build silently fell back to CPU
The cmake invocation was:
cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build
2>/dev/null suppressed all configure errors. When nvcc is absent (the
slim base image has no CUDA toolkit — intentional), cmake fails silently,
then the || fallback re-runs without -DGGML_CUDA=ON. A CPU-only binary is
produced with no warning. Additionally, a stale CMakeCache.txt from the
failed CUDA attempt was reused (no rm -rf build), poisoning the next
configure run. The macOS branch already did rm -rf build for exactly this
reason; the Linux branch did not.
Fix: before cmake, detect pip-installed nvcc across the same three path
patterns as entrypoint.sh and expose it via CUDA_HOME/PATH. If nvcc is
found, run a clean CUDA build with full error visibility. If not, fall
back to a CPU build with an explicit warning telling the user how to get
a GPU build (install vLLM via Cookbook -> Dependencies, which brings the
CUDA wheels including nvcc, then re-launch).
## .env.example — document Windows COMPOSE_FILE separator
Added a comment showing the semicolon separator required on Windows
Docker Desktop alongside the existing colon-separator (Linux) example.
Opt-in overlays under docker/ that pass the host GPU into the odysseus
container. Pick one in .env:
COMPOSE_FILE=docker-compose.yml:docker/gpu.nvidia.yml
COMPOSE_FILE=docker-compose.yml:docker/gpu.amd.yml
Non-GPU users are unaffected (no default merge). README now points at
the overlays instead of the old ad-hoc `gpus: all` suggestion.
Each overlay header notes that it only exposes the GPU devices — the
slim image still needs vLLM / llama-cpp-python / etc. installed via
Cookbook -> Dependencies before models can serve on GPU.
Tested on Arch + Docker 29.5.1 + RTX 4090:
docker compose exec odysseus nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-...)
Cookbook hardware scan reports the 24 GB GPU and recommends GPU-fit
models. `docker compose config` validates cleanly for all three
COMPOSE_FILE variants (base, +nvidia, +amd).
Builds on the structure proposed in #91 by @krllus with the path /
docs fixes from the review on that PR.
Closes#163.
Co-authored-by: krllus <krllus@users.noreply.github.com>