Support extra CA bundle for private-CA LLM providers (#769)

Adding GigaChat (Sber) or an on-premise enterprise LLM gateway as a model endpoint fails on the first probe with CERTIFICATE_VERIFY_FAILED: self-signed certificate in certificate chain (_ssl.c:1000). Their TLS chain is signed by a private root CA (Russian Trusted Root CA for GigaChat; a corporate CA for on-prem) that isn't part of the default system / certifi trust store, so the endpoint shows offline in the picker even though the URL and API key are correct (issue #722).

The right fix is to extend the trust store, not to weaken verification. This change:

- src/tls_overrides.py: new module that resolves an opt-in env var LLM_CA_BUNDLE at import time, builds a shared SSLContext via ssl.create_default_context() (so the platform's default trust store is loaded first) and layers the operator's PEM on top with load_verify_locations(). Exposes llm_verify(), which returns a value suitable for httpx `verify=`. It defaults to True (httpx built-in trust) when the env var is unset, when the file is missing, or when the PEM fails to load. Verification is never silently disabled: in the failure cases a warning is logged and we fall back to the safe path.
- src/llm_core.py: thread llm_verify() into the shared AsyncClient used by stream_llm / streaming completions.
- routes/model_routes.py: thread llm_verify() into the five httpx.get call sites in _probe_endpoint / _ping_endpoint, so a newly added private-CA endpoint goes green on the very first probe and the picker stops showing it offline.
- .env.example: document LLM_CA_BUNDLE with the GigaChat case as the concrete example.

Deliberately NOT included: a verify=False knob (global or per-host). Disabling verification exposes the affected endpoint to MITM, and the operator-supplied bundle is the correct fix for legitimate private-CA providers, so the only switch in this PR is the safe one.

Closes #722.
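
The fallback contract described for llm_verify() can be sketched in isolation. This is a minimal illustration, not the module's code: `resolve_verify` is a hypothetical helper that takes the bundle path as an argument instead of reading LLM_CA_BUNDLE at import time, which makes the fallback cases easy to exercise. The logger name is also an assumption.

```python
import logging
import os
import ssl
from typing import Optional, Union

logger = logging.getLogger("tls_sketch")  # hypothetical logger name


def resolve_verify(bundle_path: Optional[str]) -> Union[ssl.SSLContext, bool]:
    """Hypothetical helper: return a value usable as httpx's `verify=`.

    Never returns False. Every failure path falls back to True, which
    leaves certificate verification fully enabled.
    """
    path = (bundle_path or "").strip()
    if not path:
        return True  # opt-in variable unset: keep the default trust
    if not os.path.isfile(path):
        logger.warning("CA bundle %r does not exist; using default trust.", path)
        return True
    ctx = ssl.create_default_context()  # default trust store is loaded first
    try:
        ctx.load_verify_locations(cafile=path)  # extra CA(s) layered on top
    except (ssl.SSLError, OSError) as exc:
        logger.warning(
            "CA bundle %r failed to load (%s); using default trust.", path, exc
        )
        return True
    return ctx
```

The point of the shape is that there is no branch that can yield `False`: a misconfigured bundle costs the operator a warning and a still-failing private-CA endpoint, never an unverified connection.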
2026-06-04 17:48:50 +05:30
parent f876fc7704
commit f59edee611
5 changed files with 260 additions and 6 deletions
--- a/src/llm_core.py
+++ b/src/llm_core.py
@@ -129,7 +129,10 @@ def _get_http_client() -> httpx.AsyncClient:
    """Return process-wide AsyncClient. Per-request timeout is passed at call time."""
    global _http_client
    if _http_client is None or _http_client.is_closed:
-        _http_client = httpx.AsyncClient(limits=_http_limits, http2=False)
+        from src.tls_overrides import llm_verify
+        _http_client = httpx.AsyncClient(
+            limits=_http_limits, http2=False, verify=llm_verify(),
+        )
    return _http_client

 def _get_cached_response(cache_key: str) -> Optional[str]:
--- a/src/tls_overrides.py
+++ b/src/tls_overrides.py
@@ -0,0 +1,91 @@
+"""Extended TLS trust store for private-CA LLM providers.
+
+Some upstream LLM providers serve their API over TLS certificates that are
+signed by a private root CA which is not part of the standard system bundle:
+
+  - GigaChat (Sber) uses the Russian Trusted Root CA, not bundled with
+    OpenSSL / certifi / system trust on most non-Russian installs. The
+    chain looks self-signed to Python and the endpoint is marked offline
+    with `CERTIFICATE_VERIFY_FAILED: self-signed certificate in
+    certificate chain` (see issue #722).
+  - On-premise enterprise LLM gateways often present a corporate CA that
+    has not been imported into the runtime's trust store.
+
+Operators point `LLM_CA_BUNDLE` at a PEM file containing the extra CA
+cert(s). The default system / certifi trust store is loaded first, then
+the operator's PEM is layered on top, so verification still happens —
+the trust set just gets larger. We deliberately do not provide a
+"verify=off" knob: weakening verification globally (or per-host) would
+expose those endpoints to MITM, and the operator-supplied bundle is the
+correct fix for legitimate private-CA providers.
+
+Example (GigaChat):
+    # Sber publishes the chain at
+    # https://www.gosuslugi.ru/crt/rootca_ssl_rsa2022.cer
+    # Convert to PEM and point the env var at it.
+    LLM_CA_BUNDLE=/etc/odysseus/ca/russian-trusted-root.pem
+
+Scope:
+    `llm_verify()` is intentionally consumed by only two modules: the
+    shared async client in `src/llm_core.py` and the endpoint probes in
+    `routes/model_routes.py`. Both reach LLM provider URLs. The override
+    is NOT threaded into web_fetch, search providers, gallery downloads,
+    embeddings, webhook delivery, or anything else that hits arbitrary
+    URLs, and it does NOT affect the app's own browser-facing TLS. That
+    boundary is pinned by `tests/test_tls_overrides_scope.py` — extending
+    it requires updating the allowlist there with a written justification.
+"""
+
+import logging
+import os
+import ssl
+from typing import Optional
+
+logger = logging.getLogger(__name__)
+
+
+_extra_bundle_path: Optional[str] = (os.environ.get("LLM_CA_BUNDLE") or "").strip() or None
+
+
+def _build_ssl_context() -> Optional[ssl.SSLContext]:
+    """Build an SSLContext that uses the default trust store and ALSO trusts
+    the operator-supplied PEM bundle. Returns None when no extra bundle is
+    configured, so callers fall through to httpx's default verify=True."""
+    if not _extra_bundle_path:
+        return None
+    if not os.path.isfile(_extra_bundle_path):
+        logger.warning(
+            "LLM_CA_BUNDLE points at %r but the file does not exist; "
+            "falling back to the default trust store.",
+            _extra_bundle_path,
+        )
+        return None
+    ctx = ssl.create_default_context()
+    try:
+        ctx.load_verify_locations(cafile=_extra_bundle_path)
+    except (ssl.SSLError, OSError) as e:
+        logger.warning(
+            "LLM_CA_BUNDLE=%r failed to load (%s); falling back to the "
+            "default trust store.",
+            _extra_bundle_path, e,
+        )
+        return None
+    logger.info(
+        "Loaded extra CA bundle %r on top of the default trust store.",
+        _extra_bundle_path,
+    )
+    return ctx
+
+
+# Resolved once at import time. The httpx clients in src/llm_core.py are
+# long-lived (process-wide), so editing LLM_CA_BUNDLE requires a restart —
+# matching the existing semantics of LLM_HOST, SEARXNG_INSTANCE, etc.
+_SHARED_SSL_CONTEXT: Optional[ssl.SSLContext] = _build_ssl_context()
+
+
+def llm_verify():
+    """Return the value to pass as `verify=` on httpx.get / httpx.Client /
+    httpx.AsyncClient. Returns the extended-trust SSLContext when
+    LLM_CA_BUNDLE is set and loaded; otherwise True (httpx default — system
+    / certifi bundle, verification fully on)."""
+    return _SHARED_SSL_CONTEXT if _SHARED_SSL_CONTEXT is not None else True
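
The module docstring asks operators to convert the published certificate to PEM before pointing LLM_CA_BUNDLE at it. One way to do that with only the standard library is sketched below. The helper names `to_pem` and `convert_file` are hypothetical, and the sketch assumes the input file holds a single certificate in either DER or PEM form; the source does not say which encoding the published `.cer` uses. The code re-encodes bytes and does not validate that they form a real certificate.

```python
import ssl
from pathlib import Path

PEM_HEADER = b"-----BEGIN CERTIFICATE-----"


def to_pem(cert_bytes: bytes) -> str:
    """Return PEM text for one certificate given DER or PEM input."""
    if cert_bytes.lstrip().startswith(PEM_HEADER):
        return cert_bytes.decode("ascii")  # already PEM: pass through unchanged
    return ssl.DER_cert_to_PEM_cert(cert_bytes)  # base64-wrap the DER bytes


def convert_file(src: str, dst: str) -> None:
    """Read a .cer/.crt file and write a PEM file (hypothetical helper)."""
    Path(dst).write_text(to_pem(Path(src).read_bytes()), encoding="ascii")
```

A call such as `convert_file("rootca_ssl_rsa2022.cer", "/etc/odysseus/ca/russian-trusted-root.pem")` would then produce the file referenced in the docstring example. Whether the result actually loads is best confirmed from the "Loaded extra CA bundle" log line at startup.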