fix: CUDA/GPU detection for vLLM and llama.cpp in Docker (#479)

Two bugs caused GPU inference to silently fall back to CPU inside the
Odysseus Docker container even when the GPU was correctly passed through.

## entrypoint.sh — CUDA_HOME detection only covered CUDA 13.x wheels

The nvcc glob only searched
vidia/cu13, which matches the

vidia-nvcc-cu13 pip wheel layout. CUDA 12.x wheels install nvcc to

vidia/cuda_nvcc/bin/nvcc (nvidia-cuda-nvcc-cu12) or
vidia/cu12
(nvidia-nvcc-cu12) — completely different paths. The glob found nothing,
so CUDA_HOME was never set.

Worse, VLLM_USE_FLASHINFER_SAMPLER=0 was inside the same if-block, so it
was never set either. vLLM then tried to JIT-compile the FlashInfer
sampler at startup, failed with 'Could not find nvcc', and crashed — even
though the GPU was fully visible to the container.

Fix: expand the search to also check nvidia/cu12 and nvidia/cuda_nvcc.
Move VLLM_USE_FLASHINFER_SAMPLER=0 to an unconditional export after the
loop (it is sampler-only, no impact on the attention path, and the correct
setting for any container where CUDA headers may be incomplete).

## cookbook_routes.py — llama.cpp Linux source build silently fell back to CPU

The cmake invocation was:
  cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build

2>/dev/null suppressed all configure errors. When nvcc is absent (the
slim base image has no CUDA toolkit — intentional), cmake fails silently,
then the || fallback re-runs without -DGGML_CUDA=ON. A CPU-only binary is
produced with no warning. Additionally, a stale CMakeCache.txt from the
failed CUDA attempt was reused (no rm -rf build), poisoning the next
configure run. The macOS branch already did rm -rf build for exactly this
reason; the Linux branch did not.

Fix: before cmake, detect pip-installed nvcc across the same three path
patterns as entrypoint.sh and expose it via CUDA_HOME/PATH. If nvcc is
found, run a clean CUDA build with full error visibility. If not, fall
back to a CPU build with an explicit warning telling the user how to get
a GPU build (install vLLM via Cookbook -> Dependencies, which brings the
CUDA wheels including nvcc, then re-launch).

## .env.example — document Windows COMPOSE_FILE separator

Added a comment showing the semicolon separator required on Windows
Docker Desktop alongside the existing colon-separator (Linux) example.
This commit is contained in:
Carlos Arroyo
2026-06-01 15:30:51 +02:00
committed by GitHub
parent 3c6b084f08
commit 00320972dc
3 changed files with 42 additions and 5 deletions

View File

@@ -137,6 +137,7 @@ SEARXNG_INSTANCE=http://localhost:8080
# NVIDIA (requires nvidia-container-toolkit + `nvidia-ctk runtime
# configure --runtime=docker` on the host):
# COMPOSE_FILE=docker-compose.yml:docker/gpu.nvidia.yml
# COMPOSE_FILE=docker-compose.yml;docker/gpu.nvidia.yml #(Windows)
#
# AMD ROCm (requires ROCm drivers on the host):
# COMPOSE_FILE=docker-compose.yml:docker/gpu.amd.yml

View File

@@ -56,13 +56,25 @@ done
# Auto-set CUDA_HOME if a pip-installed nvcc is present, and disable the
# FlashInfer JIT sampler — sampler only, no impact on attention path.
# No-op when vllm isn't installed.
for cu in /app/.local/lib/python*/site-packages/nvidia/cu13; do
#
# Checked layouts (all are real pip-wheel install paths):
# nvidia/cu13 — nvidia-nvcc-cu13 (CUDA 13.x wheel style)
# nvidia/cu12 — nvidia-nvcc-cu12 (CUDA 12.x wheel style)
# nvidia/cuda_nvcc — nvidia-cuda-nvcc-cu12 (older cu12 sub-package style)
for cu in \
/app/.local/lib/python*/site-packages/nvidia/cu13 \
/app/.local/lib/python*/site-packages/nvidia/cu12 \
/app/.local/lib/python*/site-packages/nvidia/cuda_nvcc; do
if [ -x "$cu/bin/nvcc" ]; then
export CUDA_HOME="$cu"
export VLLM_USE_FLASHINFER_SAMPLER="${VLLM_USE_FLASHINFER_SAMPLER:-0}"
break
fi
done
# Disable the FlashInfer JIT sampler unconditionally — it is sampler-only
# and has no impact on the attention path, but requires nvcc + matching
# CUDA headers at startup. Without this, vLLM crashes with "Could not find
# nvcc" even when the GPU itself is fully visible to the container.
export VLLM_USE_FLASHINFER_SAMPLER="${VLLM_USE_FLASHINFER_SAMPLER:-0}"
# Drop root and run the actual app. `gosu` is preferred over `su` /
# `sudo` because it cleans up the process tree (no extra shell layer)

View File

@@ -1004,9 +1004,33 @@ def setup_cookbook_routes() -> APIRouter:
runner_lines.append(' && cmake --build build -j"$NPROC" --target llama-server \\')
runner_lines.append(' && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
runner_lines.append(' else')
runner_lines.append(' cd ~/llama.cpp && { cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build; } \\')
runner_lines.append(' && cmake --build build -j"$NPROC" --target llama-server \\')
runner_lines.append(' && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
# Detect pip-installed nvcc (from vLLM/nvidia CUDA wheels) and put
# it on PATH so cmake's CUDA configure can find it. We check the
# same three layouts as entrypoint.sh:
# nvidia/cu13 — nvidia-nvcc-cu13
# nvidia/cu12 — nvidia-nvcc-cu12
# nvidia/cuda_nvcc — nvidia-cuda-nvcc-cu12 (sub-package style)
runner_lines.append(' for _cudir in ~/.local/lib/python*/site-packages/nvidia/cu13 ~/.local/lib/python*/site-packages/nvidia/cu12 ~/.local/lib/python*/site-packages/nvidia/cuda_nvcc; do')
runner_lines.append(' [ -x "$_cudir/bin/nvcc" ] && export CUDA_HOME="$_cudir" && export PATH="$_cudir/bin:$PATH" && break')
runner_lines.append(' done')
# rm -rf build so a prior poisoned CMakeCache.txt (e.g. from a
# failed CUDA attempt) doesn't cause the next configure to reuse
# stale settings and silently produce a CPU-only binary.
runner_lines.append(' cd ~/llama.cpp && rm -rf build')
runner_lines.append(' if command -v nvcc &>/dev/null; then')
runner_lines.append(' echo "[odysseus] CUDA nvcc found — building llama-server with CUDA (GPU) support..."')
runner_lines.append(' cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \\')
runner_lines.append(' && cmake --build build -j"$NPROC" --target llama-server \\')
runner_lines.append(' && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
runner_lines.append(' else')
runner_lines.append(' echo "[odysseus] WARNING: nvcc not found — building llama-server for CPU only."')
runner_lines.append(' echo "[odysseus] GPU inference will not be available for this llama.cpp build."')
runner_lines.append(' echo "[odysseus] To get a GPU build, first install vLLM via Cookbook -> Dependencies"')
runner_lines.append(' echo "[odysseus] (its CUDA wheels include nvcc), then re-launch this serve task."')
runner_lines.append(' cmake -B build -DCMAKE_BUILD_TYPE=Release \\')
runner_lines.append(' && cmake --build build -j"$NPROC" --target llama-server \\')
runner_lines.append(' && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
runner_lines.append(' fi')
runner_lines.append(' fi')
runner_lines.append(' # If the native build failed, fall back to the Python bindings.')
runner_lines.append(' if ! command -v llama-server &>/dev/null && ! python3 -c "import llama_cpp" 2>/dev/null; then')