fix: CUDA/GPU detection for vLLM and llama.cpp in Docker (#479)
Two bugs caused GPU inference to silently fall back to CPU inside the Odysseus Docker container even when the GPU was correctly passed through.

## entrypoint.sh: CUDA_HOME detection only covered CUDA 13.x wheels

The nvcc glob only searched nvidia/cu13, which matches the nvidia-nvcc-cu13 pip wheel layout. CUDA 12.x wheels install nvcc to nvidia/cuda_nvcc/bin/nvcc (nvidia-cuda-nvcc-cu12) or nvidia/cu12 (nvidia-nvcc-cu12), which are completely different paths. The glob found nothing, so CUDA_HOME was never set.

Worse, VLLM_USE_FLASHINFER_SAMPLER=0 was inside the same if-block, so it was never set either. vLLM then tried to JIT-compile the FlashInfer sampler at startup, failed with 'Could not find nvcc', and crashed, even though the GPU was fully visible to the container.

Fix: expand the search to also check nvidia/cu12 and nvidia/cuda_nvcc. Move VLLM_USE_FLASHINFER_SAMPLER=0 to an unconditional export after the loop (it is sampler-only, has no impact on the attention path, and is the correct setting for any container where CUDA headers may be incomplete).

## cookbook_routes.py: llama.cpp Linux source build silently fell back to CPU

The cmake invocation was:

    cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build

The 2>/dev/null suppressed all configure errors. When nvcc is absent (the slim base image intentionally has no CUDA toolkit), cmake fails silently, then the || fallback re-runs without -DGGML_CUDA=ON. A CPU-only binary is produced with no warning.

Additionally, a stale CMakeCache.txt from the failed CUDA attempt was reused (no rm -rf build), poisoning the next configure run. The macOS branch already did rm -rf build for exactly this reason; the Linux branch did not.

Fix: before cmake, detect pip-installed nvcc across the same three path patterns as entrypoint.sh and expose it via CUDA_HOME/PATH. If nvcc is found, run a clean CUDA build with full error visibility. If not, fall back to a CPU build with an explicit warning telling the user how to get a GPU build (install vLLM via Cookbook -> Dependencies, which brings the CUDA wheels including nvcc, then re-launch).

## .env.example: document Windows COMPOSE_FILE separator

Added a comment showing the semicolon separator required on Windows Docker Desktop alongside the existing colon-separator (Linux) example.
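
The cookbook_routes.py change is described above but its diff is not shown on this page. As a rough illustration of the described flow (detect nvcc, wipe the build directory, then configure with or without CUDA), here is a minimal shell sketch. The function name `configure_llama_cpp` and the `SITE_PACKAGES_GLOB` and `CMAKE` variables are hypothetical and exist only for this example; the real code may be structured quite differently.

```shell
#!/bin/sh
# Sketch only. SITE_PACKAGES_GLOB and CMAKE are hypothetical knobs added for
# illustration; the real logic lives in cookbook_routes.py and is not shown here.
SITE_PACKAGES_GLOB="${SITE_PACKAGES_GLOB:-/app/.local/lib/python*/site-packages}"
CMAKE="${CMAKE:-cmake}"

configure_llama_cpp() {
    nvcc_home=""
    # The variable is left unquoted on purpose so the python* glob expands.
    for cu in \
        $SITE_PACKAGES_GLOB/nvidia/cu13 \
        $SITE_PACKAGES_GLOB/nvidia/cu12 \
        $SITE_PACKAGES_GLOB/nvidia/cuda_nvcc; do
        if [ -x "$cu/bin/nvcc" ]; then
            nvcc_home="$cu"
            break
        fi
    done

    # Always start from a clean tree so a stale CMakeCache.txt cannot leak in.
    rm -rf build

    if [ -n "$nvcc_home" ]; then
        export CUDA_HOME="$nvcc_home"
        export PATH="$nvcc_home/bin:$PATH"
        # No 2>/dev/null and no || fallback: a failed CUDA configure stays visible.
        "$CMAKE" -B build -DGGML_CUDA=ON
    else
        echo "WARNING: nvcc not found; configuring a CPU-only llama.cpp build." >&2
        echo "Install vLLM via Cookbook -> Dependencies, then re-launch." >&2
        "$CMAKE" -B build
    fi
}
```

Under these assumptions, `configure_llama_cpp` would be called from the llama.cpp source root and followed by `cmake --build build`. The key design point is that the CPU path is chosen explicitly before cmake runs, rather than being reached as a silent fallback after a failed CUDA configure.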
@@ -56,13 +56,25 @@ done
 # Auto-set CUDA_HOME if a pip-installed nvcc is present, and disable the
 # FlashInfer JIT sampler — sampler only, no impact on attention path.
 # No-op when vllm isn't installed.
-for cu in /app/.local/lib/python*/site-packages/nvidia/cu13; do
+#
+# Checked layouts (all are real pip-wheel install paths):
+#   nvidia/cu13       — nvidia-nvcc-cu13 (CUDA 13.x wheel style)
+#   nvidia/cu12       — nvidia-nvcc-cu12 (CUDA 12.x wheel style)
+#   nvidia/cuda_nvcc  — nvidia-cuda-nvcc-cu12 (older cu12 sub-package style)
+for cu in \
+    /app/.local/lib/python*/site-packages/nvidia/cu13 \
+    /app/.local/lib/python*/site-packages/nvidia/cu12 \
+    /app/.local/lib/python*/site-packages/nvidia/cuda_nvcc; do
     if [ -x "$cu/bin/nvcc" ]; then
         export CUDA_HOME="$cu"
-        export VLLM_USE_FLASHINFER_SAMPLER="${VLLM_USE_FLASHINFER_SAMPLER:-0}"
         break
     fi
 done
+# Disable the FlashInfer JIT sampler unconditionally — it is sampler-only
+# and has no impact on the attention path, but requires nvcc + matching
+# CUDA headers at startup. Without this, vLLM crashes with "Could not find
+# nvcc" even when the GPU itself is fully visible to the container.
+export VLLM_USE_FLASHINFER_SAMPLER="${VLLM_USE_FLASHINFER_SAMPLER:-0}"

 # Drop root and run the actual app. `gosu` is preferred over `su` /
 # `sudo` because it cleans up the process tree (no extra shell layer)
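
The patched detection loop can be exercised outside the container with a small harness. The sketch below wraps the same three-layout search in a function and runs it against a throwaway directory that mimics the nvidia-cuda-nvcc-cu12 layout. The function name `detect_cuda_home` and its `root` argument are hypothetical, introduced here for illustration; entrypoint.sh hard-codes the /app/.local/lib/python*/site-packages prefix instead.

```shell
#!/bin/sh
# Sketch only: detect_cuda_home and its root argument are hypothetical names,
# not part of entrypoint.sh.
detect_cuda_home() {
    root="$1"
    for cu in \
        "$root"/nvidia/cu13 \
        "$root"/nvidia/cu12 \
        "$root"/nvidia/cuda_nvcc; do
        if [ -x "$cu/bin/nvcc" ]; then
            printf '%s\n' "$cu"
            return 0
        fi
    done
    return 1
}

# Build a throwaway tree that mimics the nvidia-cuda-nvcc-cu12 wheel layout.
tmp="$(mktemp -d)"
mkdir -p "$tmp/nvidia/cuda_nvcc/bin"
printf '#!/bin/sh\n' > "$tmp/nvidia/cuda_nvcc/bin/nvcc"
chmod +x "$tmp/nvidia/cuda_nvcc/bin/nvcc"

# Only set CUDA_HOME when one of the layouts actually contains nvcc.
if found="$(detect_cuda_home "$tmp")"; then
    export CUDA_HOME="$found"
fi

# Same default-if-unset form as the patch: a value already exported by the
# user is kept, otherwise the sampler is disabled.
export VLLM_USE_FLASHINFER_SAMPLER="${VLLM_USE_FLASHINFER_SAMPLER:-0}"

echo "CUDA_HOME=$CUDA_HOME"
echo "VLLM_USE_FLASHINFER_SAMPLER=$VLLM_USE_FLASHINFER_SAMPLER"
rm -rf "$tmp"
```

Because the export sits outside the loop, the sampler default is applied whether or not nvcc is found, which is the behavioural change this commit makes.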