Improve Docker GPU setup diagnostics (#705)

* Improve Docker GPU setup diagnostics Add a Docker GPU preflight script for NVIDIA users. The script is read-only by default, checks host NVIDIA drivers, Docker availability, and container GPU passthrough, and prints actionable next steps. Add explicit opt-in modes to print install commands, install NVIDIA Container Toolkit on Ubuntu/Debian, and enable the NVIDIA Compose overlay in .env after passthrough is verified. Document common NVIDIA Docker failure modes, ignore generated .env backups, and clarify that Cookbook can only detect GPUs exposed to the Odysseus container. * Clarify Docker GPU diagnostic limits
2026-06-02 04:30:40 +01:00
parent 517aa593e0
commit 8455b88643
5 changed files with 635 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -124,21 +124,65 @@ Odysseus SSH key and add the public key to the remote server's
 ssh-copy-id -i data/ssh/id_ed25519.pub user@server
 ```

-**NVIDIA / AMD Docker GPU overlays.** Install the host runtime first, then add
-one of these to `.env`:
+**NVIDIA Docker GPU overlay.** CPU-only users can skip this section.
+`scripts/check-docker-gpu.sh` diagnoses GPU passthrough and can optionally
+install the host runtime or update `.env`. Cookbook can only detect GPUs that
+Docker exposes to the container — if the host runtime is not configured,
+Cookbook sees the iGPU, another card, or CPU instead of your NVIDIA GPU.
+
+```bash
+# Read-only diagnostic (default — installs nothing, never edits .env):
+scripts/check-docker-gpu.sh
+
+# Print OS-specific install commands without running them:
+scripts/check-docker-gpu.sh --print-install-commands
+
+# Install NVIDIA Container Toolkit on Ubuntu/Debian (requires sudo):
+scripts/check-docker-gpu.sh --install-nvidia-toolkit
+
+# Write COMPOSE_FILE to .env (only when GPU passthrough is confirmed working):
+scripts/check-docker-gpu.sh --enable-nvidia-overlay
+
+# Full assisted setup — install toolkit, then enable overlay if passthrough works:
+scripts/check-docker-gpu.sh --install-nvidia-toolkit --enable-nvidia-overlay
+```
+
+Safety notes:
+- The app never installs host GPU runtime automatically.
+- The app never edits `.env` automatically.
+- `.env` is only modified when `--enable-nvidia-overlay` is explicitly passed,
+  and only after GPU passthrough succeeds. `--yes` skips prompts but does not
+  bypass the passthrough gate.
+- `.env.bak.*` backups created by `--enable-nvidia-overlay` are ignored by
+  Git and the Docker build context.
+
+To enable manually without the script, add this to `.env`:

 ```bash
 COMPOSE_FILE=docker-compose.yml:docker/gpu.nvidia.yml
+```
+
+**AMD / ROCm.** AMD GPU passthrough is not automated. Add manually:
+
+```bash
 COMPOSE_FILE=docker-compose.yml:docker/gpu.amd.yml
 ```

-Verify with:
+Verify after enabling either overlay:

 ```bash
-docker compose exec odysseus nvidia-smi -L
-docker compose exec odysseus rocm-smi
+docker compose exec odysseus nvidia-smi -L   # NVIDIA
+docker compose exec odysseus rocm-smi        # AMD
 ```

+> **GPU passthrough ≠ llama.cpp CUDA.** `nvidia-smi` passing inside the
+> container confirms Docker GPU access, but llama.cpp also needs `cudart` and
+> the CUDA Toolkit at runtime. If Cookbook logs show `Unable to find cudart
+> library`, `Could NOT find CUDAToolkit`, `CUDA Toolkit not found`, or
+> tensors/layers assigned to CPU, that is a Cookbook/llama.cpp build issue —
+> not a Docker passthrough failure. Re-install the serve engine via
+> **Cookbook → Dependencies** to get a CUDA-enabled build.
+
 **Ollama with Docker.** If Ollama runs on the host, add this endpoint in
 Settings: