odysseus/tests/test_hwfit_unified_nvidia.py at dev

Shaw b54468291e fix(hwfit): detect unified-memory NVIDIA (Grace Blackwell GB10 / DGX Spark) instead of 'No GPU' (#1340) (#1372)

_detect_nvidia parsed nvidia-smi --query-gpu=memory.total,name and called
float() on memory.total for each row, dropping the row on ValueError. Grace
Blackwell GB10 (DGX Spark, sm_121) reports memory.total as '[N/A]'/'Not Supported'
because the GPU shares the system LPDDR pool rather than carrying discrete VRAM,
so the only GPU row was dropped and a real GB10 (even with vLLM running on it)
was reported as 'No GPU', breaking Cookbook recommendations and model switching.
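
The failure mode described above can be reproduced with a minimal sketch. This is not the project's actual _detect_nvidia; the function name parse_discrete_only, the returned keys, and the sample rows (assumed to be in a CSV layout without header or units) are hypothetical stand-ins chosen for illustration.

```python
def parse_discrete_only(smi_output: str) -> list[dict]:
    """Hypothetical reconstruction of the old behaviour.

    Each row is expected to look like "<memory.total>, <name>". Rows whose
    memory field is not numeric are silently skipped.
    """
    gpus = []
    for line in smi_output.strip().splitlines():
        mem, _, name = line.partition(",")
        try:
            vram_mib = float(mem.strip())
        except ValueError:
            # '[N/A]' or 'Not Supported' lands here, so the row is lost.
            continue
        gpus.append({"name": name.strip(), "vram_mib": vram_mib})
    return gpus


# Assumed sample rows; real nvidia-smi output may differ in detail.
DISCRETE_ROW = "24564, NVIDIA GeForce RTX 4090\n"
GB10_ROW = "[N/A], NVIDIA GB10\n"

print(parse_discrete_only(DISCRETE_ROW))
print(parse_discrete_only(GB10_ROW))  # empty list: the only GPU was dropped
```

With the GB10 row the result is an empty list, which is what a caller would then interpret as 'No GPU'.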

Keep a named device whose memory.total is non-numeric: when there are no
discrete-VRAM rows but such unified devices exist, report a unified-memory CUDA
GPU backed by the system RAM pool (has_gpu, name, backend=cuda, count,
unified_memory=True) — mirroring how Apple Silicon and AMD APUs are already
handled. Discrete GPUs are unchanged, and a box with a real discrete GPU keeps
the discrete path.
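
A rough sketch of the rule described above, under stated assumptions: the function name detect_nvidia_from_text and the vram_mib key are hypothetical, and only the keys has_gpu, name, backend, count and unified_memory come from the commit message. The real implementation may structure this differently.

```python
def detect_nvidia_from_text(smi_output: str) -> dict:
    """Classify rows of "<memory.total>, <name>" into discrete or unified."""
    discrete = []  # (name, vram_mib) for rows with numeric memory.total
    unified = []   # names of devices whose memory.total is non-numeric
    for line in smi_output.strip().splitlines():
        mem, sep, name = line.partition(",")
        name = name.strip()
        if not sep or not name:
            continue  # an unnamed or malformed row is still ignored
        try:
            discrete.append((name, float(mem.strip())))
        except ValueError:
            unified.append(name)

    if discrete:
        # A real discrete GPU takes precedence over any N/A-memory row.
        return {
            "has_gpu": True,
            "name": discrete[0][0],
            "backend": "cuda",
            "count": len(discrete),
            "vram_mib": discrete[0][1],  # hypothetical key, for illustration
            "unified_memory": False,
        }
    if unified:
        # No discrete VRAM rows: report a CUDA GPU backed by system RAM.
        return {
            "has_gpu": True,
            "name": unified[0],
            "backend": "cuda",
            "count": len(unified),
            "unified_memory": True,
        }
    return {"has_gpu": False}


print(detect_nvidia_from_text("[N/A], NVIDIA GB10\n"))
```

Checking the discrete list first is what keeps existing discrete-GPU machines on their current path; the unified branch is only reached when every named row had a non-numeric memory.total.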

Adds tests/test_hwfit_unified_nvidia.py with a GB10 nvidia-smi fixture: the
device is detected (not dropped), surfaces through detect_system with
unified_memory propagated, discrete GPUs stay non-unified, and a discrete GPU
takes precedence over an N/A-memory row.
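
One way such a fixture-driven test might look, as a sketch only: the helper query_gpu_rows, the fixture string, and the test name are hypothetical and are not taken from tests/test_hwfit_unified_nvidia.py. It patches subprocess.run so no NVIDIA hardware or driver is needed, and it assumes the CSV output flags shown in the command.

```python
import subprocess
from types import SimpleNamespace
from unittest import mock

# Assumed shape of a GB10 row; real output may differ in detail.
GB10_FIXTURE = "[N/A], NVIDIA GB10\n"


def query_gpu_rows() -> list[tuple[str, str]]:
    """Hypothetical helper: return (memory.total, name) pairs as raw strings."""
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,name",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    rows = []
    for line in proc.stdout.strip().splitlines():
        mem, _, name = line.partition(",")
        rows.append((mem.strip(), name.strip()))
    return rows


def test_gb10_row_is_not_dropped():
    # Stand-in for the CompletedProcess that subprocess.run would return.
    fake = SimpleNamespace(stdout=GB10_FIXTURE, returncode=0)
    with mock.patch("subprocess.run", return_value=fake):
        assert query_gpu_rows() == [("[N/A]", "NVIDIA GB10")]


test_gb10_row_is_not_dropped()
```

Under pytest the final call is unnecessary, since the function would be collected by name; it is included here so the sketch also runs as a plain script.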

Co-authored-by: NubsCarson <nubs@nubs.site>

2026-06-03 03:19:39 +09:00

3.1 KiB
