fix(hwfit): detect unified-memory NVIDIA (Grace Blackwell GB10 / DGX Spark) instead of 'No GPU' (#1340) (#1372)

_detect_nvidia parsed the output of nvidia-smi --query-gpu=memory.total,name
and called float() on memory.total for each row, dropping the row on
ValueError. Grace Blackwell GB10 (DGX Spark, sm_121) reports memory.total as
'[N/A]' or 'Not Supported' because the GPU shares the system LPDDR pool rather
than carrying discrete VRAM. The only GPU row was therefore dropped, and a real
GB10 (even one with vLLM running on it) was reported as 'No GPU', which broke
Cookbook recommendations and model switching.
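
A minimal sketch of the failure mode described above, not the project's actual
code: parse_discrete_only is a hypothetical stand-in for the old parsing loop,
and both sample rows (including the device name "NVIDIA GB10") are assumed
fixtures rather than captured nvidia-smi output.

```python
def parse_discrete_only(smi_output: str) -> list:
    """Hypothetical stand-in for the old loop: keep only rows whose
    memory.total field parses as a float."""
    gpus = []
    for idx, line in enumerate(smi_output.strip().split("\n")):
        parts = [p.strip() for p in line.split(",", 1)]
        if len(parts) < 2:
            continue
        try:
            vram_mb = float(parts[0])
        except ValueError:
            # "[N/A]" and "Not Supported" land here, so the row vanishes.
            continue
        gpus.append({"index": idx, "name": parts[1], "vram_gb": vram_mb / 1024.0})
    return gpus


# Assumed fixtures: one unified-memory row, one discrete row.
GB10_ROW = "[N/A], NVIDIA GB10"
DISCRETE_ROW = "24576, NVIDIA GeForce RTX 4090"

print(parse_discrete_only(GB10_ROW))      # []
print(parse_discrete_only(DISCRETE_ROW))  # one entry with vram_gb == 24.0
```

With only the GB10 row present the list comes back empty, which is the
condition the caller previously treated as "No GPU".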

Keep any named device whose memory.total is non-numeric. When there are no
discrete-VRAM rows but such unified devices exist, report a unified-memory CUDA
GPU backed by the system RAM pool (has_gpu, name, backend=cuda, count,
unified_memory=True), mirroring how Apple Silicon and AMD APUs are already
handled. Discrete GPUs are unchanged, and a box with a real discrete GPU keeps
the discrete path.
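
The precedence rule above can be sketched roughly as follows. This is a
simplified illustration under stated assumptions, not the code in this commit:
classify_rows and its ram_gb parameter are hypothetical, the returned dict
carries only a subset of the real fields, and the sample rows are invented.

```python
def classify_rows(smi_output: str, ram_gb: float):
    """Hypothetical sketch: split rows into discrete and unified devices,
    then prefer the discrete path whenever any discrete GPU exists."""
    gpus, unified = [], []
    for idx, line in enumerate(smi_output.strip().split("\n")):
        parts = [p.strip() for p in line.split(",", 1)]
        if len(parts) < 2:
            continue
        try:
            vram_gb = float(parts[0]) / 1024.0
        except ValueError:
            # Non-numeric memory.total: remember the device if it has a name.
            if parts[1]:
                unified.append({"index": idx, "name": parts[1]})
            continue
        gpus.append({"index": idx, "name": parts[1], "vram_gb": vram_gb})

    if gpus:
        # A real discrete GPU wins; N/A-memory rows are ignored.
        return {"gpus": gpus, "unified_memory": False}
    if unified:
        # Shared pool: every unified device reports the system RAM total.
        shared = [dict(g, vram_gb=ram_gb) for g in unified]
        return {"gpus": shared, "unified_memory": True}
    return None


print(classify_rows("[N/A], NVIDIA GB10", ram_gb=128.0))
print(classify_rows("[N/A], NVIDIA GB10\n24576, NVIDIA GeForce RTX 4090", ram_gb=128.0))
```

The first call takes the unified path with vram_gb set to the RAM figure
passed in; the second keeps only the discrete GPU at index 1.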

Adds tests/test_hwfit_unified_nvidia.py with a GB10 nvidia-smi fixture: the
device is detected (not dropped), surfaces through detect_system with
unified_memory propagated, discrete GPUs stay non-unified, and a discrete GPU
takes precedence over an N/A-memory row.
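
One way such a fixture-driven test might be wired up is sketched below. It is
not the content of tests/test_hwfit_unified_nvidia.py: query_gpu_rows is a
hypothetical helper, the --format=csv,noheader,nounits argument is an
assumption about how nvidia-smi is invoked, and the fixture string is invented.

```python
import subprocess
from unittest import mock


def query_gpu_rows():
    """Hypothetical helper: return (memory.total, name) string pairs."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total,name",
         "--format=csv,noheader,nounits"],  # assumed output format
        text=True,
    )
    rows = []
    for line in out.strip().splitlines():
        mem, _, name = line.partition(",")
        rows.append((mem.strip(), name.strip()))
    return rows


# Assumed GB10 fixture: memory.total is not a number.
GB10_FIXTURE = "[N/A], NVIDIA GB10\n"


def test_gb10_row_is_not_dropped():
    # Stub out nvidia-smi so the test runs on machines without a GPU.
    with mock.patch("subprocess.check_output", return_value=GB10_FIXTURE):
        rows = query_gpu_rows()
    assert rows == [("[N/A]", "NVIDIA GB10")]
```

Patching subprocess.check_output keeps the test hermetic, so it exercises only
the parsing of the fixture text and never shells out to a real nvidia-smi.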

Co-authored-by: NubsCarson <nubs@nubs.site>
Shaw
2026-06-02 14:19:39 -04:00
committed by GitHub
parent 66c9349ee3
commit b54468291e
2 changed files with 98 additions and 0 deletions


@@ -105,6 +105,8 @@ def _detect_nvidia():
        return None
    gpus = []
    # Devices nvidia-smi lists with a real name but a non-numeric memory.total.
    unified = []
    # nvidia-smi lists GPUs in index order (0,1,2,...), so the row position is
    # the CUDA device index we'd pass to CUDA_VISIBLE_DEVICES.
    for idx, line in enumerate(out.strip().split("\n")):
@@ -114,9 +116,32 @@ def _detect_nvidia():
            vram_mb = float(parts[0])
            gpus.append({"index": idx, "name": parts[1], "vram_gb": vram_mb / 1024.0})
        except ValueError:
            # Grace Blackwell GB10 / DGX Spark and other unified-memory
            # NVIDIA parts report memory.total as "[N/A]"/"Not Supported"
            # because the GPU shares the system LPDDR pool instead of
            # carrying discrete VRAM. Don't drop the device; remember it so
            # we report a unified-memory GPU below rather than "No GPU" (#1340).
            if parts[1]:
                unified.append({"index": idx, "name": parts[1]})
            continue
    if not gpus:
        if unified:
            # Unified-memory CUDA box: report the GPU backed by system RAM so the
            # Cookbook recommends models and serving works. The pool is shared
            # (not per-GPU discrete VRAM), so report the RAM total once.
            ram_gb = round(_get_ram_gb(), 1)
            gpus = [{"index": g["index"], "name": g["name"], "vram_gb": ram_gb} for g in unified]
            return {
                "gpu_name": gpus[0]["name"],
                "gpu_vram_gb": ram_gb,
                "gpu_count": len(gpus),
                "gpus": gpus,
                "gpu_groups": _group_gpus(gpus),
                "homogeneous": True,
                "backend": "cuda",
                "unified_memory": True,
            }
        return None
    total_vram = sum(g["vram_gb"] for g in gpus)
    groups = _group_gpus(gpus)