Fix VRAM estimates for pre-quantized HF repos

The Cookbook fit scanner was reporting impossibly low VRAM requirements for some pre-quantized models, e.g. cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit shown as 7.1 GB ('perfect' on a 12 GB card) when the real load is ~40 GB.

Root cause is in the catalog builder. When _entry_from_modelinfo fell back to safetensors metadata for the parameter count, it stored safetensors.total directly. For pre-quantized repos that figure reflects *packed* element counts: AWQ/GPTQ-Int4 pack 8x 4-bit weights into one I32, AWQ-8bit/GPTQ-Int8/FP8 pack 4x. The catalog therefore recorded as little as ~1/8 of the real parameter count, and min_vram_gb = packed * bpp then applied the quantization a second time.

Fix the safetensors fallback:

* prefer the per-dtype parameters dict when available and unpack only the I32/I64 entries (the F16/BF16 scale/zero tensors and embeddings are already at their real element counts)
* fall back to total * pack_factor when only total is exposed

Patch the catalog entries that were affected by the old fallback so the fit ratings reflect reality without waiting for a full catalog rebuild:

* cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit 11.4B -> 79.7B (40.8 GB VRAM)
* stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ 4.6B -> 30.5B
* stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ 5.1B -> 30.5B
* warshanks/Qwen3-8B-abliterated-AWQ 2.2B -> 8.2B
* QuantTrio/sarvam-30b-AWQ 7B -> 30B
* QuantTrio/sarvam-105b-AWQ 19B -> 105B

Closes #377.
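
For reference, the unpacking rule in the safetensors fallback can be sketched as a standalone function. This is a minimal illustration, not the code in scripts/add_hwfit_models.py: the names pack_factor_for and unpacked_param_count are hypothetical, the quant label "AWQ-4bit" is an assumed example of what _quant_from_name might return, and the dtype counts are made-up round numbers.

```python
def pack_factor_for(quant: str) -> int:
    """Quantized weights stored per packed integer element (hypothetical helper)."""
    if quant.endswith(("4bit", "Int4")):
        return 8  # eight 4-bit weights per I32
    if quant.endswith(("8bit", "Int8")) or quant == "FP8":
        return 4  # same 4x factor the commit applies to 8-bit formats
    return 1      # unquantized or unknown: counts are taken as-is


def unpacked_param_count(quant, params_by_dtype=None, total=None):
    """Estimate the real parameter count from safetensors element counts."""
    factor = pack_factor_for(quant)
    if params_by_dtype:
        # Only the integer tensors are packed; other dtypes are already
        # stored at their real element counts.
        packed = sum(c for d, c in params_by_dtype.items() if d in ("I32", "I64"))
        rest = sum(c for d, c in params_by_dtype.items() if d not in ("I32", "I64"))
        return packed * factor + rest
    if total:
        # Coarser path: every element is scaled, including unpacked tensors.
        return int(total) * factor
    return None  # nothing to size the model with


# Illustrative counts only (assumed, not taken from a real repo).
example = {"I32": 1_000_000_000, "F16": 200_000_000}
print(unpacked_param_count("AWQ-4bit", example))  # 8200000000
```

The per-dtype path is preferred because the total-only path also multiplies tensors that were never packed, so it can overestimate when a repo carries large unquantized tensors.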
2026-06-01 19:32:58 +10:00
parent 16d6484492
commit 9955f5bc95
2 changed files with 68 additions and 39 deletions
--- a/scripts/add_hwfit_models.py
+++ b/scripts/add_hwfit_models.py
@@ -120,20 +120,40 @@ def _entry_from_modelinfo(mi, overrides):
                total = bt
                if ba and active is None:
                    active = ba
-    # Last resort: read safetensors param count (note: for quantized repos this
-    # is the *packed* count, so it's only an approximation).
+    # Determine quant first — we need it to unpack the safetensors fallback.
+    quant = _quant_from_name(name)
+    # Last resort: read safetensors element counts. For pre-quantized repos
+    # (AWQ/GPTQ/MLX-Int4 etc.) the weights are packed: 8× 4-bit weights per
+    # I32 element, 4× 8-bit weights per I32. The bare safetensors total
+    # therefore undercounts real parameter count by the same factor, which
+    # then feeds a wrong `min_vram_gb` downstream. Sum per-dtype and unpack
+    # the packed I32 tensors so the catalog stores the true param count.
    if total is None:
        try:
            full = api.model_info(name, files_metadata=False)
            st = getattr(full, "safetensors", None)
-            if st and getattr(st, "total", None):
-                total = int(st.total)
+            if st:
+                params_by_dtype = getattr(st, "parameters", None) or {}
+                if quant.endswith("4bit") or quant.endswith("Int4"):
+                    pack_factor = 8
+                elif quant.endswith("8bit") or quant.endswith("Int8") or quant == "FP8":
+                    pack_factor = 4
+                else:
+                    pack_factor = 1
+                if params_by_dtype:
+                    # I32/I64 hold the packed quantized weights; everything
+                    # else (F16/BF16 scales, zeros, embeddings) is already at
+                    # its real element count.
+                    packed = sum(c for d, c in params_by_dtype.items() if d in ("I32", "I64"))
+                    rest = sum(c for d, c in params_by_dtype.items() if d not in ("I32", "I64"))
+                    total = packed * pack_factor + rest
+                elif getattr(st, "total", None):
+                    total = int(st.total) * pack_factor
        except Exception:
            pass
    if total is None:
        return None  # can't size it — skip
    pb = total / 1e9
-    quant = _quant_from_name(name)
    created = getattr(mi, "created_at", None)
    rel = created.strftime("%Y-%m-%d") if created else datetime.utcnow().strftime("%Y-%m-%d")
    # Rough RAM/VRAM hints (fit.py recomputes the real requirement from params+quant).