`add_hwfit_models.py` infers `parameter_count` and `parameters_raw` by
regexing the HF repo name for a `<num>B` token, optionally with an
`-A<num>B` MoE active-param suffix. Repos that don't encode a size in
their name at all (e.g. `zai-org/GLM-4.5`, where the "4.5" is a version
not a parameter count) fall through to the safetensors element-count
path. That path works for unquantized FP16 / BF16 repos but is brittle
in two cases the catalog hits often:
1. Author-bulk runs (`AUTHORS = ["cyankiwi"]`) pull pre-quantized AWQ /
GPTQ / MLX repos. The safetensors metadata stores the packed I32
tensors and a per-dtype `parameters` map, which the script unpacks
via a per-quant pack factor. When the upload doesn't populate that
map (older repos, custom shards), `st.total` is used raw and the
parameter count is off by 4-8x.
2. Repos where the safetensors block is absent from `model_info()`
entirely. The current code returns `None` and silently drops the
model, which then has to be added to `EXTRA_REPOS` by hand with a
literal `parameter_count` string.
Both are exactly what the issue calls out — the regex / safetensors
combo can't size GLM-4.5 by itself because the name has no `<num>B`
and the upstream repo's safetensors block doesn't carry a usable param
total either.
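As a rough illustration of why the name-based path misses GLM-4.5, here
is a minimal sketch of a `<num>B` / `-A<num>B` matcher. The pattern and
the `size_from_name` helper are hypothetical stand-ins, not the regex
the script actually uses:

```python
import re

# Hypothetical pattern: a number followed by "B", optionally followed by
# an "-A<num>B" active-parameter suffix. The lookbehind rejects digits
# that are part of a longer token or a dotted version such as "4.5".
_SIZE_RE = re.compile(
    r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?(?![A-Za-z0-9])"
)


def size_from_name(repo_id):
    """Return (total_b, active_b) in billions, or None if no size token."""
    name = repo_id.split("/")[-1]
    match = _SIZE_RE.search(name)
    if match is None:
        return None
    total = float(match.group(1))
    active = float(match.group(2)) if match.group(2) else None
    return total, active
```

With this sketch, `Qwen/Qwen3-30B-A3B` yields a total and an active
size, while `zai-org/GLM-4.5` yields `None` and has to be sized some
other way.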
Add a config.json fallback in front of the safetensors path:
- `_fetch_config_json(repo_id)` downloads `config.json` via
`hf_hub_download` (so the standard HF on-disk cache handles
deduplication across runs, no extra cache layer needed). Network /
404 / gated-repo errors return `None` and the caller proceeds to the
safetensors fallback. An in-process `_CONFIG_CACHE` dedupes the
base-model vs. source-repo lookups within a single run.
- `_params_from_config(cfg)` first honours explicit `num_parameters` /
`n_params` / `total_params` fields when present. Otherwise it sums
embeddings + attention (GQA-aware via `num_key_value_heads` and
`head_dim`) + dense MLP (`3 * hidden_size * intermediate_size`,
covering SwiGLU / GeGLU). For MoE configs it picks up both naming
conventions in the wild — `num_experts` / `num_experts_per_tok`
(Qwen3-MoE) and `n_routed_experts` / `n_shared_experts` (GLM-4-MoE,
DeepSeek-V3) — uses `moe_intermediate_size`, and respects
`first_k_dense_replace` so the first N layers stay dense. Active
parameters count `num_experts_per_tok` routed experts plus all
`n_shared_experts` shared experts per MoE layer, which matches how
each architecture reports its active count.
- In `_entry_from_modelinfo`, try config.json on the source repo first
(works for unquantized models) and then on the `base_model:` parent
(covers AWQ / GPTQ children whose own config is just a quantization
manifest). Both lookups run only when regex + override + base_model
tag all failed, so the normal author-bulk run still resolves sizes
from names without touching the Hub.
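The estimate described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code in the
script: `estimate_params` is a hypothetical name, the handling of
`tie_word_embeddings` is assumed, and norms, biases and router gates are
ignored because they are small next to the projection matrices.

```python
def estimate_params(cfg):
    """Return (total, active) parameter estimates from a config dict."""
    # Explicit counts win when the config carries one.
    for key in ("num_parameters", "n_params", "total_params"):
        if cfg.get(key):
            n = int(cfg[key])
            return n, n

    h = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads") or heads
    head_dim = cfg.get("head_dim") or h // heads

    # Input embeddings, plus a separate LM head unless tied (assumption).
    embed = cfg["vocab_size"] * h
    if not cfg.get("tie_word_embeddings", False):
        embed *= 2

    # Q and O projections span all heads; K and V span kv_heads (GQA).
    attn = 2 * h * heads * head_dim + 2 * h * kv_heads * head_dim

    # Gate, up and down projections of a SwiGLU / GeGLU block.
    dense_mlp = 3 * h * cfg.get("intermediate_size", 0)

    routed = cfg.get("num_experts") or cfg.get("n_routed_experts") or 0
    if not routed:
        total = embed + layers * (attn + dense_mlp)
        return total, total

    shared = cfg.get("n_shared_experts") or 0
    per_tok = cfg.get("num_experts_per_tok") or 0
    expert = 3 * h * cfg["moe_intermediate_size"]
    dense_layers = cfg.get("first_k_dense_replace") or 0
    moe_layers = layers - dense_layers

    # Parameters every token touches regardless of routing.
    common = embed + layers * attn + dense_layers * dense_mlp
    total = common + moe_layers * (routed + shared) * expert
    active = common + moe_layers * (per_tok + shared) * expert
    return total, active
```

The only difference between the total and active figures is how many
experts are counted per MoE layer; embeddings, attention and any leading
dense layers are shared by both.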
Spot-checks against the three architecture families this script
actually pulls land within ~5% of the documented param counts, which is
well inside the `parameter_count` rounding (one decimal of "B") and
the `min_vram_gb` downstream bucket:
    Qwen2.5-7B-Instruct   7.62B                    (HF card: 7.6B)
    Qwen3-30B-A3B         30.5B / 3.34B active     (HF card: 30.5B / 3.3B)
    GLM-4.5               352.7B / 33.6B active    (HF card: 355B / 32B)
The safetensors path is unchanged and remains the last resort, so
repos with neither a parsable name nor a fetchable config.json behave
exactly as before.
Closes #955.
The Cookbook fit scanner was reporting impossibly low VRAM requirements
for some pre-quantized models — e.g. cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit
shown as 7.1 GB ('perfect' on a 12 GB card) when the real load is ~40 GB.
Root cause is in the catalog builder. When _entry_from_modelinfo fell
back to safetensors metadata for the parameter count, it stored
safetensors.total directly. For pre-quantized repos that figure reflects
*packed* element counts: AWQ/GPTQ-Int4 pack 8x 4-bit weights into one
I32, AWQ-8bit/GPTQ-Int8/FP8 pack 4x. The catalog therefore recorded as
little as ~1/8 of the real parameter count, and computing min_vram_gb as
packed * bpp applied the quantization a second time.
Fix the safetensors fallback:
* prefer the per-dtype parameters dict when available and unpack only the
I32/I64 entries (the F16/BF16 scale/zero tensors and embeddings are
already at their real element counts)
* fall back to total * pack_factor when only total is exposed
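A minimal sketch of the two-step fallback above, assuming the per-dtype
map is a plain dict of dtype name to element count. The function name
and the choice to apply the same pack factor to I32 and I64 entries are
illustrative assumptions, not a copy of the script:

```python
# Integer dtypes that hold packed quantized weights (per the fix above).
PACKED_DTYPES = {"I32", "I64"}


def unpacked_param_count(total, per_dtype, pack_factor):
    """Estimate the real parameter count of a pre-quantized repo.

    total       -- raw element count reported by the safetensors metadata
    per_dtype   -- optional {dtype: element_count} map, may be None or empty
    pack_factor -- weights per packed integer (e.g. 8 for 4-bit in I32)
    """
    if per_dtype:
        # Unpack only the integer tensors; float scales, zeros and
        # embeddings are already stored at their real element counts.
        return sum(
            count * pack_factor if dtype in PACKED_DTYPES else count
            for dtype, count in per_dtype.items()
        )
    # Only the total is exposed: scale everything by the pack factor.
    return total * pack_factor
```

The per-dtype branch is preferred because scaling the whole total also
inflates the unpacked float tensors, so the total-only branch slightly
overestimates.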
Patch the catalog entries that were affected by the old fallback so the
fit ratings reflect reality without waiting for a full catalog rebuild:
* cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit 11.4B -> 79.7B (40.8 GB VRAM)
* stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ 4.6B -> 30.5B
* stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ 5.1B -> 30.5B
* warshanks/Qwen3-8B-abliterated-AWQ 2.2B -> 8.2B
* QuantTrio/sarvam-30b-AWQ 7B -> 30B
* QuantTrio/sarvam-105b-AWQ 19B -> 105B
Closes #377.