6 Commits

Author SHA1 Message Date
tanmayraut45
4e440a9fd5 Hwfit: estimate params from config.json fallback
`add_hwfit_models.py` infers `parameter_count` and `parameters_raw` by
regexing the HF repo name for a `<num>B` token, optionally with an
`-A<num>B` MoE active-param suffix. Repos that don't encode a size in
their name at all (e.g. `zai-org/GLM-4.5`, where the "4.5" is a version
not a parameter count) fall through to the safetensors element-count
path. That path works for unquantized FP16 / BF16 repos but is brittle
in two cases the catalog hits often:

1. Author-bulk runs (`AUTHORS = ["cyankiwi"]`) pull pre-quantized AWQ /
   GPTQ / MLX repos. The safetensors metadata stores the packed I32
   tensors and a per-dtype `parameters` map, which the script unpacks
   via a per-quant pack factor. When the upload doesn't populate that
   map (older repos, custom shards), `st.total` is used raw and the
   parameter count is off by 4-8x.
2. Repos where the safetensors block is absent from `model_info()`
   entirely. The current code returns `None` and silently drops the
   model, which then has to be added to `EXTRA_REPOS` by hand with a
   literal `parameter_count` string.

Both are exactly what the issue calls out: the regex / safetensors
combo can't size GLM-4.5 by itself because the name has no `<num>B`
token and the upstream repo's safetensors block doesn't carry a usable
param total either.
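
For illustration, the name-based lookup described above could look roughly like the sketch below. The pattern and the `size_from_name` helper are assumptions made for this example, not the regex that `add_hwfit_models.py` actually ships; the point is only to show why `GLM-4.5` yields no match while `Qwen3-30B-A3B` yields both a total and an active size.

```python
import re

# Hypothetical pattern: a number followed by "B", not glued to other
# alphanumerics, optionally followed by an "-A<num>B" active-param suffix.
_SIZE_RE = re.compile(
    r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def size_from_name(repo_id):
    """Return (total_billions, active_billions_or_None), or None if no size token."""
    name = repo_id.split("/")[-1]
    match = _SIZE_RE.search(name)
    if match is None:
        return None
    total = float(match.group(1))
    active = float(match.group(2)) if match.group(2) else None
    return total, active


print(size_from_name("Qwen/Qwen3-30B-A3B"))        # (30.0, 3.0)
print(size_from_name("Qwen/Qwen2.5-7B-Instruct"))  # (7.0, None)
print(size_from_name("zai-org/GLM-4.5"))           # None
```

The lookbehind excludes a leading `.` so the `5` in a version such as `2.5` or `4.5` is never read as a size, and the lookahead keeps `4bit` from being parsed as `4B`.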

Add a config.json fallback in front of the safetensors path:

- `_fetch_config_json(repo_id)` downloads `config.json` via
  `hf_hub_download` (so the standard HF on-disk cache handles
  deduplication across runs, no extra cache layer needed). Network /
  404 / gated-repo errors return `None` and the caller proceeds to the
  safetensors fallback. An in-process `_CONFIG_CACHE` dedupes the
  base-model vs. source-repo lookups within a single run.
- `_params_from_config(cfg)` first honours explicit `num_parameters` /
  `n_params` / `total_params` fields when present. Otherwise it sums
  embeddings + attention (GQA-aware via `num_key_value_heads` and
  `head_dim`) + dense MLP (`3 * hidden_size * intermediate_size`,
  covering SwiGLU / GeGLU). For MoE configs it picks up both naming
  conventions in the wild — `num_experts` / `num_experts_per_tok`
  (Qwen3-MoE) and `n_routed_experts` / `n_shared_experts` (GLM-4-MoE,
  DeepSeek-V3) — uses `moe_intermediate_size`, and respects
  `first_k_dense_replace` so the first N layers stay dense. Active
  parameters count `num_experts_per_tok` routed experts plus
  `n_shared_experts` shared experts per MoE layer, which matches how
  each architecture reports its active count.
- In `_entry_from_modelinfo`, try config.json on the source repo first
  (works for unquantized models) and then on the `base_model:` parent
  (covers AWQ / GPTQ children whose own config is just a quantization
  manifest). Both lookups run only when regex + override + base_model
  tag all failed, so the normal author-bulk run still resolves sizes
  from names without touching the Hub.
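
A minimal sketch of the estimate described in the `_params_from_config` bullet, under stated assumptions: the function name `params_from_config` and its return shape are made up for this example, biases, norms and router weights are ignored, and architecture-specific attention layouts (such as compressed KV projections) are not modelled. It is not the code from the commit.

```python
def params_from_config(cfg):
    """Estimate (total, active) parameter counts from a config.json dict.

    Returns None when core shape fields are missing. `active` is None when
    only an explicit total is available.
    """
    # 1. An explicit total wins when the config carries one.
    for key in ("num_parameters", "n_params", "total_params"):
        value = cfg.get(key)
        if isinstance(value, int) and value > 0:
            return value, None

    try:
        hidden = cfg["hidden_size"]
        layers = cfg["num_hidden_layers"]
        heads = cfg["num_attention_heads"]
        vocab = cfg["vocab_size"]
    except KeyError:
        return None

    kv_heads = cfg.get("num_key_value_heads") or heads
    head_dim = cfg.get("head_dim") or hidden // heads

    # 2. Embeddings: input table, plus an output head unless weights are tied.
    embed = vocab * hidden * (1 if cfg.get("tie_word_embeddings") else 2)

    # 3. Attention per layer: Q and O use all heads, K and V use the KV heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim

    # 4. Gated MLP (SwiGLU / GeGLU): gate, up and down projections.
    dense_mlp = 3 * hidden * cfg.get("intermediate_size", 0)

    n_experts = cfg.get("num_experts") or cfg.get("n_routed_experts") or 0
    moe_inter = cfg.get("moe_intermediate_size") or 0
    if not (n_experts and moe_inter):
        total = embed + layers * (attn + dense_mlp)
        return total, total  # dense model: every parameter is active

    # 5. MoE: the first `first_k_dense_replace` layers keep the dense MLP.
    dense_layers = min(cfg.get("first_k_dense_replace") or 0, layers)
    moe_layers = layers - dense_layers
    expert = 3 * hidden * moe_inter
    shared = cfg.get("n_shared_experts") or 0
    top_k = cfg.get("num_experts_per_tok") or 0

    base = embed + layers * attn + dense_layers * dense_mlp
    total = base + moe_layers * (n_experts + shared) * expert
    active = base + moe_layers * (top_k + shared) * expert
    return total, active


# Hand-typed dense config in the shape of a 7B GQA model (values assumed
# for illustration, not fetched from the Hub).
dense_cfg = {
    "hidden_size": 3584, "intermediate_size": 18944,
    "num_hidden_layers": 28, "num_attention_heads": 28,
    "num_key_value_heads": 4, "vocab_size": 152064,
    "tie_word_embeddings": False,
}
total, active = params_from_config(dense_cfg)
print(f"{total / 1e9:.2f}B")  # 7.62B
```

Keeping the estimate as pure arithmetic over a dict makes it easy to check against hand-computed totals before wiring it to any download step.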

Spot-checks against the three architecture families this script
actually pulls land within ~5% of the documented param counts, which is
well inside the `parameter_count` rounding (one decimal of "B") and
the downstream `min_vram_gb` bucket:

  Model                  Estimate (total / active)   HF card (total / active)
  Qwen2.5-7B-Instruct    7.62B                       7.6B
  Qwen3-30B-A3B          30.5B / 3.34B               30.5B / 3.3B
  GLM-4.5                352.7B / 33.6B              355B / 32B

The safetensors path is unchanged and remains the last resort, so
repos with neither a parsable name nor a fetchable config.json behave
exactly as before.
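
The resolution order this message describes (name regex, override, `base_model` tag, config.json on the source repo, config.json on the parent, then safetensors) can be pictured as a lazy first-match chain. The sketch below uses stub resolvers and a hypothetical `resolve_param_count` helper purely to show that later, network-bound steps are never called once an earlier one succeeds; the labels and stub values are assumptions, not names from the script.

```python
def resolve_param_count(repo_id, resolvers):
    """Try each (label, resolver) in order; return the first non-None result."""
    for label, resolver in resolvers:
        value = resolver(repo_id)
        if value is not None:
            return label, value
    return None, None


calls = []  # records which resolvers actually ran


def stub(label, result):
    """Build a fake resolver that logs its invocation and returns `result`."""
    def resolver(repo_id):
        calls.append(label)
        return result
    return resolver


chain = [
    ("name_regex", stub("name_regex", None)),
    ("override", stub("override", None)),
    ("base_model_tag", stub("base_model_tag", None)),
    ("config_source_repo", stub("config_source_repo", None)),
    ("config_base_model", stub("config_base_model", 1.0)),
    ("safetensors", stub("safetensors", 2.0)),
]

print(resolve_param_count("some-org/some-model", chain))
# ('config_base_model', 1.0)
print(calls[-1])  # config_base_model (the safetensors stub never ran)
```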

Closes #955.
2026-06-02 20:33:25 +09:00
spooky
cd4f496cb4 Fix native Cookbook quant classification 2026-06-02 13:07:20 +09:00
pewdiepie-archdaemon
966b53df77 Improve Cookbook serve diagnostics and recommendations 2026-06-02 12:15:47 +09:00
Sirsyorrz
9955f5bc95 Fix VRAM estimates for pre-quantized HF repos
The Cookbook fit scanner was reporting impossibly low VRAM requirements
for some pre-quantized models — e.g. cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit
shown as 7.1 GB ('perfect' on a 12 GB card) when the real load is ~40 GB.

Root cause is in the catalog builder. When _entry_from_modelinfo fell
back to safetensors metadata for the parameter count, it stored
safetensors.total directly. For pre-quantized repos that figure reflects
*packed* element counts: AWQ/GPTQ-Int4 pack 8x 4-bit weights into one
I32, and AWQ-8bit/GPTQ-Int8/FP8 pack 4x 8-bit weights. The catalog
therefore recorded as little as ~1/8 of the real parameter count, and
min_vram_gb = packed * bpp then applied the quantization a second time.

Fix the safetensors fallback:

* prefer the per-dtype parameters dict when available and unpack only the
  I32/I64 entries (the F16/BF16 scale/zero tensors and embeddings are
  already at their real element counts)
* fall back to total * pack_factor when only total is exposed

Patch the catalog entries that were affected by the old fallback so the
fit ratings reflect reality without waiting for a full catalog rebuild:

* cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit  11.4B -> 79.7B (40.8 GB VRAM)
* stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ  4.6B -> 30.5B
* stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ  5.1B -> 30.5B
* warshanks/Qwen3-8B-abliterated-AWQ  2.2B -> 8.2B
* QuantTrio/sarvam-30b-AWQ  7B -> 30B
* QuantTrio/sarvam-105b-AWQ  19B -> 105B

Closes #377.
2026-06-01 18:32:58 +09:00
pewdiepie-archdaemon
0888a3b3e6 Add native Windows compatibility layer 2026-06-01 15:09:47 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00