Commit Graph

5 Commits

Author SHA1 Message Date
pewdiepie-archdaemon
562bc4dedc Cookbook polish: auto-reconnect, ctx slider fixes, scoring, lots of UI
Backend (services/hwfit + routes):
- VRAM column sort now shows global highest first (was special-cased to
  ascending then truncated top-N, which made "highest VRAM" mathematically
  unreachable). Every column path uses reverse=True for the truncation.
- Hardware probe cache TTL 30min -> 24h so changing filters doesn't keep
  re-probing the rig during a session; Rescan button still forces fresh.
- Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang can't serve them);
  default non-prequantized to BF16 on 2+ GPUs.
- AWQ / AWQ-8bit / GPTQ-8bit get a -1.0 quality penalty so FP8 wins ties.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above M2.5.
- hf_models.json: zai-org/GLM-5.1 added; zai-org/GLM-5 quantization flipped
  Q4_K_M -> BF16. DeepSeek-V4-Flash / -Pro + their -Base variants registered
  with new FP4-MoE-Mixed / FP8-Mixed quant keys (calibrated BPP from the
  actual 156 GB / 284 GB disk footprints).
- New FP4-MoE-Mixed + FP8-Mixed entries in QUANT_BPP / QUANT_SPEED_MULT /
  QUANT_QUALITY_PENALTY / QUANT_BYTES_PER_PARAM / PREQUANTIZED_PREFIXES.

Frontend — Scan/Download:
- Engine + Quant swapped in the toolbar; Quant defaults to "All".
- Ctx (range slider) ported from origin/main: 8k/16k/32k/50k/128k/Max. Drag
  re-sorts by vram ascending (smallest fitting first); back to Max → score.
- Ctx slider rail now visible — was background:transparent in a duplicate
  later-cascade rule. Hardcoded grey + !important.
- Search input moved to the far right of the toolbar.
- Type/Standard default; "Context" not uppercased; Search placeholder dimmed.
- Engine "?" + Quant "?" inline help chips inside their dropdown boxes.
- Fit-column dot toggles fit-only filter; un-toggling re-sorts by VRAM desc.
- Quant column truncates to 9 chars + ellipsis ("FP4-MoE-M..."), full in
  tooltip. Smart title-suffix strips the parts already in the repo name
  (QuantTrio/MiniMax-M2-AWQ + quant AWQ-4bit -> just "(4bit)").
- Conditional warning for safetensors models on non-GPU rigs only.
- Dependency Install / Installed / Installed▾ / N/A all 75.85px wide.
- Rebuild llama.cpp moved into the llama_cpp dep row, styled as a tag.
- Foldable Download admin-card (h2 chevron); line under h2 only when folded.
- HF token save gets a green ✓ + "Saved" flash.
- Cached scan no longer counts stalled rows as downloaded.
- Footer: "Request it →" link with GitHub mark to the public discussion
  (#1962) for model-add requests.

Frontend — Running tab:
- Strict download-finish check (DOWNLOAD_OK or /snapshots/, not bare
  "Download complete"). True overall % for multi-shard downloads:
  ((N-1)+frac)/total instead of hf_transfer's per-shard aggregate.
- ETA in the uptime ticker: "downloading: 12m 34s · ETA 1h 23m".
- Clear button kills the tmux session too; if the output still shows a
  live shard line, the pill is hidden + relabels as "reconnect" + revives
  on click.
- Self-heal: on cookbook open AND every bg-monitor cycle (10s, throttled
  to 8s), scan persisted done/error/crashed downloads and probe their
  tmux session — if alive, flip status back to running and reattach.
- Per-launch zombie probe: clicking Download on a model whose persisted
  state is done but tmux is still alive revives the existing task and
  refuses to start a duplicate.
- Pre-launch GPU probe: vllm / sglang / diffusers serve check
  /api/cookbook/gpus first; warns + confirms if no GPU is visible.
- Server-side state guard: rejects "done" POSTs for downloads lacking
  DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
  shard is N<total — stale tabs can't poison persisted state any more.
- Running count includes tasks whose output looks active even if persisted
  status got stuck. Dir text on the running row, font matched to uptime.

Serve panel:
- Ctx text input always resets to model max on open (default 20000 when
  metadata is missing).
- Max Seqs default 8 -> 4. KV Cache dtype select 32px tall.
- Lightning icon on Launch (same as Action toggle).
- Diagnosis card simplified (no fold/copy/dismiss), suggestion font
  matches body; action buttons get icons on the left (Retry/Copy/Edit/
  Install/Kill/Switch/etc.).
- Incomplete-download serve warning when model status is
  downloading / stalled / has_incomplete.
- MTP "?" tooltip ("supported on a few model families … up to ~3× faster").
2026-06-03 20:25:25 +09:00
pewdiepie-archdaemon
eb79b76432 Cookbook: scoring fixes, UI polish, false-finished + stale-state bug fixes
Backend (services/hwfit + routes):
- rank_models picks visible set by REQUESTED column, not always score —
  sorting by Param now shows highest-param models PERIOD (incl. too_tight).
- New fit_only param. Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang
  cannot serve them); default non-prequantized to BF16 on 2+ GPUs.
- AWQ / GPTQ-8bit get a -1.0 quality penalty (was 0.0, tied with FP8), so
  FP8 wins when both fit.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above
  M2.5 on equal composite score; >=100B integers not misread as versions.
- /api/cookbook/hf-latest no longer drops models without an "NB" pattern in
  the repo id (MiniMax-M2.7, DeepSeek-V4-Pro etc. were silently filtered).
- Cached-model scan: atexit flushes models JSON even if the script is
  killed mid-walk; each scan_dir wrapped in try/except; timeout 60s -> 180s.
- KB granularity for sub-MB sizes (was "0 MB" for 12 KB shells). New
  "stalled" status for shells <1 MB with no .incomplete files.
- /api/cookbook/state POST guard: rejects "done" download tasks lacking
  DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
  shard is N<total — stops stale tabs from poisoning persisted state.
- hf_models.json: add zai-org/GLM-5.1; flip zai-org/GLM-5 quantization
  Q4_K_M -> BF16 (it is the native base, not a quant).

Frontend (static/js):
- Scan/Download toolbar: quant defaults to All; ctx slider (8k/16k/32k/
  50k/128k/Max) ported from origin/main with sort=fit on drag, sort=score
  on Max. GPU toggle commits _activeCount to maxGpu on initial render. Fit
  column header tagged with active budget (RAM / GPU / N GPU).
- Foldable Download admin-card: the Download h2 is the chevron trigger;
  state persists in localStorage.
- Download card surfaces destination dir (Dir: <path>). Same dir on running
  task row, font/color matched to uptime (9px Fira Code muted, opacity .4).
- Serve panel ctx text input always resets to model max on open. Sub-MB
  cached models show with red "download stalled" badge.
- Bulk-select Cancel + Delete reset the Select button label on exit.
- Cookbook running: false-finished bug fixed — DOWNLOAD_OK or /snapshots/
  required; bare "Download complete" no longer marks the task done after
  the first config file. Clear button now sends tmux kill-session too.
  True overall % for multi-shard downloads: ((N-1)+frac)/total instead of
  hf_transfer per-shard aggregate.
- Diagnosis card simplified: removed fold toggle, copy button, dismiss X.
  Suggestion font matches message body (12px).
- HF token field flashes green check + "Saved" on save.
- Cached scan no longer counts stalled rows as downloaded in Scan/Download.

CSS:
- dep Install button width pinned to 76px to match Installed split.
- task-sub row +1px; task-status badge gets margin-right 8px.
- Ctx slider styled like gallery editor sliders (thin pill rail, red thumb).
- Bulk-select cancel button top -3px -> -5px.
2026-06-03 16:32:20 +09:00
spooky
033852ab14 fix: require GGUF sources for llama downloads (#368) 2026-06-01 22:47:47 +09:00
Sirsyorrz
9955f5bc95 Fix VRAM estimates for pre-quantized HF repos
The Cookbook fit scanner was reporting impossibly low VRAM requirements
for some pre-quantized models — e.g. cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit
shown as 7.1 GB ('perfect' on a 12 GB card) when the real load is ~40 GB.

Root cause is in the catalog builder. When _entry_from_modelinfo falls
back to safetensors metadata for the parameter count, it stored
safetensors.total directly. For pre-quantized repos that figure reflects
*packed* element counts: AWQ/GPTQ-Int4 pack 8x 4-bit weights into one
I32, AWQ-8bit/GPTQ-Int8/FP8 pack 4x. The catalog therefore recorded
~1/8 of the real parameter count, and min_vram_gb = packed * bpp
double-applied the quantization.

Fix the safetensors fallback:

* prefer the per-dtype parameters dict when available and unpack only the
  I32/I64 entries (the F16/BF16 scale/zero tensors and embeddings are
  already at their real element counts)
* fall back to total * pack_factor when only total is exposed

Patch the catalog entries that were affected by the old fallback so the
fit ratings reflect reality without waiting for a full catalog rebuild:

* cyankiwi/Qwen3-Coder-Next-REAM-AWQ-4bit  11.4B -> 79.7B (40.8 GB VRAM)
* stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ  4.6B -> 30.5B
* stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ  5.1B -> 30.5B
* warshanks/Qwen3-8B-abliterated-AWQ  2.2B -> 8.2B
* QuantTrio/sarvam-30b-AWQ  7B -> 30B
* QuantTrio/sarvam-105b-AWQ  19B -> 105B

Closes #377.
2026-06-01 18:32:58 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00