Cookbook polish: auto-reconnect, ctx slider fixes, scoring, lots of UI
Backend (services/hwfit + routes):
- VRAM column sort now shows global highest first (was special-cased to
ascending then truncated top-N, which made "highest VRAM" mathematically
unreachable). Every column path uses reverse=True for the truncation.
- Hardware probe cache TTL 30min -> 24h so changing filters doesn't keep
re-probing the rig during a session; Rescan button still forces fresh.
- Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang can't serve them);
default non-prequantized to BF16 on 2+ GPUs.
- AWQ / AWQ-8bit / GPTQ-Int8 get a -1.0 quality penalty so FP8 wins ties.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above M2.5.
- hf_models.json: zai-org/GLM-5.1 added; zai-org/GLM-5 quantization flipped
Q4_K_M -> BF16. DeepSeek-V4-Flash / -Pro + their -Base variants registered
with new FP4-MoE-Mixed / FP8-Mixed quant keys (calibrated BPP from the
actual 156 GB / 284 GB disk footprints).
- New FP4-MoE-Mixed + FP8-Mixed entries in QUANT_BPP / QUANT_SPEED_MULT /
QUANT_QUALITY_PENALTY / QUANT_BYTES_PER_PARAM / PREQUANTIZED_PREFIXES.
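
The version-aware tiebreaker above could work roughly as follows. This is only a sketch: the regex, the `version_key` helper and the `score` field are assumptions for illustration, not the code in services/hwfit.

```python
import re

# Hypothetical pattern: "M" or "V" followed by a number, e.g. "M2.7" or "V4".
# The lookbehind skips letters inside words (the second "M" in "MiniMax").
_VERSION_RE = re.compile(
    r"(?<![A-Za-z0-9])[MV](\d+(?:\.\d+)?)(?![\d.])", re.IGNORECASE
)


def version_key(model_name: str) -> tuple[int, ...]:
    """Return a sortable version tuple parsed from a model name, or () if none."""
    repo = model_name.rsplit("/", 1)[-1]  # drop the "org/" prefix if present
    match = _VERSION_RE.search(repo)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(1).split("."))


# Two rows with an identical (made-up) score: the version tuple breaks the tie.
candidates = [
    {"name": "MiniMax-M2.5", "score": 90.0},
    {"name": "MiniMax-M2.7", "score": 90.0},
]
ranked = sorted(
    candidates,
    key=lambda row: (row["score"], version_key(row["name"])),
    reverse=True,
)
print([row["name"] for row in ranked])  # ['MiniMax-M2.7', 'MiniMax-M2.5']
```

Names without an Mn.n / Vn token return an empty tuple, which sorts below any parsed version, so they never win a tie against a versioned sibling.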
Frontend — Scan/Download:
- Engine + Quant swapped in the toolbar; Quant defaults to "All".
- Ctx (range slider) ported from origin/main: 8k/16k/32k/50k/128k/Max. Drag
re-sorts by vram ascending (smallest fitting first); back to Max → score.
- Ctx slider rail now visible — was background:transparent in a duplicate
later-cascade rule. Hardcoded grey + !important.
- Search input moved to the far right of the toolbar.
- Type/Standard default; "Context" not uppercased; Search placeholder dimmed.
- Engine "?" + Quant "?" inline help chips inside their dropdown boxes.
- Fit-column dot toggles fit-only filter; un-toggling re-sorts by VRAM desc.
- Quant column truncates to 9 chars + ellipsis ("FP4-MoE-M..."), full in
tooltip. Smart title-suffix strips the parts already in the repo name
(QuantTrio/MiniMax-M2-AWQ + quant AWQ-4bit -> just "(4bit)").
- Conditional warning for safetensors models on non-GPU rigs only.
- Dependency Install / Installed / Installed▾ / N/A all 75.85px wide.
- Rebuild llama.cpp moved into the llama_cpp dep row, styled as a tag.
- Foldable Download admin-card (h2 chevron); line under h2 only when folded.
- HF token save gets a green ✓ + "Saved" flash.
- Cached scan no longer counts stalled rows as downloaded.
- Footer: "Request it →" link with GitHub mark to the public discussion
(#1962) for model-add requests.
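
The Quant column truncation and the title-suffix rule above can be sketched like this. The real logic lives in the frontend; this version is in Python only for consistency with the backend code in this commit, and both helper names and the token-splitting rule are assumptions.

```python
def truncate_label(label: str, limit: int = 9) -> str:
    """Cut long quant labels to `limit` chars plus an ellipsis (full text goes in the tooltip)."""
    return label if len(label) <= limit else label[:limit] + "..."


def title_suffix(repo_name: str, quant: str) -> str:
    """Drop the parts of the quant label that the repo name already contains."""
    repo_tokens = {tok.lower() for tok in repo_name.replace("/", "-").split("-")}
    remaining = [part for part in quant.split("-") if part.lower() not in repo_tokens]
    return f"({'-'.join(remaining)})" if remaining else ""


print(truncate_label("FP4-MoE-Mixed"))                         # FP4-MoE-M...
print(title_suffix("QuantTrio/MiniMax-M2-AWQ", "AWQ-4bit"))    # (4bit)
```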
Frontend — Running tab:
- Strict download-finish check (DOWNLOAD_OK or /snapshots/, not bare
"Download complete"). True overall % for multi-shard downloads:
((N-1)+frac)/total instead of hf_transfer's per-shard aggregate.
- ETA in the uptime ticker: "downloading: 12m 34s · ETA 1h 23m".
- Clear button kills the tmux session too; if the output still shows a
live shard line, the pill is hidden + relabels as "reconnect" + revives
on click.
- Self-heal: on cookbook open AND every bg-monitor cycle (10s, throttled
to 8s), scan persisted done/error/crashed downloads and probe their
tmux session — if alive, flip status back to running and reattach.
- Per-launch zombie probe: clicking Download on a model whose persisted
state is done but tmux is still alive revives the existing task and
refuses to start a duplicate.
- Pre-launch GPU probe: vllm / sglang / diffusers serve check
/api/cookbook/gpus first; warns + confirms if no GPU is visible.
- Server-side state guard: rejects "done" POSTs for downloads lacking
DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
shard is N<total — stale tabs can't poison persisted state any more.
- Running count includes tasks whose output looks active even if persisted
status got stuck. Dir text on the running row, font matched to uptime.
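
A worked version of the multi-shard percentage above, ((N-1)+frac)/total. The function names are hypothetical, and the ETA formula (linear extrapolation from elapsed time) is an assumption about how the ticker could derive its estimate, not something the commit states.

```python
from typing import Optional


def overall_progress(shard_index: int, shard_fraction: float, total_shards: int) -> float:
    """Overall fraction done while shard N (1-based) is `shard_fraction` complete."""
    if total_shards <= 0 or not 1 <= shard_index <= total_shards:
        raise ValueError("shard_index must be within 1..total_shards")
    frac = min(max(shard_fraction, 0.0), 1.0)  # clamp noisy per-shard readings
    return ((shard_index - 1) + frac) / total_shards


def eta_seconds(elapsed_seconds: float, progress: float) -> Optional[float]:
    """Assumed ETA: time left if the download keeps its average rate so far."""
    if progress <= 0.0:
        return None  # nothing downloaded yet, no basis for an estimate
    return elapsed_seconds * (1.0 - progress) / progress


progress = overall_progress(shard_index=3, shard_fraction=0.5, total_shards=10)
print(progress)                      # 0.25
print(eta_seconds(600.0, progress))  # 1800.0
```

With shard 3 of 10 half finished, two full shards plus half of the third are done, so the overall figure is 25% rather than the 50% a per-shard reading would suggest.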
Serve panel:
- Ctx text input always resets to model max on open (default 20000 when
metadata is missing).
- Max Seqs default 8 -> 4. KV Cache dtype select 32px tall.
- Lightning icon on Launch (same as Action toggle).
- Diagnosis card simplified (no fold/copy/dismiss), suggestion font
matches body; action buttons get icons on the left (Retry/Copy/Edit/
Install/Kill/Switch/etc.).
- Incomplete-download serve warning when model status is
downloading / stalled / has_incomplete.
- MTP "?" tooltip ("supported on a few model families … up to ~3× faster").
@@ -5110,6 +5110,100 @@
     "release_date": "2023-10-29",
     "_discovered": true
   },
+  {
+    "name": "deepseek-ai/DeepSeek-V4-Flash",
+    "provider": "deepseek-ai",
+    "parameter_count": "284B",
+    "parameters_raw": 284000000000,
+    "active_parameters": 13000000000,
+    "is_moe": true,
+    "min_ram_gb": 200.0,
+    "recommended_ram_gb": 320.0,
+    "min_vram_gb": 156.0,
+    "quantization": "FP4-MoE-Mixed",
+    "context_length": 1000000,
+    "use_case": "General-purpose reasoning, long-context",
+    "capabilities": [
+      "long_context",
+      "reasoning",
+      "moe"
+    ],
+    "pipeline_tag": "text-generation",
+    "architecture": "deepseek_v4_moe",
+    "hf_downloads": 3542202,
+    "hf_likes": 0,
+    "release_date": "2026-05-15"
+  },
+  {
+    "name": "deepseek-ai/DeepSeek-V4-Flash-Base",
+    "provider": "deepseek-ai",
+    "parameter_count": "284B",
+    "parameters_raw": 284000000000,
+    "active_parameters": 13000000000,
+    "is_moe": true,
+    "min_ram_gb": 290.0,
+    "recommended_ram_gb": 460.0,
+    "min_vram_gb": 284.0,
+    "quantization": "FP8-Mixed",
+    "context_length": 1000000,
+    "use_case": "Base pretrained \u2014 fine-tuning starting point",
+    "capabilities": [
+      "long_context",
+      "moe"
+    ],
+    "pipeline_tag": "text-generation",
+    "architecture": "deepseek_v4_moe",
+    "hf_downloads": 0,
+    "hf_likes": 0,
+    "release_date": "2026-05-15"
+  },
+  {
+    "name": "deepseek-ai/DeepSeek-V4-Pro",
+    "provider": "deepseek-ai",
+    "parameter_count": "1.6T",
+    "parameters_raw": 1600000000000,
+    "active_parameters": 49000000000,
+    "is_moe": true,
+    "min_ram_gb": 1100.0,
+    "recommended_ram_gb": 1800.0,
+    "min_vram_gb": 880.0,
+    "quantization": "FP4-MoE-Mixed",
+    "context_length": 1000000,
+    "use_case": "Flagship reasoning, long-context",
+    "capabilities": [
+      "long_context",
+      "reasoning",
+      "moe"
+    ],
+    "pipeline_tag": "text-generation",
+    "architecture": "deepseek_v4_moe",
+    "hf_downloads": 0,
+    "hf_likes": 0,
+    "release_date": "2026-05-15"
+  },
+  {
+    "name": "deepseek-ai/DeepSeek-V4-Pro-Base",
+    "provider": "deepseek-ai",
+    "parameter_count": "1.6T",
+    "parameters_raw": 1600000000000,
+    "active_parameters": 49000000000,
+    "is_moe": true,
+    "min_ram_gb": 1700.0,
+    "recommended_ram_gb": 2600.0,
+    "min_vram_gb": 1600.0,
+    "quantization": "FP8-Mixed",
+    "context_length": 1000000,
+    "use_case": "Base pretrained \u2014 fine-tuning starting point",
+    "capabilities": [
+      "long_context",
+      "moe"
+    ],
+    "pipeline_tag": "text-generation",
+    "architecture": "deepseek_v4_moe",
+    "hf_downloads": 0,
+    "hf_likes": 0,
+    "release_date": "2026-05-15"
+  },
   {
     "name": "deepseek-ai/deepseek-coder-6.7b-base",
     "provider": "DeepSeek",
@@ -13886,53 +13980,6 @@
     "gguf_sources": [],
     "capabilities": []
   },
-  {
-    "name": "deepseek-ai/DeepSeek-V4-Flash",
-    "provider": "DeepSeek",
-    "parameter_count": "158B",
-    "parameters_raw": 158000000000,
-    "min_ram_gb": 165.0,
-    "recommended_ram_gb": 205.0,
-    "min_vram_gb": 165.0,
-    "quantization": "FP8",
-    "context_length": 1000000,
-    "use_case": "General purpose, reasoning (MoE)",
-    "is_moe": true,
-    "num_experts": null,
-    "active_experts": null,
-    "active_parameters": 13000000000,
-    "architecture": "deepseek_v4",
-    "pipeline_tag": "text-generation",
-    "release_date": "2026-04-22",
-    "gguf_sources": [
-      {
-        "repo": "unsloth/DeepSeek-V4-Flash",
-        "provider": "unsloth"
-      }
-    ],
-    "capabilities": []
-  },
-  {
-    "name": "deepseek-ai/DeepSeek-V4-Pro",
-    "provider": "DeepSeek",
-    "parameter_count": "1600B",
-    "parameters_raw": 1600000000000,
-    "min_ram_gb": 928.5,
-    "recommended_ram_gb": 1207.0,
-    "min_vram_gb": 928.5,
-    "quantization": "Q4_K_M",
-    "context_length": 1000000,
-    "use_case": "Frontier reasoning (MoE)",
-    "is_moe": true,
-    "num_experts": null,
-    "active_experts": null,
-    "active_parameters": 49000000000,
-    "architecture": "deepseek_v4",
-    "pipeline_tag": "text-generation",
-    "release_date": "2026-04-22",
-    "gguf_sources": [],
-    "capabilities": []
-  },
   {
     "name": "google/gemma-4-E2B-it",
     "provider": "Google",
@@ -564,7 +564,7 @@ def rank_models(system, use_case=None, limit=50, search=None, sort="score", quan
         })
     if use_case == "image_gen":
         sort_fn = SORT_KEYS.get(sort, SORT_KEYS["score"])
-        results.sort(key=sort_fn, reverse=(sort != "vram"))
+        results.sort(key=sort_fn, reverse=True) # see main path below
         return results[:limit]
 
     # If user picked a native prequantized format, filter to only those models.
@@ -661,7 +661,10 @@ def rank_models(system, use_case=None, limit=50, search=None, sort="score", quan
     # explicitly asked for a Fit-only view.
     results = [r for r in results if r.get("fit_level") != "too_tight"]
     sort_fn = SORT_KEYS.get(sort, SORT_KEYS["score"])
-    # vram ascending (smallest first), everything else descending (biggest first)
-    results.sort(key=sort_fn, reverse=(sort != "vram"))
+    # Always sort descending then truncate top-N so each column shows the
+    # global highest by that metric. Before, vram was special-cased
+    # ascending → truncate kept the 50 SMALLEST models and "highest VRAM"
+    # could never appear, breaking the column-click toggle.
+    results.sort(key=sort_fn, reverse=True)
     results = results[:limit]
     return results
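
To see why the old ordering made the largest models unreachable, here is a small standalone sketch of sort-then-truncate. The `top_n` helper and the sample rows are made up for illustration; only the `reverse` flag mirrors the hunk above.

```python
from operator import itemgetter


def top_n(results, key, limit, descending=True):
    """Sort by `key`, then keep the first `limit` rows."""
    return sorted(results, key=key, reverse=descending)[:limit]


# Illustrative rows only; the real results carry many more fields.
models = [{"name": f"model-{gb}gb", "vram_gb": gb} for gb in (8, 24, 48, 80, 160)]
by_vram = itemgetter("vram_gb")

old_view = top_n(models, by_vram, limit=3, descending=False)  # old vram special case
new_view = top_n(models, by_vram, limit=3, descending=True)   # behaviour after this commit

print([m["vram_gb"] for m in old_view])  # [8, 24, 48]
print([m["vram_gb"] for m in new_view])  # [160, 80, 48]
```

With ascending order applied before the cut, the 80 GB and 160 GB rows are dropped before the client ever sees them, so no later re-sort can bring them back.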
@@ -5,7 +5,9 @@ import shutil
 import subprocess
 import time
 
-CACHE_TTL = 1800 # 30 min — hardware rarely changes; use the Rescan button to force a re-probe
+CACHE_TTL = 24 * 3600 # 24 h — hardware probes are user-initiated via the Rescan button; bumped
+                      # from 30 min so changing filters doesn't keep re-probing the rig every
+                      # half-hour during a long session.
 
 
 _remote_host = None # set by detect_system(host=...)
@@ -13,6 +13,13 @@ QUANT_BPP = {
     "AWQ-4bit": 0.50, "AWQ-8bit": 1.0,
     "GPTQ-Int4": 0.50, "GPTQ-Int8": 1.0,
     "mlx-4bit": 0.55, "mlx-8bit": 1.0, "mlx-6bit": 0.75,
+    # DeepSeek-V4-style mixed: MoE experts in FP4 (bulk), attention + non-
+    # expert dense in FP8, embeddings/LM head in BF16. By weight count the
+    # experts dominate so the effective BPP sits closer to FP4 than FP8.
+    # Empirical: DeepSeek-V4-Flash 284B / 156 GB ≈ 0.55 B/param.
+    "FP4-MoE-Mixed": 0.55,
+    # FP8-Mixed = the *-Base variants (MoE experts also FP8, not FP4).
+    "FP8-Mixed": 1.0,
 }
 
 QUANT_SPEED_MULT = {
@@ -24,6 +31,8 @@ QUANT_SPEED_MULT = {
     "AWQ-4bit": 1.2, "AWQ-8bit": 0.85,
     "GPTQ-Int4": 1.2, "GPTQ-Int8": 0.85,
     "mlx-4bit": 1.15, "mlx-8bit": 0.85, "mlx-6bit": 1.0,
+    "FP4-MoE-Mixed": 1.10, # slightly slower than pure FP4 because of mixed-dtype dispatch
+    "FP8-Mixed": 0.85,
 }
 
 QUANT_QUALITY_PENALTY = {
@@ -39,6 +48,11 @@ QUANT_QUALITY_PENALTY = {
     "AWQ": -1.0, "AWQ-4bit": -4.0, "AWQ-8bit": -1.0,
     "GPTQ": -1.0, "GPTQ-Int4": -4.0, "GPTQ-Int8": -1.0,
     "mlx-4bit": -4.0, "mlx-8bit": -0.5, "mlx-6bit": -1.5,
+    # DeepSeek-V4 mixed: only MoE experts at FP4 (the rest is FP8/BF16),
+    # so the realized quality is much closer to FP8 than to pure FP4 —
+    # the activation-sensitive layers stay high-precision. ~0 penalty.
+    "FP4-MoE-Mixed": -0.5,
+    "FP8-Mixed": 0.0,
 }
 
 QUANT_BYTES_PER_PARAM = {
@@ -50,6 +64,8 @@ QUANT_BYTES_PER_PARAM = {
     "AWQ-4bit": 0.5, "AWQ-8bit": 1.0,
     "GPTQ-Int4": 0.5, "GPTQ-Int8": 1.0,
     "mlx-4bit": 0.5, "mlx-8bit": 1.0, "mlx-6bit": 0.75,
+    "FP4-MoE-Mixed": 0.55,
+    "FP8-Mixed": 1.0,
 }
 
 # Pre-quantized formats that should NOT go through the GGUF quant hierarchy.
@@ -57,6 +73,7 @@ QUANT_BYTES_PER_PARAM = {
 PREQUANTIZED_PREFIXES = (
     "AWQ-", "GPTQ-", "mlx-", "FP8", "FP4", "NVFP4", "MXFP4", "NF4",
     "INT4", "INT8", "W4A16", "W8A8", "W8A16",
+    "FP4-MoE-Mixed", "FP8-Mixed",
 )
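
The bytes-per-parameter calibration behind the new quant keys can be checked with a few lines of arithmetic. The helper names are hypothetical, and reading "GB" as decimal gigabytes (1e9 bytes) is an assumption; the footprints, parameter counts and the 880 GB figure are the ones that appear in this commit.

```python
GB = 1e9  # assumption: footprints are quoted in decimal gigabytes


def calibrate_bpp(disk_gb: float, params: float) -> float:
    """Effective bytes per parameter implied by an on-disk footprint."""
    return disk_gb * GB / params


def estimate_weights_gb(params: float, bpp: float) -> float:
    """Weight footprint in GB for a given parameter count and bytes per parameter."""
    return params * bpp / GB


# DeepSeek-V4-Flash: 156 GB on disk for 284B parameters.
flash_bpp = calibrate_bpp(156, 284e9)
print(round(flash_bpp, 2))  # 0.55

# DeepSeek-V4-Flash-Base: 284 GB on disk for 284B parameters.
print(calibrate_bpp(284, 284e9))  # 1.0

# Applying 0.55 to the 1.6T-parameter V4-Pro lands on its min_vram_gb entry (880.0).
print(round(estimate_weights_gb(1.6e12, 0.55), 1))  # 880.0
```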