From 6fca7e86b70b2e4bf717c874cdb272c37c61cd76 Mon Sep 17 00:00:00 2001 From: Leo Date: Tue, 2 Jun 2026 05:34:42 +0200 Subject: [PATCH] Cookbook serve profiles and engine filter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Cookbook: Engine filter + intelligent hardware-computed serve profiles Two related Cookbook serving improvements for accurate, hardware-aware model serving (especially on consumer GPUs that can only run GGUF/llama.cpp). Engine filter - New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant picker. Pure client-side view filter over the fetched list via the same _detectBackend() the serve commands use, so what you filter to is exactly what would launch. Re-renders from cache (no refetch). Empty-state message + the instant-cache-paint path account for it too. Intelligent serve profiles (Quality / Balanced / Speed) - services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM + model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type, context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU instead of failing; a model that fits stays fully on GPU; quant tracks profile intent; vision models keep image-encoder headroom. Reuses models.py VRAM math so filtering and serving agree on what fits. Pure/deterministic (no t/s claims — partial-offload speed isn't reliably predictable; fit is what's computed). - /api/hwfit/profiles endpoint returns the profiles + the model's trained context limit, with loose name matching (strips org/ prefix, -GGUF suffix, quant tag) so a local GGUF folder name resolves to its catalog entry. - _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn / --cache-type-k/v when set, with llama-cpp-python fallback equivalents. It previously only set -ngl/-c, which is why it OOM'd or ran slow. - Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV Cache / Flash Attn fields. 
Context is clamped to the model's trained limit (and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch — fixes a crash where a stale 256k/16M preset + quantized KV cache caused an amdgpu ErrorDeviceLost. Tests: tests/test_serve_profiles.py (7) — offload vs full-GPU fit, never exceed VRAM, context cap, launchable flags, vision headroom, no-GPU empty. Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd green; verified live on an RDNA4 box (gfx1200) — Balanced lands ~ncm18 q4 128k, matching hand-tuning. Co-Authored-By: Claude Opus 4.8 (1M context) * Cookbook: make column-header sorting discoverable (incl. Newest) Sorting in Cookbook is via clickable column headers (pewds' design), but the headers had no visual cue that they're interactive — so sorting in general, and the Newest sort on the Model header specifically, was undiscoverable. - Style sortable headers as interactive: pointer cursor, hover underline, and the active sort column bolded/highlighted. There was no CSS for .hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort, not just Newest. - The Model column header sorts by release_date (newest first), reusing the existing header-click sort wiring and the "newest" SORT_KEY. No new sort control — uses the existing column-header paradigm. Checks: node --check passes. Co-Authored-By: Claude Opus 4.8 (1M context) * Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2) In the Serve tab the model is a specific GGUF file already on disk, so its quant can't change — but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K" as if you could re-quantize it. That's meaningless when serving a fixed file. - compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE mode), the quant is locked to the file's and profiles differ only in the real serving knobs — n_cpu_moe, KV-cache type, context. 
_weights_gb / _cpu_moe_for_budget use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode (no override) still varies the quant to show download options. - /api/hwfit/profiles accepts serve_weights_gb & serve_quant. - The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from the repo/file name) and passes them, so profiles match what's actually served. Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k ncm15) — no nonsensical quant changes. Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9). Co-Authored-By: Claude Opus 4.8 (1M context) * Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor Two serve-panel additions: 1. **Vision toggle.** A "Vision" checkbox that serves the model with its multimodal projector so it can read images. The mmproj path is resolved at runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in the model folder makes the toggle just work; `--mmproj … --image-max-tokens 1024` (native) / `--clip_model_path` (llama-cpp-python) only when on + found. 2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s while the panel is open and shows VRAM used/total/%, free, and — crucially on a discrete card — **RAM spillover** (AMD gtt_used_mb), with a plain-language health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint (previously read for total only and discarded for 'used'). Lets you see at a glance whether a config fits VRAM (fast) or is paging to system RAM over PCIe (slow) instead of guessing. Checks: node --check + py_compile pass. 
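For reviewers, the monitor's health hint can be restated as a small Python sketch. This only mirrors the thresholds used by the JS in static/js/cookbookServe.js in this patch; the function name `classify_vram_health` and the colour/hint strings are illustrative assumptions and are not part of the change.

```python
def classify_vram_health(used_mb, total_mb, gtt_used_mb=0, unified_memory=False):
    """Return (color, hint) using the same thresholds as the serve-panel monitor.

    Hypothetical helper for illustration only; the shipped logic lives in JS.
    """
    used_gb = used_mb / 1024
    total_gb = total_mb / 1024
    # int(x + 0.5) mimics JS Math.round for non-negative values.
    pct = int(used_gb / total_gb * 100 + 0.5) if total_gb else 0
    spill_gb = gtt_used_mb / 1024
    # Unified-memory APUs always use GTT, so GTT use is not a spill there.
    spilling = spill_gb > 0.5 and not unified_memory

    if pct >= 97 or spilling:
        color = "red"
    elif pct >= 85:
        color = "amber"
    else:
        color = "green"

    if spilling:
        hint = "spilled to RAM"
    elif pct >= 90:
        hint = "tight"
    else:
        hint = "healthy"
    return color, hint


if __name__ == "__main__":
    # 16 GB discrete card, 15 GB used, 2 GB paged into system RAM.
    print(classify_vram_health(15360, 16384, gtt_used_mb=2048))
```

Note that the colour band (amber from 85%) and the text hint ("tight" from 90%) use different thresholds, so a card at 85-89% shows amber with a "healthy" label.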
Co-Authored-By: Claude Opus 4.8 (1M context) --------- Co-authored-by: Claude Opus 4.8 (1M context) --- routes/cookbook_routes.py | 7 ++ routes/hwfit_routes.py | 59 +++++++++ services/hwfit/profiles.py | 229 +++++++++++++++++++++++++++++++++++ static/js/cookbook-hwfit.js | 31 ++++- static/js/cookbook.js | 45 ++++++- static/js/cookbookServe.js | 177 +++++++++++++++++++++++++++ static/style.css | 6 + tests/test_serve_profiles.py | 110 +++++++++++++++++ 8 files changed, 658 insertions(+), 6 deletions(-) create mode 100644 services/hwfit/profiles.py create mode 100644 tests/test_serve_profiles.py diff --git a/routes/cookbook_routes.py b/routes/cookbook_routes.py index 28a2897..106460f 100644 --- a/routes/cookbook_routes.py +++ b/routes/cookbook_routes.py @@ -1401,9 +1401,16 @@ def setup_cookbook_routes() -> APIRouter: total_mb = max(0, int(total_bytes / (1024 * 1024))) used_mb = max(0, min(total_mb, int(used_bytes / (1024 * 1024)))) free_mb = max(0, total_mb - used_mb) + # GTT = the system-RAM pool the GPU pages into when VRAM is full. + # On a discrete card a large gtt_used means the model spilled past + # VRAM into RAM over PCIe — much slower. Surface it so the UI can + # warn "spilling to RAM" instead of the user wondering why it's slow. 
+ gtt_used_raw = await _gpu_read_file(f"{base}/mem_info_gtt_used", host, ssh_port) + gtt_used_mb = max(0, int(int(gtt_used_raw) / (1024 * 1024))) if (gtt_used_raw and gtt_used_raw.isdigit()) else 0 gpus.append({ "index": len(gpus), "name": name, "uuid": entry, "free_mb": free_mb, "total_mb": total_mb, "used_mb": used_mb, + "gtt_used_mb": gtt_used_mb, "util_pct": 0, "busy": bool(total_mb and (free_mb / total_mb) < 0.85), "processes": [], "backend": "rocm", "source": "amd-sysfs", "unified_memory": unified, diff --git a/routes/hwfit_routes.py b/routes/hwfit_routes.py index 9a0a4e9..94ff90d 100644 --- a/routes/hwfit_routes.py +++ b/routes/hwfit_routes.py @@ -1,3 +1,4 @@ +import re from copy import deepcopy from fastapi import APIRouter @@ -174,6 +175,64 @@ def setup_hwfit_routes(): results = rank_models(system, use_case=use_case or None, limit=limit, search=search or None, sort=sort, quant=quant or None) return {"system": system, "models": results} + @router.get("/profiles") + def get_serve_profiles(model: str = "", host: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, serve_weights_gb: float = 0.0, serve_quant: str = ""): + """Compute llama.cpp serve profiles (Quality/Balanced/Speed) for `model` + against the detected hardware on `host` (or local). Returns concrete + flags (n_gpu_layers, n_cpu_moe, cache_type, ctx) the serve UI can apply. + + `model` is matched against the catalog by name. If it is not in the + catalog (e.g. an ad-hoc HF repo) there is no synthetic entry to fall back + on, so we return an empty profile list and the UI keeps manual flags. 
+ """ + from services.hwfit.hardware import detect_system + from services.hwfit.models import get_models + from services.hwfit.profiles import compute_serve_profiles + system = detect_system(host=host, ssh_port=ssh_port, platform=platform, fresh=fresh) + if system.get("error"): + return {"system": system, "profiles": [], "error": system["error"]} + catalog = {m.get("name"): m for m in (get_models() or [])} + + def _norm(s): + # Normalize for matching: drop org/ prefix, a trailing -GGUF/-gguf + # marker, and any quant tag, lowercase. So "DeepSeek-Coder-V2-Lite- + # Instruct-GGUF" (a local folder name) matches catalog entry + # "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct". + s = (s or "").lower().strip() + s = s.split("/")[-1] # drop org prefix + s = re.sub(r"[-_.]?gguf$", "", s) # drop trailing gguf marker + s = re.sub(r"[-_.](q\d[^/]*|iq\d[^/]*|fp8|bf16|f16|awq[^/]*|gptq[^/]*)$", "", s) + return s + + m = catalog.get(model) + if m is None and model: + want = _norm(model) + for name, entry in catalog.items(): + nn = _norm(name) + if nn and (nn == want or want.endswith(nn) or nn.endswith(want)): + m = entry + break + if m is None: + return {"system": system, "profiles": [], "error": "model not in catalog"} + # Surface the model's trained context limit so the serve UI can clamp a + # user-typed context down to it (asking for ctx > n_ctx_train overflows + # and, with a quantized KV cache, can crash the GPU). 
+ model_ctx_max = 0 + for k in ("context_length", "max_position_embeddings", "n_ctx_train", "context"): + v = m.get(k) + if isinstance(v, (int, float)) and v > 0: + model_ctx_max = int(v) + break + return { + "system": system, + "profiles": compute_serve_profiles( + system, m, + serve_weights_gb=(serve_weights_gb or None), + serve_quant=(serve_quant or None), + ), + "model_ctx_max": model_ctx_max, + } + @router.get("/image-models") def get_image_models(sort: str = "fit", search: str = "", host: str = "", gpu_count: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, manual_mode: str = "", manual_gpu_count: str = "", manual_vram_gb: str = "", manual_ram_gb: str = "", manual_backend: str = "", ignore_detected_gpu: bool = False, ignore_detected_ram: bool = False): """Rank image generation models against detected hardware.""" diff --git a/services/hwfit/profiles.py b/services/hwfit/profiles.py new file mode 100644 index 0000000..87aa147 --- /dev/null +++ b/services/hwfit/profiles.py @@ -0,0 +1,229 @@ +"""Compute intelligent llama.cpp serve profiles from detected hardware. + +Given a system (VRAM/RAM/arch) and a model, produce up to three ready-to-launch +profiles — Quality / Balanced / Speed — with concrete llama.cpp flags +(n_gpu_layers, n_cpu_moe, cache-type, context). This turns the by-hand tuning +(how many MoE layers fit on the GPU, when to spend VRAM on a q8 KV cache vs more +context, how much headroom to leave for a vision encoder) into a formula. + +Pure/deterministic — no benchmarking, no I/O. Reuses the same VRAM math as +fit.py/models.py so "what the Cookbook recommends" and "what it serves" agree. + +NOTE: token/s figures are NOT computed here — real speed on partial-offload MoE +is CPU-bound and not reliably predictable from specs. The UI labels profiles by +their tradeoff (Quality/Balanced/Speed), and the VRAM fit (the part that decides +whether it even loads) is what's computed from real numbers. 
+""" + +from services.hwfit.models import ( + QUANT_BPP, + params_b, + _active_params_b, + is_prequantized, +) + +# GGUF KV-cache cost per token, in bytes-per-active-billion-param, by cache type. +# q4_0 is ~half of q8_0 is ~half of f16. The 8e-6 base in estimate_memory_gb is +# the q8_0-ish figure; scale from there. +_KV_FACTOR = {"q4_0": 0.5, "q8_0": 1.0, "f16": 2.0} + +# Quant ladder from highest quality/size down. A profile that wants "best quant +# that fits fully on GPU" walks this until one fits. +_QUANT_LADDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"] + + +def _weights_gb(model, quant, fixed_gb=None): + """VRAM for the full weights. When fixed_gb is given (serving a specific GGUF + file already on disk), use its real size — the quant is whatever the file is, + not something we get to pick.""" + if fixed_gb and fixed_gb > 0: + return float(fixed_gb) + return params_b(model) * QUANT_BPP.get(quant, 0.58) + + +def _kv_gb(model, ctx, kv_type): + """KV-cache VRAM at a context length and cache type.""" + kv_params = _active_params_b(model) + return 0.000008 * kv_params * ctx * _KV_FACTOR.get(kv_type, 1.0) + + +def _n_layers(model): + """Best-effort total transformer block count (for n-cpu-moe math).""" + for k in ("num_hidden_layers", "n_layers", "num_layers", "block_count"): + v = model.get(k) + if isinstance(v, (int, float)) and v > 0: + return int(v) + # Fallback heuristic by size — most MoE/dense LLMs land 28-64 layers. + pb = params_b(model) + if pb >= 60: + return 64 + if pb >= 25: + return 48 + if pb >= 12: + return 40 + return 32 + + +def _cpu_moe_for_budget(model, quant, kv_gb, vram_budget_gb, fixed_gb=None): + """How many MoE layers must move to CPU so weights+KV fit vram_budget_gb. + + Returns (n_cpu_moe, fits_fully). When the model already fits, n_cpu_moe=0. + Each offloaded layer frees roughly weights/n_layers of VRAM. 
We only model + this for MoE (where --n-cpu-moe applies); dense models just report whether + they fit at the given n_gpu_layers=999. + """ + weights = _weights_gb(model, quant, fixed_gb) + needed = weights + kv_gb + 0.6 # +0.6 GB runtime/compute buffers + if needed <= vram_budget_gb: + return 0, True + if not model.get("is_moe"): + # Dense: no per-expert offload knob; either it fits or it spills via -ngl. + return 0, False + layers = _n_layers(model) + per_layer = weights / max(layers, 1) + overflow = needed - vram_budget_gb + import math + n = math.ceil(overflow / max(per_layer, 1e-6)) + n = max(0, min(n, layers)) # clamp + return n, False + + +def compute_serve_profiles(system, model, serve_weights_gb=None, serve_quant=None): + """Return a list of profile dicts for llama.cpp serving of `model` on `system`. + + Each profile: {key, label, quant, n_gpu_layers, n_cpu_moe, cache_type, ctx, + est_vram_gb, fits, offloads, note}. Empty list if no GPU VRAM is detected + (caller should fall back to manual flags). + + DOWNLOAD mode (default): the quant isn't chosen yet, so profiles vary it + (Quality=Q6, Balanced=Q4, Speed=Q2…) to show download options. + + SERVE mode (serve_weights_gb set): a specific GGUF file already exists on + disk — its quant is FIXED. Profiles then keep that quant/size and differ only + in the actual serving knobs (n_cpu_moe, KV-cache type, context). serve_quant + is the file's quant label (e.g. "Q4_K_M") just for display. + """ + vram = float(system.get("gpu_vram_gb") or 0) + if vram <= 0: + return [] + + serve_mode = bool(serve_weights_gb and serve_weights_gb > 0) + + # Never propose more context than the model was trained for — asking llama.cpp + # for ctx > n_ctx_train triggers a "training context overflow" and, with a + # quantized KV cache, an oversized allocation that can crash the GPU + # (radv/amdgpu ErrorDeviceLost). Cap every profile at the model's real limit. 
+ model_ctx_max = 0 + for k in ("context_length", "max_position_embeddings", "n_ctx_train", "context"): + v = model.get(k) + if isinstance(v, (int, float)) and v > 0: + model_ctx_max = int(v) + break + if model_ctx_max <= 0: + model_ctx_max = 131072 # conservative default when the catalog omits it + + # Vision models need headroom for the image encoder (~1 GB on top of weights). + is_vision = bool( + model.get("is_multimodal") or model.get("vision") or model.get("mmproj") + or "vl" in str(model.get("name", "")).lower() + ) + headroom = 1.1 if is_vision else 0.4 + budget = max(vram - headroom, 1.0) + + # Prequantized (AWQ/GPTQ/FP8) served via GGUF fallback use a fixed ~Q4 quant; + # GGUF models can pick their quant. Pick a sensible per-profile quant. + fixed_quant = model.get("quantization") if is_prequantized(model) else None + + is_moe = bool(model.get("is_moe")) + + def _pick_quant(prefer, require_full_fit): + """Choose a quant for a profile. + + - fixed_quant (AWQ/GPTQ/FP8 served via GGUF): always that. + - require_full_fit=True (Speed): walk DOWN from `prefer` to the best quant + whose weights fit fully on the GPU (no offload) — fastest. + - require_full_fit=False (Quality on MoE): keep `prefer` even if it must + offload experts to CPU; that's the whole point of n-cpu-moe on a card + too small to hold the weights. For dense models we can't offload + per-expert, so fall back to the largest fully-fitting quant. + """ + if fixed_quant: + return fixed_quant + start = _QUANT_LADDER.index(prefer) if prefer in _QUANT_LADDER else 3 + if require_full_fit or not is_moe: + for q in _QUANT_LADDER[start:]: + if _weights_gb(model, q) + 0.6 <= budget: + return q + return _QUANT_LADDER[-1] + # MoE quality: keep the preferred (big) quant; offload handles overflow. + return prefer + + if serve_mode: + # Fixed file on disk — quant can't change. Vary only the serving knobs. 
+ fq = serve_quant or model.get("quantization") or "GGUF" + specs = [ + # key, label, prefer_quant, full_fit, kv_type, ctx, note + ("quality", "Quality", fq, False, "q8_0", 131072, + "Sharp q8 KV cache + full context. Best long-context accuracy; offloads MoE layers to CPU if needed."), + ("balanced", "Balanced", fq, False, "q4_0", 131072, + "Compact q4 KV at full context — good speed/quality mix."), + ("speed", "Speed", fq, False, "q4_0", 32768, + "Trimmed context + light KV for the fastest tokens/s."), + ] + else: + specs = [ + # key, label, prefer_quant, full_fit, kv_type, ctx, note + ("quality", "Quality", "Q6_K", False, "q8_0", 131072, + "Biggest quant + sharp q8 KV cache. Best answers; offloads MoE layers to CPU if needed."), + ("balanced", "Balanced", "Q4_K_M", False, "q4_0", 131072, + "Q4 weights + compact q4 KV. Good speed/quality mix at full context."), + ("speed", "Speed", "Q4_K_M", True, "q4_0", 32768, + "Smallest offload + trimmed context for the fastest tokens/s."), + ] + + profiles = [] + for key, label, prefer_q, full_fit, kv_type, ctx, note in specs: + # In serve mode the quant is fixed (the file's); in download mode we pick. + quant = prefer_q if serve_mode else _pick_quant(prefer_q, full_fit) + # Shrink context if even the chosen KV won't fit alongside weights. + # Start from the smaller of the profile's target and the model's limit. + cur_ctx = min(ctx, model_ctx_max) + while cur_ctx >= 8192: + kv = _kv_gb(model, cur_ctx, kv_type) + n_cpu_moe, fits = _cpu_moe_for_budget(model, quant, kv, budget, fixed_gb=serve_weights_gb) + est = _weights_gb(model, quant, serve_weights_gb) + kv + 0.6 + # If a non-MoE model can't fit even fully offloaded, try less context. 
+ if model.get("is_moe") or fits or cur_ctx <= 8192: + profiles.append({ + "key": key, + "label": label, + "quant": quant, + "n_gpu_layers": 999, + "n_cpu_moe": n_cpu_moe, + "cache_type": kv_type, + "ctx": cur_ctx, + # When experts offload, GPU-resident VRAM tops out at the + # budget (weights beyond it live in system RAM), so cap the + # estimate at `budget`, not the full card — this also leaves + # the vision-encoder headroom visible in the number. + "est_vram_gb": round(min(est, budget), 1), + # For MoE we treat it as fitting via offload; report whether + # it fit WITHOUT offload as the "clean" flag. + "fits": fits or bool(model.get("is_moe")), + "offloads": n_cpu_moe > 0, + "note": note, + }) + break + cur_ctx //= 2 + + # De-dupe identical profiles (e.g. tiny model where all three collapse to the + # same all-GPU config) — keep the first/highest-quality label. + seen = set() + deduped = [] + for p in profiles: + sig = (p["quant"], p["n_cpu_moe"], p["cache_type"], p["ctx"]) + if sig in seen: + continue + seen.add(sig) + deduped.append(p) + return deduped diff --git a/static/js/cookbook-hwfit.js b/static/js/cookbook-hwfit.js index 6ed895d..bd49a17 100644 --- a/static/js/cookbook-hwfit.js +++ b/static/js/cookbook-hwfit.js @@ -365,6 +365,17 @@ function _hwfitShowError(list, host, detail) { if (rb) rb.addEventListener('click', () => { _resetGpuToggleState(); _hwfitFetch(true); }); } +// Client-side "Engine" filter (llama.cpp / vLLM / SGLang). Empty = show all. +// Uses the same _detectBackend() the serve commands use, so what you filter to +// is exactly what would be launched. Pure view filter — no refetch needed. 
+function _applyEngineFilter(models) { + const want = document.getElementById('hwfit-engine')?.value || ''; + if (!want || !Array.isArray(models)) return models || []; + return models.filter(m => { + try { return _detectBackend(m).backend === want; } catch { return true; } + }); +} + export async function _hwfitFetch(fresh = false) { const _tk = ++_hwfitFetchToken; const useCase = document.getElementById('hwfit-usecase')?.value || ''; @@ -384,7 +395,7 @@ export async function _hwfitFetch(fresh = false) { if (_cached) { _hwfitCache = _cached; _hwfitRenderHw(hw, _cached.system); - _hwfitRenderList(list, _cached.models); + _hwfitRenderList(list, _applyEngineFilter(_cached.models)); } else { // Show spinner while scanning — stack the spinner above a text label // (the .hwfit-loading class is a centered flex ROW, so force column here). @@ -530,7 +541,7 @@ export async function _hwfitFetch(fresh = false) { return asc ? av - bv : bv - av; }); } - _hwfitRenderList(list, data.models); + _hwfitRenderList(list, _applyEngineFilter(data.models)); // Persist this result so the next page load can paint it instantly. _writeScanCache(_sig, data); // Render GPU toggles — only on first scan (no override active) @@ -773,9 +784,10 @@ export function _hwfitRenderList(el, models) { const hasHw = sys && ((sys.gpu_vram_gb || 0) > 0 || (sys.total_ram_gb || 0) > 8); const hasFilters = !!(document.getElementById('hwfit-search')?.value?.trim() || document.getElementById('hwfit-usecase')?.value - || document.getElementById('hwfit-quant')?.value); + || document.getElementById('hwfit-quant')?.value + || document.getElementById('hwfit-engine')?.value); let msg; - if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, or quant.'; + if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, quant, or engine.'; else if (hasHw) msg = 'No models fit — the hardware probe may have under-reported. 
Try Rescan.'; else msg = 'No models fit your hardware'; el.innerHTML = `
${msg}
`; @@ -1122,6 +1134,17 @@ export function _hwfitInit() { if (uc) uc.addEventListener('change', () => _hwfitFetch()); if (sort) sort.addEventListener('change', () => _hwfitFetch()); if (qpref) qpref.addEventListener('change', () => _hwfitFetch()); + // Engine filter is a pure client-side view filter over the already-fetched + // list, so just re-render from cache instead of re-probing hardware. + const engine = document.getElementById('hwfit-engine'); + if (engine) engine.addEventListener('change', () => { + const list = document.getElementById('hwfit-list'); + if (list && _hwfitCache && Array.isArray(_hwfitCache.models)) { + _hwfitRenderList(list, _applyEngineFilter(_hwfitCache.models)); + } else { + _hwfitFetch(); + } + }); // Rescan — force a fresh hardware probe (bypasses the per-host cache). const rescan = document.getElementById('hwfit-rescan'); if (rescan && !rescan.dataset.bound) { diff --git a/static/js/cookbook.js b/static/js/cookbook.js index af8d911..8c23a5a 100644 --- a/static/js/cookbook.js +++ b/static/js/cookbook.js @@ -417,11 +417,40 @@ export function _buildServeCmd(f, modelName, backend) { // renders modern GGUF chat templates that the Python bindings' Jinja2 // rejects (do_tojson ensure_ascii). Fall back to llama_cpp.server. // Don't suppress stderr — surface real errors (missing file, lib, OOM). - const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}`; + // Optional perf/fit flags from a hardware profile (see services/hwfit/ + // profiles.py). n_cpu_moe offloads MoE expert layers to CPU when the model + // is bigger than VRAM; flash-attn + a quantized KV cache cut KV memory and + // speed things up. Only emitted when set, so manual/older flows are unchanged. + const _ncm = (f.n_cpu_moe ?? '').toString().trim(); + const _kv = (f.cache_type ?? 
'').toString().trim(); + let _lcExtra = ''; + let _lcpExtra = ''; + if (_ncm !== '' && Number(_ncm) > 0) { + _lcExtra += ` --n-cpu-moe ${_ncm}`; + _lcpExtra += ` --n_cpu_moe ${_ncm}`; // llama-cpp-python uses underscores + } + if (f.flash_attn) { + _lcExtra += ' --flash-attn on'; + _lcpExtra += ' --flash_attn true'; + } + if (_kv) { + _lcExtra += ` --cache-type-k ${_kv} --cache-type-v ${_kv}`; + // llama-cpp-python exposes these as type_k/type_v; pass through best-effort. + _lcpExtra += ` --type_k ${_kv} --type_v ${_kv}`; + } + // Vision: serve the multimodal projector so the model can read images. The + // mmproj path is resolved at runtime (find mmproj-*.gguf next to the model); + // only emitted when the Vision toggle is on AND a projector was found. + if (f.vision && f._mmproj_path) { + _lcExtra += ` --mmproj "${f._mmproj_path}" --image-max-tokens 1024`; + // llama-cpp-python takes the projector via --clip_model_path. + _lcpExtra += ` --clip_model_path "${f._mmproj_path}"`; + } + const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}${_lcpExtra}`; if (_isWindows()) { cmd += _lcpServer; } else { - cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}`; + cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}${_lcExtra}`; cmd += ` || ${_lcpServer}`; } } else if (backend === 'ollama') { @@ -1460,6 +1489,16 @@ function _renderRecipes() { html += ''; html += ''; html += ''; + // Engine filter: show only models whose serve engine matches. "llama.cpp" + // (GGUF) runs everywhere incl. consumer AMD/Apple; vLLM/SGLang are CUDA / + // datacenter-ROCm. Filtering is client-side via _detectBackend() in the + // hwfit renderer, so it composes with the quant/type/search filters. 
+ html += ''; html += ''; html += '
'; html += ''; html += ''; html += ''; diff --git a/static/js/cookbookServe.js b/static/js/cookbookServe.js index 3c3f1a1..0a863db 100644 --- a/static/js/cookbookServe.js +++ b/static/js/cookbookServe.js @@ -542,6 +542,27 @@ function _rerenderCachedModels() { panelHtml += ``; panelHtml += ``; panelHtml += `
`; + // Row 2c: llama.cpp fit/perf flags (set by Auto profiles, editable by hand) + const _kvOpts = ['', 'q4_0', 'q8_0', 'f16'].map(k => ``).join(''); + panelHtml += `
`; + panelHtml += ``; + panelHtml += ``; + panelHtml += ``; + panelHtml += ``; + panelHtml += `
`; + // Row 2d: Auto profiles — computed from detected hardware (see profiles.py). + // Buttons are injected after the panel mounts (needs an async fetch). + panelHtml += `
`; + panelHtml += `Auto profiles:`; + panelHtml += `computing…`; + panelHtml += `
`; + // Live VRAM / RAM-spillover monitor for the serve target's GPU. Polls + // /api/cookbook/gpus while the panel is open so you can SEE whether the + // config fits VRAM (fast) or spills to system RAM (slow). Populated after mount. + panelHtml += `
`; + panelHtml += `GPU memory:`; + panelHtml += `checking…`; + panelHtml += `
`; // Row 3a: Checkboxes (llama.cpp-only) panelHtml += `
`; panelHtml += ``; @@ -641,6 +662,11 @@ function _rerenderCachedModels() { : m.is_local_dir && m.path ? `$({ find ${_ldir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${_ldir} -name '*.gguf' 2>/dev/null | sort; } | head -1)` : `$({ find ${dir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${dir} -name '*.gguf' 2>/dev/null | sort; } | head -1)`; + // Vision: auto-find the mmproj (CLIP/projector) file in the same dir. + // Resolved at runtime so the toggle just works if an mmproj-*.gguf is + // present (downloaded alongside the model). Empty if none → cmd omits it. + const _vsearchdir = (m.is_local_dir && m.path) ? _ldir : dir; + f._mmproj_path = `$(find ${_vsearchdir} -iname 'mmproj*.gguf' 2>/dev/null | sort | head -1)`; } if (f.reasoning_parser) { const _rpEl2 = panel.querySelector('[data-field="reasoning_parser"]'); @@ -655,6 +681,151 @@ function _rerenderCachedModels() { } updateCmd(); + // Context clamp. Two ceilings: + // - ABSOLUTE_CTX_MAX: a hard sanity cap (no LLM trains past ~1M tokens), + // so an obvious typo like 16000000 can never reach llama.cpp even when + // we don't know the model's real limit (not in catalog / profiles + // fetch failed). This is what stops the radv ErrorDeviceLost crash. + // - panel._modelCtxMax: the model's actual trained limit (set by the + // profiles fetch below) — a tighter, model-specific cap when known. + const ABSOLUTE_CTX_MAX = 1048576; // 1M tokens — above any real n_ctx_train + const _ctxEl0 = panel.querySelector('[data-field="ctx"]'); + function _clampCtx(announce) { + if (!_ctxEl0) return; + const cap = panel._modelCtxMax > 0 ? panel._modelCtxMax : ABSOLUTE_CTX_MAX; + const v = parseInt(_ctxEl0.value, 10); + if (Number.isFinite(v) && v > cap) { + _ctxEl0.value = String(cap); + _ctxEl0.title = `Capped to ${panel._modelCtxMax > 0 ? 
"this model's trained limit" : "the maximum sane context"} (${cap}).`; + if (announce) uiModule.showToast(`Context capped to ${cap}`); + updateCmd(); + } + } + if (_ctxEl0) { + _ctxEl0.addEventListener('change', () => _clampCtx(false)); + _ctxEl0.addEventListener('blur', () => _clampCtx(false)); + _clampCtx(false); // fix any stale/preset value already present + } + + // Auto profiles — fetch hardware-computed llama.cpp profiles and render + // them as clickable chips. Clicking one fills the ctx/CPU-MoE/KV/flash + // fields and rebuilds the command. Computed from detected VRAM (see + // services/hwfit/profiles.py); rough on t/s, accurate on fit. + async function _loadServeProfiles() { + const wrap = panel.querySelector('.hwfit-profile-btns'); + if (!wrap) return; + try { + const host = (_es.remoteHost || '').trim(); + const params = new URLSearchParams({ model: repo }); + if (host) { + params.set('host', host); + const _sp = (_es.servers || []).find(s => s.host === host)?.port; + if (_sp) params.set('ssh_port', _sp); + } + // SERVE mode: this is a specific GGUF file already on disk, so its quant + // is fixed — tell the profiler the file's real size + quant so it varies + // only the serving knobs (KV/ctx/offload), not the quant. Parse the size + // from m.size (e.g. "20.6 GB") and the quant from the file/repo name. + const _sizeMatch = String(m.size || '').match(/([\d.]+)\s*GB/i); + if (_sizeMatch) params.set('serve_weights_gb', _sizeMatch[1]); + const _qMatch = String(repo).match(/(Q\d[\w]*|IQ\d[\w]*|F16|BF16|FP8)/i); + if (_qMatch) params.set('serve_quant', _qMatch[1]); + const res = await fetch(`/api/hwfit/profiles?${params}`); + const data = await res.json(); + // Remember the model's trained context limit and clamp the ctx field + // to it — asking llama.cpp for ctx > n_ctx_train overflows and, with a + // quantized KV cache, can crash the GPU (radv ErrorDeviceLost). 
+ const ctxMax = Number(data && data.model_ctx_max) || 0; + if (ctxMax > 0) { + panel._modelCtxMax = ctxMax; // tighten the clamp to the real limit + _clampCtx(false); // re-apply now that we know the model's max + } + const profs = (data && Array.isArray(data.profiles)) ? data.profiles : []; + if (!profs.length) { wrap.innerHTML = `no auto profile for this model`; return; } + wrap.innerHTML = ''; + for (const p of profs) { + const b = document.createElement('button'); + b.type = 'button'; + b.className = 'cookbook-btn hwfit-profile-chip'; + b.style.cssText = 'height:24px;padding:0 9px;font-size:11px;'; + const off = p.offloads ? `, ncm${p.n_cpu_moe}` : ', all-GPU'; + b.textContent = `${p.label} · ${p.quant} · ${Math.round(p.ctx/1024)}k${off}`; + b.title = `${p.note}\nKV ${p.cache_type}, ~${p.est_vram_gb} GB VRAM`; + b.addEventListener('click', () => { + const set = (field, val) => { + const el = panel.querySelector(`[data-field="${field}"]`); + if (!el) return; + if (el.type === 'checkbox') el.checked = !!val; else el.value = val; + }; + set('ctx', p.ctx); + set('n_cpu_moe', p.n_cpu_moe || ''); + set('cache_type', p.cache_type || ''); + set('flash_attn', true); // required for a quantized KV cache + wrap.querySelectorAll('.hwfit-profile-chip').forEach(x => x.classList.remove('cookbook-btn-active')); + b.classList.add('cookbook-btn-active'); + updateCmd(); + }); + wrap.appendChild(b); + } + } catch { + wrap.innerHTML = `profile compute failed`; + } + } + _loadServeProfiles(); + + // Live GPU-memory monitor: poll /api/cookbook/gpus and show VRAM usage + + // RAM-spillover, with a plain-language health/speed hint. Lets you tell at + // a glance whether the chosen config fits VRAM (fast) or is paging into + // system RAM over PCIe (slow). AMD sysfs reports gtt_used_mb for spillover. 
+ async function _refreshVramMonitor() { + const el = panel.querySelector('.hwfit-vram-readout'); + if (!el || !document.body.contains(el)) return false; // panel closed → stop + try { + const host = (_es.remoteHost || '').trim(); + const params = new URLSearchParams(); + if (host) { + params.set('host', host); + const _sp = (_es.servers || []).find(s => s.host === host)?.port; + if (_sp) params.set('ssh_port', _sp); + } + const res = await fetch('/api/cookbook/gpus' + (params.toString() ? '?' + params : '')); + const data = await res.json(); + const gpus = Array.isArray(data) ? data : (data.gpus || []); + if (!gpus.length) { el.textContent = 'no GPU detected'; el.style.color = ''; return true; } + const g = gpus[0]; + const usedG = (g.used_mb / 1024), totG = (g.total_mb / 1024); + const pct = totG ? Math.round((usedG / totG) * 100) : 0; + const freeG = Math.max(0, totG - usedG); + const spillG = (g.gtt_used_mb || 0) / 1024; + // Color: green < 85%, amber 85-96%, red >= 97% or spilling. + const spilling = spillG > 0.5 && !g.unified_memory; // unified APUs always use GTT; not a spill + let color = 'var(--green, #50fa7b)'; + if (pct >= 97 || spilling) color = 'var(--red, #ff5555)'; + else if (pct >= 85) color = 'var(--orange, #ffb86c)'; + let txt = `${usedG.toFixed(1)} / ${totG.toFixed(1)} GB (${pct}%) · ${freeG.toFixed(1)} GB free`; + if (spilling) { + txt += ` · ⚠ ${spillG.toFixed(1)} GB spilled to RAM — slow (raise CPU MoE or lower context)`; + } else if (pct >= 90) { + txt += ` · tight — risk of OOM/spill on long context or images`; + } else { + txt += ` · healthy`; + } + el.textContent = txt; + el.style.color = color; + return true; + } catch { + el.textContent = 'unavailable'; + el.style.color = ''; + return true; + } + } + _refreshVramMonitor(); + // Poll every 4s while the panel is open; stop when it's removed from the DOM. 
+ const _vramTimer = setInterval(async () => { + const ok = await _refreshVramMonitor(); + if (ok === false) clearInterval(_vramTimer); + }, 4000); + // Show/hide backend-specific sections function updateBackendVisibility() { const b = panel.querySelector('[data-field="backend"]')?.value || 'vllm'; @@ -1313,6 +1484,12 @@ function _rerenderCachedModels() { // Launch button panel.querySelector('.hwfit-serve-launch').addEventListener('click', async (ev) => { const _launchBtn = ev.currentTarget; + // Final safety net: never launch with ctx beyond the model's trained + // limit (or the absolute sanity ceiling when the limit is unknown). A + // stale preset or typo (e.g. 16000000) overflows and, with a quantized + // KV cache, can crash the GPU. Skip only if the user hand-edited the raw + // command (then we respect their literal text). + if (!_cmdManuallyEdited) _clampCtx(true); if (!_cmdManuallyEdited) updateCmd(); const launchCmd = _cmdTextarea ? _cmdTextarea.value.trim() : panel._cmd; const serveState = {}; diff --git a/static/style.css b/static/style.css index bfd8b4c..a20127b 100644 --- a/static/style.css +++ b/static/style.css @@ -1744,6 +1744,12 @@ body.bg-pattern-sparkles { padding-left: max(0px, calc((100% - var(--chat-max)) / 2)); padding-right: max(12px, calc((100% - var(--chat-max)) / 2 + 12px)); } + /* Sortable Cookbook column headers had no visual cue, so users couldn't tell + a header was clickable (the Newest sort on the Model column was invisible). + Show a pointer cursor and a dotted underline on hover, and bold the active sort column. 
*/ + .hwfit-header .hwfit-sortable { cursor: pointer; transition: color .12s; } + .hwfit-header .hwfit-sortable:hover { color: var(--fg); text-decoration: underline dotted; } + .hwfit-header .hwfit-sort-active { color: var(--fg); font-weight: 600; } /* Welcome screen — centered in available space above input bar */ #welcome-screen { position:absolute; diff --git a/tests/test_serve_profiles.py b/tests/test_serve_profiles.py new file mode 100644 index 0000000..b7b4ef1 --- /dev/null +++ b/tests/test_serve_profiles.py @@ -0,0 +1,110 @@ +"""Intelligent llama.cpp serve profiles computed from hardware. + +Locks in that compute_serve_profiles() turns detected VRAM + model size into +sane Quality/Balanced/Speed flag sets: a too-big MoE offloads experts to CPU +(n_cpu_moe > 0) instead of failing, a model that fits stays fully on GPU +(n_cpu_moe == 0), context shrinks before giving up, and quant choice tracks the +profile intent. +""" + +from services.hwfit.profiles import compute_serve_profiles + +_QWEN_35B_MOE = { + "name": "Qwen3.6-35B-A3B", + "parameter_count": "35B", + "is_moe": True, + "active_parameters": 3_000_000_000, + "num_hidden_layers": 48, +} +_DENSE_8B = { + "name": "Qwen3-8B", + "parameter_count": "8B", + "is_moe": False, + "num_hidden_layers": 36, +} + + +def _sys(vram, family="rdna"): + return {"backend": "rocm", "gpu_vram_gb": vram, "gpu_family": family} + + +def test_big_moe_on_small_card_offloads_not_fails(): + """A 35B MoE can't hold its weights on 16 GB, so the Quality profile must + offload experts to CPU (n_cpu_moe > 0) rather than be dropped.""" + profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE) + assert profs, "expected at least one profile" + q = next(p for p in profs if p["key"] == "quality") + assert q["n_cpu_moe"] > 0 + assert q["offloads"] is True + assert q["cache_type"] == "q8_0" # quality uses the sharp KV cache + assert q["est_vram_gb"] <= 16.0 # never exceeds the card + + +def test_profiles_never_exceed_vram(): + """Every 
profile's VRAM estimate must fit the detected card.""" + for vram in (8.0, 12.0, 16.0, 24.0): + for p in compute_serve_profiles(_sys(vram), _QWEN_35B_MOE): + assert p["est_vram_gb"] <= vram + 0.05, (vram, p) + + +def test_small_model_stays_fully_on_gpu(): + """A model whose weights fit must NOT offload — n_cpu_moe == 0 everywhere.""" + for p in compute_serve_profiles(_sys(15.9), _DENSE_8B): + assert p["n_cpu_moe"] == 0 + assert p["offloads"] is False + + +def test_speed_profile_is_lighter_than_quality(): + """Speed trades quant/context for less offload than Quality.""" + profs = {p["key"]: p for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE)} + if "speed" in profs and "quality" in profs: + assert profs["speed"]["n_cpu_moe"] <= profs["quality"]["n_cpu_moe"] + assert profs["speed"]["ctx"] <= profs["quality"]["ctx"] + + +def test_flags_are_launchable(): + """Each profile must carry the concrete llama.cpp flags the cmd builder needs.""" + for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE): + assert p["n_gpu_layers"] == 999 + assert isinstance(p["n_cpu_moe"], int) and p["n_cpu_moe"] >= 0 + assert p["cache_type"] in ("q4_0", "q8_0", "f16") + assert p["ctx"] >= 8192 + assert p["quant"] + + +def test_context_capped_at_model_limit(): + """Profiles must never propose more context than the model was trained for + — over-asking triggers a training-context overflow and, with a quantized KV + cache, a GPU OOM/device-lost crash.""" + small_ctx_model = dict(_QWEN_35B_MOE, name="X", context_length=32768) + for p in compute_serve_profiles(_sys(15.9), small_ctx_model): + assert p["ctx"] <= 32768, p + + +def test_no_gpu_returns_empty(): + """No VRAM detected → no GPU profiles (caller falls back to manual flags).""" + assert compute_serve_profiles({"backend": "cpu_x86", "gpu_vram_gb": 0}, _QWEN_35B_MOE) == [] + + +def test_vision_model_leaves_encoder_headroom(): + """A vision model must budget extra VRAM for the image encoder, so its + estimate leaves more slack below 
the card than a text model would.""" + vis = dict(_QWEN_35B_MOE, name="Qwen3-VL-35B", is_multimodal=True) + for p in compute_serve_profiles(_sys(15.9), vis): + assert p["est_vram_gb"] <= 15.9 - 1.0 + 0.05 # ~1.1 GB encoder headroom + + +def test_serve_mode_keeps_fixed_quant(): + """Serving a specific GGUF file: the quant is fixed (the file's), so every + profile must keep it and vary only the serving knobs (KV/ctx/offload) — not + propose a different quant (which makes no sense for an on-disk file).""" + profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE, + serve_weights_gb=20.6, serve_quant="Q4_K_M") + assert profs + assert all(p["quant"] == "Q4_K_M" for p in profs), [p["quant"] for p in profs] + # The knobs should still differ across profiles (KV type and/or context). + kvs = {p["cache_type"] for p in profs} + ctxs = {p["ctx"] for p in profs} + assert len(kvs) > 1 or len(ctxs) > 1, "serve profiles are identical" + # All must fit the card. + assert all(p["est_vram_gb"] <= 16.0 for p in profs)