Cookbook serve profiles and engine filter

* Cookbook: Engine filter + intelligent hardware-computed serve profiles

Two related Cookbook serving improvements for accurate, hardware-aware model serving (especially on consumer GPUs that can only run GGUF/llama.cpp).

Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant picker. Pure client-side view filter over the fetched list via the same _detectBackend() the serve commands use, so what you filter to is exactly what would launch. Re-renders from cache (no refetch). Empty-state message + the instant-cache-paint path account for it too.

Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM + model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type, context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU instead of failing; a model that fits stays fully on GPU; quant tracks profile intent; vision models keep image-encoder headroom. Reuses models.py VRAM math so filtering and serving agree on what fits. Pure/deterministic (no t/s claims: partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained context limit, with loose name matching (strips org/ prefix, -GGUF suffix, quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn / --cache-type-k/v when set, with llama-cpp-python fallback equivalents. It previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV Cache / Flash Attn fields. Context is clamped to the model's trained limit (and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch; this fixes a crash where a stale 256k/16M preset + quantized KV cache caused an amdgpu ErrorDeviceLost.

Tests: tests/test_serve_profiles.py (7): offload vs full-GPU fit, never exceed VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd green; verified live on an RDNA4 box (gfx1200), where Balanced lands ~ncm18 q4 128k, matching hand-tuning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook: make column-header sorting discoverable (incl. Newest)

Sorting in Cookbook is via clickable column headers (pewds' design), but the headers had no visual cue that they're interactive, so sorting in general, and the Newest sort on the Model header specifically, was undiscoverable.

- Style sortable headers as interactive: pointer cursor, hover underline, and the active sort column bolded/highlighted. There was no CSS for .hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort, not just Newest.
- The Model column header sorts by release_date (newest first), reusing the existing header-click sort wiring and the "newest" SORT_KEY. No new sort control; it uses the existing column-header paradigm.

Checks: node --check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)

In the Serve tab the model is a specific GGUF file already on disk, so its quant can't change, but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K" as if you could re-quantize it. That's meaningless when serving a fixed file.

- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE mode), the quant is locked to the file's and profiles differ only in the real serving knobs: n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode (no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from the repo/file name) and passes them, so profiles match what's actually served.

Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k ncm15), with no nonsensical quant changes.

Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor

Two serve-panel additions:

1. **Vision toggle.** A "Vision" checkbox that serves the model with its multimodal projector so it can read images. The mmproj path is resolved at runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in the model folder makes the toggle just work; `--mmproj … --image-max-tokens 1024` (native) / `--clip_model_path` (llama-cpp-python) only when on + found.
2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s while the panel is open and shows VRAM used/total/%, free, and, crucially on a discrete card, **RAM spillover** (AMD gtt_used_mb), with a plain-language health hint: green/healthy, amber/tight, red/"spilled to RAM, slow (raise CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint (previously read for total only and discarded for 'used'). Lets you see at a glance whether a config fits VRAM (fast) or is paging to system RAM over PCIe (slow) instead of guessing.

Checks: node --check + py_compile pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
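
The context clamp described in the first commit (trained limit plus an absolute 1M ceiling) can be sketched as below. This is only an illustration of the rule, not the Serve panel's actual code, which is front-end code not shown in this excerpt. The function name `clamp_context`, the constant name, and reading "1M" as 2**20 tokens are all assumptions.

```python
# Assumption: "1M" is read here as 2**20 tokens; the real ceiling may be 1_000_000.
ABSOLUTE_CTX_CEILING = 1_048_576


def clamp_context(requested: int, model_ctx_max: int = 0) -> int:
    """Clamp a requested context size (hypothetical helper).

    model_ctx_max is the trained limit reported by /api/hwfit/profiles;
    0 means "unknown", in which case only the absolute ceiling applies.
    """
    ctx = max(1, int(requested))
    if model_ctx_max > 0:
        ctx = min(ctx, model_ctx_max)
    return min(ctx, ABSOLUTE_CTX_CEILING)
```

With a trained limit of 131072, a stale 16M preset would come back as 131072 under this sketch; with no known limit it would fall back to the ceiling.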
2026-06-02 05:34:42 +02:00
parent 8b3c0d8ad4
commit 6fca7e86b7
8 changed files with 658 additions and 6 deletions
--- a/routes/cookbook_routes.py
+++ b/routes/cookbook_routes.py
@@ -1401,9 +1401,16 @@ def setup_cookbook_routes() -> APIRouter:
            total_mb = max(0, int(total_bytes / (1024 * 1024)))
            used_mb = max(0, min(total_mb, int(used_bytes / (1024 * 1024))))
            free_mb = max(0, total_mb - used_mb)
+            # GTT = the system-RAM pool the GPU pages into when VRAM is full.
+            # On a discrete card a large gtt_used means the model spilled past
+            # VRAM into RAM over PCIe — much slower. Surface it so the UI can
+            # warn "spilling to RAM" instead of the user wondering why it's slow.
+            gtt_used_raw = await _gpu_read_file(f"{base}/mem_info_gtt_used", host, ssh_port)
+            gtt_used_mb = max(0, int(int(gtt_used_raw) / (1024 * 1024))) if (gtt_used_raw and gtt_used_raw.isdigit()) else 0
            gpus.append({
                "index": len(gpus), "name": name, "uuid": entry,
                "free_mb": free_mb, "total_mb": total_mb, "used_mb": used_mb,
+                "gtt_used_mb": gtt_used_mb,
                "util_pct": 0, "busy": bool(total_mb and (free_mb / total_mb) < 0.85),
                "processes": [], "backend": "rocm", "source": "amd-sysfs",
                "unified_memory": unified,
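
For readers following the GTT change above, here is a standalone sketch of the same byte-to-MB conversion, paired with one possible way to turn the numbers into the green/amber/red hint the commit message mentions. The helper names `bytes_to_mb` and `memory_hint` and both thresholds (`spill_mb`, `tight_pct`) are hypothetical; the commit does not state the values the UI uses.

```python
from typing import Optional


def bytes_to_mb(raw: Optional[str]) -> int:
    """Convert a sysfs byte-count string to whole MB; anything non-numeric gives 0.

    Unlike the route code, this sketch strips whitespace first (an assumption
    about what the raw file read might contain).
    """
    text = (raw or "").strip()
    if text.isdigit():
        return max(0, int(text) // (1024 * 1024))
    return 0


def memory_hint(used_mb: int, total_mb: int, gtt_used_mb: int,
                spill_mb: int = 512, tight_pct: float = 90.0) -> str:
    """Classify GPU memory state. Thresholds are illustrative assumptions."""
    if gtt_used_mb >= spill_mb:
        return "red"    # spilled to system RAM over PCIe
    if total_mb and (used_mb / total_mb) * 100 >= tight_pct:
        return "amber"  # fits, but little VRAM headroom left
    return "green"
```

A small nonzero `spill_mb` is used because some GTT usage is typically present even when nothing has spilled; the right cutoff would need checking on real hardware.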
--- a/routes/hwfit_routes.py
+++ b/routes/hwfit_routes.py
@@ -1,3 +1,4 @@
+import re
 from copy import deepcopy

 from fastapi import APIRouter
@@ -174,6 +175,64 @@ def setup_hwfit_routes():
        results = rank_models(system, use_case=use_case or None, limit=limit, search=search or None, sort=sort, quant=quant or None)
        return {"system": system, "models": results}

+    @router.get("/profiles")
+    def get_serve_profiles(model: str = "", host: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, serve_weights_gb: float = 0.0, serve_quant: str = ""):
+        """Compute llama.cpp serve profiles (Quality/Balanced/Speed) for `model`
+        against the detected hardware on `host` (or local). Returns concrete
+        flags (n_gpu_layers, n_cpu_moe, cache_type, ctx) the serve UI can apply.
+
+        `model` is matched against the catalog by name. If it is not in the
+        catalog (e.g. an ad-hoc HF repo), a synthetic entry can't be built
+        here, so we return [] and the UI keeps its manual flags.
+        """
+        from services.hwfit.hardware import detect_system
+        from services.hwfit.models import get_models
+        from services.hwfit.profiles import compute_serve_profiles
+        system = detect_system(host=host, ssh_port=ssh_port, platform=platform, fresh=fresh)
+        if system.get("error"):
+            return {"system": system, "profiles": [], "error": system["error"]}
+        catalog = {m.get("name"): m for m in (get_models() or [])}
+
+        def _norm(s):
+            # Normalize for matching: drop org/ prefix, a trailing -GGUF/-gguf
+            # marker, and any quant tag, lowercase. So "DeepSeek-Coder-V2-Lite-
+            # Instruct-GGUF" (a local folder name) matches catalog entry
+            # "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct".
+            s = (s or "").lower().strip()
+            s = s.split("/")[-1]                     # drop org prefix
+            s = re.sub(r"[-_.]?gguf$", "", s)        # drop trailing gguf marker
+            s = re.sub(r"[-_.](q\d[^/]*|iq\d[^/]*|fp8|bf16|f16|awq[^/]*|gptq[^/]*)$", "", s)
+            return s
+
+        m = catalog.get(model)
+        if m is None and model:
+            want = _norm(model)
+            for name, entry in catalog.items():
+                nn = _norm(name)
+                if nn and (nn == want or want.endswith(nn) or nn.endswith(want)):
+                    m = entry
+                    break
+        if m is None:
+            return {"system": system, "profiles": [], "error": "model not in catalog"}
+        # Surface the model's trained context limit so the serve UI can clamp a
+        # user-typed context down to it (asking for ctx > n_ctx_train overflows
+        # and, with a quantized KV cache, can crash the GPU).
+        model_ctx_max = 0
+        for k in ("context_length", "max_position_embeddings", "n_ctx_train", "context"):
+            v = m.get(k)
+            if isinstance(v, (int, float)) and v > 0:
+                model_ctx_max = int(v)
+                break
+        return {
+            "system": system,
+            "profiles": compute_serve_profiles(
+                system, m,
+                serve_weights_gb=(serve_weights_gb or None),
+                serve_quant=(serve_quant or None),
+            ),
+            "model_ctx_max": model_ctx_max,
+        }
+
    @router.get("/image-models")
    def get_image_models(sort: str = "fit", search: str = "", host: str = "", gpu_count: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, manual_mode: str = "", manual_gpu_count: str = "", manual_vram_gb: str = "", manual_ram_gb: str = "", manual_backend: str = "", ignore_detected_gpu: bool = False, ignore_detected_ram: bool = False):
        """Rank image generation models against detected hardware."""
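
The loose name matching added in `get_serve_profiles` can be tried in isolation with the sketch below. The regexes are copied from the hunk above; the function names `normalize_name` and `find_entry` are hypothetical, and the early return on an empty normalized name is an added guard that is not in the committed code.

```python
import re
from typing import Optional

# Trailing quant/precision tag, as in the route's _norm().
_QUANT_TAG = re.compile(r"[-_.](q\d[^/]*|iq\d[^/]*|fp8|bf16|f16|awq[^/]*|gptq[^/]*)$")


def normalize_name(name: Optional[str]) -> str:
    """Lowercase, drop the org/ prefix, a trailing gguf marker, then a quant tag."""
    s = (name or "").lower().strip()
    s = s.split("/")[-1]
    s = re.sub(r"[-_.]?gguf$", "", s)
    return _QUANT_TAG.sub("", s)


def find_entry(model: str, catalog: dict) -> Optional[dict]:
    """Exact lookup first, then the first catalog entry whose normalized name matches."""
    entry = catalog.get(model)
    if entry is not None or not model:
        return entry
    want = normalize_name(model)
    if not want:
        return None  # added guard (assumption): avoid matching everything on ""
    for name, candidate in catalog.items():
        nn = normalize_name(name)
        if nn and (nn == want or want.endswith(nn) or nn.endswith(want)):
            return candidate
    return None
```

Because the gguf marker is stripped before the quant tag, a name such as `Model-7B-Q4_K_M-GGUF` reduces to `model-7b`, whereas `Model-7B-GGUF-Q4_K_M` would keep its `-gguf` suffix; the suffix-based `endswith` comparison is also permissive enough that short catalog names could match unrelated models.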