odysseus/tests/test_serve_profiles.py
Leo 6fca7e86b7 Cookbook serve profiles and engine filter
* Cookbook: Engine filter + intelligent hardware-computed serve profiles

Two related Cookbook serving improvements for accurate, hardware-aware model
serving (especially on consumer GPUs that can only run GGUF/llama.cpp).

Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant
  picker. Pure client-side view filter over the fetched list via the same
  _detectBackend() the serve commands use, so what you filter to is exactly what
  would launch. Re-renders from cache (no refetch). The empty-state message and
  the instant-cache-paint path both honor the filter.
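
A minimal sketch of the view filter described above, written in Python for illustration (the real filter runs client-side). `detect_backend` and `filter_by_engine` are hypothetical names, and the detection rule is a toy stand-in for `_detectBackend()`, whose actual logic is not shown in this commit message.

```python
from typing import Callable, Iterable, List


def detect_backend(model: dict) -> str:
    """Toy stand-in (assumption): GGUF repos run on llama.cpp, the rest on vLLM."""
    return "llama.cpp" if "gguf" in model.get("name", "").lower() else "vllm"


def filter_by_engine(
    models: Iterable[dict],
    engine: str = "all",
    detect: Callable[[dict], str] = detect_backend,
) -> List[dict]:
    """Filter an already-fetched list in memory; no refetch is involved."""
    if engine == "all":
        return list(models)
    # Use the same detector the serve command would use, so the filtered view
    # and the launched backend cannot disagree.
    return [m for m in models if detect(m) == engine]
```

Passing the detector in as a parameter keeps the filter and the command builder on one code path, which is the property the commit relies on.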

Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM +
  model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type,
  context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU
  instead of failing; a model that fits stays fully on GPU; quant tracks profile
  intent; vision models keep image-encoder headroom. Reuses models.py VRAM math
  so filtering and serving agree on what fits. Pure/deterministic (no t/s claims
  — partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained
  context limit, with loose name matching (strips org/ prefix, -GGUF suffix,
  quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn /
  --cache-type-k/v when set, with llama-cpp-python fallback equivalents. It
  previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV
  Cache / Flash Attn fields. Context is clamped to the model's trained limit
  (and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch —
  fixes a crash where a stale 256k/16M preset + quantized KV cache caused an
  amdgpu ErrorDeviceLost.
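
Two of the steps above, loose name matching and context clamping, can be sketched as follows. This is an illustration under assumptions, not the shipped code: `normalize_model_name`, `clamp_context`, the quant-tag pattern, and the exact value used for the "1M" ceiling are all hypothetical.

```python
import re
from typing import Optional

ABSOLUTE_CTX_CEILING = 1_048_576  # assumed value for the "1M" sanity ceiling

# Assumed patterns; the real matcher may recognise more tags.
_QUANT_TAG = re.compile(r"[-_.](?:I?Q\d(?:_[A-Z0-9]+)*|BF16|F16|F32)$", re.IGNORECASE)
_GGUF_SUFFIX = re.compile(r"[-_.]GGUF$", re.IGNORECASE)


def normalize_model_name(name: str) -> str:
    """Strip org/ prefix, -GGUF suffix and quant tag so folder names match the catalog."""
    base = name.strip().rstrip("/").split("/")[-1]
    for _ in range(2):  # the tag and the suffix can appear in either order
        base = _QUANT_TAG.sub("", base)
        base = _GGUF_SUFFIX.sub("", base)
    return base.lower()


def clamp_context(requested: int, trained_limit: Optional[int] = None) -> int:
    """Clamp a requested context to the trained limit and the absolute ceiling."""
    ceiling = ABSOLUTE_CTX_CEILING
    if trained_limit and trained_limit > 0:
        ceiling = min(ceiling, trained_limit)
    return max(1, min(int(requested), ceiling))
```

With a trained limit of 131072, a stale 16M preset would be reduced to 131072 before launch; without a known limit, only the absolute ceiling applies.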

Tests: tests/test_serve_profiles.py (7) — offload vs full-GPU fit, never exceed
VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd
green; verified live on an RDNA4 box (gfx1200) — Balanced lands ~ncm18 q4 128k,
matching hand-tuning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook: make column-header sorting discoverable (incl. Newest)

Sorting in Cookbook is via clickable column headers (pewds' design), but the
headers had no visual cue that they're interactive — so sorting in general, and
the Newest sort on the Model header specifically, was undiscoverable.

- Style sortable headers as interactive: pointer cursor, hover underline, and
  the active sort column bolded/highlighted. There was no CSS for
  .hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort,
  not just Newest.
- The Model column header sorts by release_date (newest first), reusing the
  existing header-click sort wiring and the "newest" SORT_KEY.

No new sort control — uses the existing column-header paradigm.
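
For illustration, the newest-first ordering can be sketched in Python as below. `sort_newest_first` is a hypothetical helper, and the sketch assumes `release_date` is an ISO `YYYY-MM-DD` string that may be missing; the actual sort lives in the existing header-click wiring.

```python
from typing import Iterable, List


def sort_newest_first(models: Iterable[dict]) -> List[dict]:
    """Order models by release_date, newest first.

    ISO dates sort lexicographically, so no date parsing is needed. A missing
    date is treated as "" and therefore lands at the end.
    """
    return sorted(models, key=lambda m: m.get("release_date") or "", reverse=True)
```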

Checks: node --check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)

In the Serve tab the model is a specific GGUF file already on disk, so its quant
can't change — but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K"
as if you could re-quantize it. That's meaningless when serving a fixed file.

- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE
  mode), the quant is locked to the file's and profiles differ only in the real
  serving knobs — n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget
  use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode
  (no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from
  the repo/file name) and passes them, so profiles match what's actually served.
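
A rough sketch of the parsing step described above, in Python for illustration (the Serve panel does this client-side). The helper names, the regular expressions, and the 1024-based unit conversion are assumptions; only the `serve_weights_gb` and `serve_quant` parameter names come from this change.

```python
import re
from typing import Optional

_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(MB|GB|TB)", re.IGNORECASE)
_QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])(I?Q\d(?:_[A-Za-z0-9]+)*|BF16|F16|F32)(?![A-Za-z0-9])",
    re.IGNORECASE,
)
_TO_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}  # assumed conversion


def parse_size_gb(text: str) -> Optional[float]:
    """Turn a display string such as "20.6 GB" into a number of GB."""
    m = _SIZE_RE.search(text or "")
    if not m:
        return None
    return float(m.group(1)) * _TO_GB[m.group(2).upper()]


def parse_quant(name: str) -> Optional[str]:
    """Pull a quant tag such as Q4_K_M out of a repo or file name."""
    m = _QUANT_RE.search(name or "")
    return m.group(1).upper() if m else None


def serve_overrides(size_text: str, file_name: str) -> dict:
    """Build the optional query parameters; omit anything that failed to parse."""
    params = {}
    size = parse_size_gb(size_text)
    quant = parse_quant(file_name)
    if size is not None:
        params["serve_weights_gb"] = size
    if quant:
        params["serve_quant"] = quant
    return params
```

Omitting a parameter that could not be parsed leaves the endpoint in DOWNLOAD mode for that value instead of locking the profiles to a wrong size or quant.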

Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by
KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k
ncm15) — no nonsensical quant changes.

Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor

Two serve-panel additions:

1. **Vision toggle.** A "Vision" checkbox that serves the model with its
   multimodal projector so it can read images. The mmproj path is resolved at
   runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in
   the model folder makes the toggle just work. `--mmproj … --image-max-tokens
   1024` (native) / `--clip_model_path` (llama-cpp-python) are emitted only when
   the toggle is on and an mmproj file is found.

2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s
   while the panel is open and shows VRAM used/total/%, free, and — crucially on
   a discrete card — **RAM spillover** (AMD gtt_used_mb), with a plain-language
   health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise
   CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint
   (previously read for total only and discarded for 'used').

Lets you see at a glance whether a config fits VRAM (fast) or is paging to system
RAM over PCIe (slow) instead of guessing.
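
The health hint can be sketched as a small classifier. This is an illustration only: `vram_health`, the 90% "tight" threshold, and the 256 MB spill floor are assumptions, not values taken from the panel code.

```python
def vram_health(
    used_mb: float,
    total_mb: float,
    gtt_used_mb: float = 0.0,
    tight_pct: float = 90.0,       # assumed threshold for "tight"
    spill_floor_mb: float = 256.0,  # assumed floor so a small GTT baseline is not flagged
) -> str:
    """Classify a GPU-memory reading as healthy, tight, or spilled to system RAM."""
    if total_mb <= 0:
        return "unknown"
    if gtt_used_mb > spill_floor_mb:
        # Weights or KV cache are being served from system RAM over PCIe.
        return "spilled"
    if 100.0 * used_mb / total_mb >= tight_pct:
        return "tight"
    return "healthy"
```

Spillover is checked first because a card can report moderate VRAM use while still paging to RAM, and that is the case the red hint is meant to catch.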

Checks: node --check + py_compile pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 12:34:42 +09:00


"""Intelligent llama.cpp serve profiles computed from hardware.

Locks in that compute_serve_profiles() turns detected VRAM + model size into
sane Quality/Balanced/Speed flag sets: a too-big MoE offloads experts to CPU
(n_cpu_moe > 0) instead of failing, a model that fits stays fully on GPU
(n_cpu_moe == 0), context shrinks before giving up, and quant choice tracks the
profile intent.
"""
from services.hwfit.profiles import compute_serve_profiles

_QWEN_35B_MOE = {
    "name": "Qwen3.6-35B-A3B",
    "parameter_count": "35B",
    "is_moe": True,
    "active_parameters": 3_000_000_000,
    "num_hidden_layers": 48,
}

_DENSE_8B = {
    "name": "Qwen3-8B",
    "parameter_count": "8B",
    "is_moe": False,
    "num_hidden_layers": 36,
}


def _sys(vram, family="rdna"):
    return {"backend": "rocm", "gpu_vram_gb": vram, "gpu_family": family}


def test_big_moe_on_small_card_offloads_not_fails():
    """A 35B MoE can't hold its weights on 16 GB, so the Quality profile must
    offload experts to CPU (n_cpu_moe > 0) rather than be dropped."""
    profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE)
    assert profs, "expected at least one profile"
    q = next(p for p in profs if p["key"] == "quality")
    assert q["n_cpu_moe"] > 0
    assert q["offloads"] is True
    assert q["cache_type"] == "q8_0"   # quality uses the sharp KV cache
    assert q["est_vram_gb"] <= 16.0    # never exceeds the card


def test_profiles_never_exceed_vram():
    """Every profile's VRAM estimate must fit the detected card."""
    for vram in (8.0, 12.0, 16.0, 24.0):
        for p in compute_serve_profiles(_sys(vram), _QWEN_35B_MOE):
            assert p["est_vram_gb"] <= vram + 0.05, (vram, p)


def test_small_model_stays_fully_on_gpu():
    """A model whose weights fit must NOT offload: n_cpu_moe == 0 everywhere."""
    for p in compute_serve_profiles(_sys(15.9), _DENSE_8B):
        assert p["n_cpu_moe"] == 0
        assert p["offloads"] is False


def test_speed_profile_is_lighter_than_quality():
    """Speed trades quant/context for less offload than Quality."""
    profs = {p["key"]: p for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE)}
    if "speed" in profs and "quality" in profs:
        assert profs["speed"]["n_cpu_moe"] <= profs["quality"]["n_cpu_moe"]
        assert profs["speed"]["ctx"] <= profs["quality"]["ctx"]


def test_flags_are_launchable():
    """Each profile must carry the concrete llama.cpp flags the cmd builder needs."""
    for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE):
        assert p["n_gpu_layers"] == 999
        assert isinstance(p["n_cpu_moe"], int) and p["n_cpu_moe"] >= 0
        assert p["cache_type"] in ("q4_0", "q8_0", "f16")
        assert p["ctx"] >= 8192
        assert p["quant"]


def test_context_capped_at_model_limit():
    """Profiles must never propose more context than the model was trained for:
    over-asking triggers a training-context overflow and, with a quantized KV
    cache, a GPU OOM/device-lost crash."""
    small_ctx_model = dict(_QWEN_35B_MOE, name="X", context_length=32768)
    for p in compute_serve_profiles(_sys(15.9), small_ctx_model):
        assert p["ctx"] <= 32768, p


def test_no_gpu_returns_empty():
    """No VRAM detected → no GPU profiles (caller falls back to manual flags)."""
    assert compute_serve_profiles({"backend": "cpu_x86", "gpu_vram_gb": 0}, _QWEN_35B_MOE) == []


def test_vision_model_leaves_encoder_headroom():
    """A vision model must budget extra VRAM for the image encoder, so its
    estimate leaves more slack below the card than a text model would."""
    vis = dict(_QWEN_35B_MOE, name="Qwen3-VL-35B", is_multimodal=True)
    for p in compute_serve_profiles(_sys(15.9), vis):
        assert p["est_vram_gb"] <= 15.9 - 1.0 + 0.05  # ~1.1 GB encoder headroom


def test_serve_mode_keeps_fixed_quant():
    """Serving a specific GGUF file: the quant is fixed (the file's), so every
    profile must keep it and vary only the serving knobs (KV/ctx/offload) rather
    than propose a different quant (which makes no sense for an on-disk file)."""
    profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE,
                                   serve_weights_gb=20.6, serve_quant="Q4_K_M")
    assert profs
    assert all(p["quant"] == "Q4_K_M" for p in profs), [p["quant"] for p in profs]
    # The knobs should still differ across profiles (KV type and/or context).
    kvs = {p["cache_type"] for p in profs}
    ctxs = {p["ctx"] for p in profs}
    assert len(kvs) > 1 or len(ctxs) > 1, "serve profiles are identical"
    # All must fit the card.
    assert all(p["est_vram_gb"] <= 16.0 for p in profs)