Cookbook serve profiles and engine filter

* Cookbook: Engine filter + intelligent hardware-computed serve profiles

Two related Cookbook serving improvements for accurate, hardware-aware model
serving (especially on consumer GPUs that can only run GGUF/llama.cpp).

Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant
  picker. Pure client-side view filter over the fetched list via the same
  _detectBackend() the serve commands use, so what you filter to is exactly what
  would launch. Re-renders from cache (no refetch). The empty-state message and
  the instant-cache-paint path also account for the filter.
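
A minimal Python sketch of the same view-filter idea, for illustration only (the real filter is the JavaScript `_applyEngineFilter`). The `detect_backend` stand-in and its name-based rule are hypothetical; the actual `_detectBackend()` logic is not part of this commit.

```python
def detect_backend(model: dict) -> str:
    """Hypothetical stand-in for _detectBackend(): GGUF names map to llama.cpp."""
    name = str(model.get("name", "")).lower()
    return "llamacpp" if "gguf" in name else "vllm"


def apply_engine_filter(models, want, detect=detect_backend):
    """Pure view filter over an already-fetched list. Empty `want` shows all."""
    if not want:
        return list(models)
    kept = []
    for m in models:
        try:
            if detect(m) == want:
                kept.append(m)
        except Exception:
            # Mirror the JS behaviour: keep the row if detection throws.
            kept.append(m)
    return kept
```

Because the filter only reads the cached list, changing the dropdown never needs a new hardware probe.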

Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM +
  model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type,
  context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU
  instead of failing; a model that fits stays fully on GPU; quant tracks profile
  intent; vision models keep image-encoder headroom. Reuses models.py VRAM math
  so filtering and serving agree on what fits. Pure/deterministic (no t/s claims
  — partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained
  context limit, with loose name matching (strips org/ prefix, -GGUF suffix,
  quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn /
  --cache-type-k/v when set, with llama-cpp-python fallback equivalents. It
  previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV
  Cache / Flash Attn fields. Context is clamped to the model's trained limit
  (and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch —
  fixes a crash where a stale 256k/16M preset + quantized KV cache caused an
  amdgpu ErrorDeviceLost.
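
The context clamp in the last bullet can be sketched in Python as follows. This is an illustrative restatement of the panel's JavaScript `_clampCtx`, not the shipped code; `clamp_ctx` is a hypothetical name.

```python
ABSOLUTE_CTX_MAX = 1_048_576  # 1M-token sanity ceiling used when the model limit is unknown


def clamp_ctx(requested: int, model_ctx_max: int = 0) -> int:
    """Clamp a requested context to the model's trained limit when it is known,
    otherwise to the absolute ceiling."""
    cap = model_ctx_max if model_ctx_max > 0 else ABSOLUTE_CTX_MAX
    return min(requested, cap)
```

A stale 16M preset with no known model limit falls back to the 1M ceiling, while a 256k preset on a 128k model is cut to 131072.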

Tests: tests/test_serve_profiles.py (7) — offload vs full-GPU fit, never exceed
VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd
green; verified live on an RDNA4 box (gfx1200) — Balanced lands ~ncm18 q4 128k,
matching hand-tuning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook: make column-header sorting discoverable (incl. Newest)

Sorting in Cookbook is via clickable column headers (pewds' design), but the
headers had no visual cue that they're interactive — so sorting in general, and
the Newest sort on the Model header specifically, was undiscoverable.

- Style sortable headers as interactive: pointer cursor, hover underline, and
  the active sort column bolded/highlighted. There was no CSS for
  .hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort,
  not just Newest.
- The Model column header sorts by release_date (newest first), reusing the
  existing header-click sort wiring and the "newest" SORT_KEY.

No new sort control — uses the existing column-header paradigm.

Checks: node --check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)

In the Serve tab the model is a specific GGUF file already on disk, so its quant
can't change — but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K"
as if you could re-quantize it. That's meaningless when serving a fixed file.

- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE
  mode), the quant is locked to the file's and profiles differ only in the real
  serving knobs — n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget
  use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode
  (no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from
  the repo/file name) and passes them, so profiles match what's actually served.

Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by
KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k
ncm15) — no nonsensical quant changes.
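
A small Python sketch of how the two SERVE-mode hints could be derived from a size label and a file name. The regexes follow the ones the Serve panel uses in JavaScript; `parse_serve_hints` and the example file name are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_serve_hints(size_label: str, name: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (serve_weights_gb, serve_quant), each None when it cannot be parsed."""
    size = re.search(r"([\d.]+)\s*GB", size_label or "", re.IGNORECASE)
    quant = re.search(r"(Q\d\w*|IQ\d\w*|F16|BF16|FP8)", name or "", re.IGNORECASE)
    weights_gb = None
    if size:
        try:
            weights_gb = float(size.group(1))
        except ValueError:
            weights_gb = None  # e.g. a stray "." matched the character class
    return weights_gb, (quant.group(1) if quant else None)
```

When either hint is missing, the endpoint simply receives no override for it and falls back to its DOWNLOAD-mode behaviour for that value.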

Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor

Two serve-panel additions:

1. **Vision toggle.** A "Vision" checkbox that serves the model with its
   multimodal projector so it can read images. The mmproj path is resolved at
   runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in
   the model folder makes the toggle just work. `--mmproj … --image-max-tokens
   1024` (native) / `--clip_model_path` (llama-cpp-python) are emitted only when
   the toggle is on and a projector was found.

2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s
   while the panel is open and shows VRAM used/total/%, free, and — crucially on
   a discrete card — **RAM spillover** (AMD gtt_used_mb), with a plain-language
   health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise
   CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint
   (previously read for total only and discarded for 'used').

Lets you see at a glance whether a config fits VRAM (fast) or is paging to system
RAM over PCIe (slow) instead of guessing.
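
One way the health hint could be derived from the endpoint's fields, as a Python sketch. The thresholds (`tight_pct`, `spill_mb`) are assumptions chosen for illustration; the commit message does not state the values the UI uses.

```python
def vram_health(used_mb: int, total_mb: int, gtt_used_mb: int,
                tight_pct: float = 90.0, spill_mb: int = 512) -> str:
    """Classify GPU memory state as 'green', 'amber' or 'red'.

    tight_pct and spill_mb are hypothetical thresholds, not the UI's real ones.
    """
    if gtt_used_mb > spill_mb:
        return "red"      # spilled to system RAM: raise CPU MoE or lower context
    pct = (used_mb / total_mb * 100.0) if total_mb > 0 else 0.0
    if pct >= tight_pct:
        return "amber"    # fits, but little VRAM headroom left
    return "green"
```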

Checks: node --check + py_compile pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Leo
2026-06-02 05:34:42 +02:00
committed by GitHub
parent 8b3c0d8ad4
commit 6fca7e86b7
8 changed files with 658 additions and 6 deletions


@@ -1401,9 +1401,16 @@ def setup_cookbook_routes() -> APIRouter:
total_mb = max(0, int(total_bytes / (1024 * 1024)))
used_mb = max(0, min(total_mb, int(used_bytes / (1024 * 1024))))
free_mb = max(0, total_mb - used_mb)
# GTT = the system-RAM pool the GPU pages into when VRAM is full.
# On a discrete card a large gtt_used means the model spilled past
# VRAM into RAM over PCIe — much slower. Surface it so the UI can
# warn "spilling to RAM" instead of the user wondering why it's slow.
gtt_used_raw = await _gpu_read_file(f"{base}/mem_info_gtt_used", host, ssh_port)
gtt_used_mb = max(0, int(int(gtt_used_raw) / (1024 * 1024))) if (gtt_used_raw and gtt_used_raw.isdigit()) else 0
gpus.append({
"index": len(gpus), "name": name, "uuid": entry,
"free_mb": free_mb, "total_mb": total_mb, "used_mb": used_mb,
"gtt_used_mb": gtt_used_mb,
"util_pct": 0, "busy": bool(total_mb and (free_mb / total_mb) < 0.85),
"processes": [], "backend": "rocm", "source": "amd-sysfs",
"unified_memory": unified,


@@ -1,3 +1,4 @@
import re
from copy import deepcopy
from fastapi import APIRouter
@@ -174,6 +175,64 @@ def setup_hwfit_routes():
results = rank_models(system, use_case=use_case or None, limit=limit, search=search or None, sort=sort, quant=quant or None)
return {"system": system, "models": results}
@router.get("/profiles")
def get_serve_profiles(model: str = "", host: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, serve_weights_gb: float = 0.0, serve_quant: str = ""):
"""Compute llama.cpp serve profiles (Quality/Balanced/Speed) for `model`
against the detected hardware on `host` (or local). Returns concrete
flags (n_gpu_layers, n_cpu_moe, cache_type, ctx) the serve UI can apply.
`model` is matched against the catalog by name. If it is not in the catalog
(e.g. an ad-hoc HF repo), there is no way to pass a minimal synthetic entry
here, so we return an empty profile list and the UI keeps manual flags.
"""
from services.hwfit.hardware import detect_system
from services.hwfit.models import get_models
from services.hwfit.profiles import compute_serve_profiles
system = detect_system(host=host, ssh_port=ssh_port, platform=platform, fresh=fresh)
if system.get("error"):
return {"system": system, "profiles": [], "error": system["error"]}
catalog = {m.get("name"): m for m in (get_models() or [])}
def _norm(s):
# Normalize for matching: drop org/ prefix, a trailing -GGUF/-gguf
# marker, and any quant tag, lowercase. So "DeepSeek-Coder-V2-Lite-
# Instruct-GGUF" (a local folder name) matches catalog entry
# "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct".
s = (s or "").lower().strip()
s = s.split("/")[-1] # drop org prefix
s = re.sub(r"[-_.]?gguf$", "", s) # drop trailing gguf marker
s = re.sub(r"[-_.](q\d[^/]*|iq\d[^/]*|fp8|bf16|f16|awq[^/]*|gptq[^/]*)$", "", s)
return s
m = catalog.get(model)
if m is None and model:
want = _norm(model)
for name, entry in catalog.items():
nn = _norm(name)
if nn and (nn == want or want.endswith(nn) or nn.endswith(want)):
m = entry
break
if m is None:
return {"system": system, "profiles": [], "error": "model not in catalog"}
# Surface the model's trained context limit so the serve UI can clamp a
# user-typed context down to it (asking for ctx > n_ctx_train overflows
# and, with a quantized KV cache, can crash the GPU).
model_ctx_max = 0
for k in ("context_length", "max_position_embeddings", "n_ctx_train", "context"):
v = m.get(k)
if isinstance(v, (int, float)) and v > 0:
model_ctx_max = int(v)
break
return {
"system": system,
"profiles": compute_serve_profiles(
system, m,
serve_weights_gb=(serve_weights_gb or None),
serve_quant=(serve_quant or None),
),
"model_ctx_max": model_ctx_max,
}
@router.get("/image-models")
def get_image_models(sort: str = "fit", search: str = "", host: str = "", gpu_count: str = "", ssh_port: str = "", platform: str = "", fresh: bool = False, manual_mode: str = "", manual_gpu_count: str = "", manual_vram_gb: str = "", manual_ram_gb: str = "", manual_backend: str = "", ignore_detected_gpu: bool = False, ignore_detected_ram: bool = False):
"""Rank image generation models against detected hardware."""

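The loose name matching in `get_serve_profiles` above can be exercised on its own. A minimal sketch, reusing the same two regexes; `norm` and `names_match` are illustrative names, and unlike the route this version also requires the requested name to be non-empty after normalization.

```python
import re


def norm(name: str) -> str:
    """Lowercase, drop the org/ prefix, a trailing gguf marker, and a quant tag."""
    s = (name or "").lower().strip()
    s = s.split("/")[-1]
    s = re.sub(r"[-_.]?gguf$", "", s)
    s = re.sub(r"[-_.](q\d[^/]*|iq\d[^/]*|fp8|bf16|f16|awq[^/]*|gptq[^/]*)$", "", s)
    return s


def names_match(catalog_name: str, wanted: str) -> bool:
    """True when the normalized names are equal or one is a suffix of the other."""
    nn, want = norm(catalog_name), norm(wanted)
    if not nn or not want:
        return False
    return nn == want or want.endswith(nn) or nn.endswith(want)
```

Suffix matching is deliberately loose, so a short folder name can match more than one catalog entry; the route takes the first hit.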
services/hwfit/profiles.py (new file, 229 lines)

@@ -0,0 +1,229 @@
"""Compute intelligent llama.cpp serve profiles from detected hardware.
Given a system (VRAM/RAM/arch) and a model, produce up to three ready-to-launch
profiles (Quality / Balanced / Speed) with concrete llama.cpp flags
(n_gpu_layers, n_cpu_moe, cache-type, context). This turns the by-hand tuning
(how many MoE layers fit on the GPU, when to spend VRAM on a q8 KV cache vs more
context, how much headroom to leave for a vision encoder) into a formula.
Pure/deterministic — no benchmarking, no I/O. Reuses the same VRAM math as
fit.py/models.py so "what the Cookbook recommends" and "what it serves" agree.
NOTE: token/s figures are NOT computed here — real speed on partial-offload MoE
is CPU-bound and not reliably predictable from specs. The UI labels profiles by
their tradeoff (Quality/Balanced/Speed), and the VRAM fit (the part that decides
whether it even loads) is what's computed from real numbers.
"""
from services.hwfit.models import (
QUANT_BPP,
params_b,
_active_params_b,
is_prequantized,
)
# KV-cache size multiplier by cache type, relative to q8_0: q4_0 is roughly half
# of q8_0, which is roughly half of f16. The 8e-6 base in estimate_memory_gb is
# the q8_0-ish figure; scale from there.
_KV_FACTOR = {"q4_0": 0.5, "q8_0": 1.0, "f16": 2.0}
# Quant ladder from highest quality/size down. A profile that wants "best quant
# that fits fully on GPU" walks this until one fits.
_QUANT_LADDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]
def _weights_gb(model, quant, fixed_gb=None):
"""VRAM for the full weights. When fixed_gb is given (serving a specific GGUF
file already on disk), use its real size — the quant is whatever the file is,
not something we get to pick."""
if fixed_gb and fixed_gb > 0:
return float(fixed_gb)
return params_b(model) * QUANT_BPP.get(quant, 0.58)
def _kv_gb(model, ctx, kv_type):
"""KV-cache VRAM at a context length and cache type."""
kv_params = _active_params_b(model)
return 0.000008 * kv_params * ctx * _KV_FACTOR.get(kv_type, 1.0)
def _n_layers(model):
"""Best-effort total transformer block count (for n-cpu-moe math)."""
for k in ("num_hidden_layers", "n_layers", "num_layers", "block_count"):
v = model.get(k)
if isinstance(v, (int, float)) and v > 0:
return int(v)
# Fallback heuristic by size — most MoE/dense LLMs land 28-64 layers.
pb = params_b(model)
if pb >= 60:
return 64
if pb >= 25:
return 48
if pb >= 12:
return 40
return 32
def _cpu_moe_for_budget(model, quant, kv_gb, vram_budget_gb, fixed_gb=None):
"""How many MoE layers must move to CPU so weights+KV fit vram_budget_gb.
Returns (n_cpu_moe, fits_fully). When the model already fits, n_cpu_moe=0.
Each offloaded layer frees roughly weights/n_layers of VRAM. We only model
this for MoE (where --n-cpu-moe applies); dense models just report whether
they fit at the given n_gpu_layers=999.
"""
weights = _weights_gb(model, quant, fixed_gb)
needed = weights + kv_gb + 0.6 # +0.6 GB runtime/compute buffers
if needed <= vram_budget_gb:
return 0, True
if not model.get("is_moe"):
# Dense: no per-expert offload knob; either it fits or it spills via -ngl.
return 0, False
layers = _n_layers(model)
per_layer = weights / max(layers, 1)
overflow = needed - vram_budget_gb
import math
n = math.ceil(overflow / max(per_layer, 1e-6))
n = max(0, min(n, layers)) # clamp
return n, False
def compute_serve_profiles(system, model, serve_weights_gb=None, serve_quant=None):
"""Return a list of profile dicts for llama.cpp serving of `model` on `system`.
Each profile: {key, label, quant, n_gpu_layers, n_cpu_moe, cache_type, ctx,
est_vram_gb, fits, offloads, note}. Empty list if no GGUF path makes
sense (caller should fall back to manual flags).
DOWNLOAD mode (default): the quant isn't chosen yet, so profiles vary it
(Quality=Q6, Balanced=Q4, Speed=Q2…) to show download options.
SERVE mode (serve_weights_gb set): a specific GGUF file already exists on
disk — its quant is FIXED. Profiles then keep that quant/size and differ only
in the actual serving knobs (n_cpu_moe, KV-cache type, context). serve_quant
is the file's quant label (e.g. "Q4_K_M") just for display.
"""
vram = float(system.get("gpu_vram_gb") or 0)
if vram <= 0:
return []
serve_mode = bool(serve_weights_gb and serve_weights_gb > 0)
# Never propose more context than the model was trained for — asking llama.cpp
# for ctx > n_ctx_train triggers a "training context overflow" and, with a
# quantized KV cache, an oversized allocation that can crash the GPU
# (radv/amdgpu ErrorDeviceLost). Cap every profile at the model's real limit.
model_ctx_max = 0
for k in ("context_length", "max_position_embeddings", "n_ctx_train", "context"):
v = model.get(k)
if isinstance(v, (int, float)) and v > 0:
model_ctx_max = int(v)
break
if model_ctx_max <= 0:
model_ctx_max = 131072 # conservative default when the catalog omits it
# Vision models need headroom for the image encoder (~1 GB on top of weights).
is_vision = bool(
model.get("is_multimodal") or model.get("vision") or model.get("mmproj")
or "vl" in str(model.get("name", "")).lower()
)
headroom = 1.1 if is_vision else 0.4
budget = max(vram - headroom, 1.0)
# Prequantized models (AWQ/GPTQ/FP8) served via GGUF fallback use a fixed ~Q4 quant;
# GGUF models can pick their quant. Pick a sensible per-profile quant.
fixed_quant = model.get("quantization") if is_prequantized(model) else None
is_moe = bool(model.get("is_moe"))
def _pick_quant(prefer, require_full_fit):
"""Choose a quant for a profile.
- fixed_quant (AWQ/GPTQ/FP8 served via GGUF): always that.
- require_full_fit=True (Speed): walk DOWN from `prefer` to the best quant
whose weights fit fully on the GPU (no offload) — fastest.
- require_full_fit=False (Quality on MoE): keep `prefer` even if it must
offload experts to CPU; that's the whole point of n-cpu-moe on a card
too small to hold the weights. For dense models we can't offload
per-expert, so fall back to the largest fully-fitting quant.
"""
if fixed_quant:
return fixed_quant
start = _QUANT_LADDER.index(prefer) if prefer in _QUANT_LADDER else 3
if require_full_fit or not is_moe:
for q in _QUANT_LADDER[start:]:
if _weights_gb(model, q) + 0.6 <= budget:
return q
return _QUANT_LADDER[-1]
# MoE quality: keep the preferred (big) quant; offload handles overflow.
return prefer
if serve_mode:
# Fixed file on disk — quant can't change. Vary only the serving knobs.
fq = serve_quant or model.get("quantization") or "GGUF"
specs = [
# key, label, prefer_quant, full_fit, kv_type, ctx, note
("quality", "Quality", fq, False, "q8_0", 131072,
"Sharp q8 KV cache + full context. Best long-context accuracy; offloads MoE layers to CPU if needed."),
("balanced", "Balanced", fq, False, "q4_0", 131072,
"Compact q4 KV at full context — good speed/quality mix."),
("speed", "Speed", fq, False, "q4_0", 32768,
"Trimmed context + light KV for the fastest tokens/s."),
]
else:
specs = [
# key, label, prefer_quant, full_fit, kv_type, ctx, note
("quality", "Quality", "Q6_K", False, "q8_0", 131072,
"Biggest quant + sharp q8 KV cache. Best answers; offloads MoE layers to CPU if needed."),
("balanced", "Balanced", "Q4_K_M", False, "q4_0", 131072,
"Q4 weights + compact q4 KV. Good speed/quality mix at full context."),
("speed", "Speed", "Q4_K_M", True, "q4_0", 32768,
"Smallest offload + trimmed context for the fastest tokens/s."),
]
profiles = []
for key, label, prefer_q, full_fit, kv_type, ctx, note in specs:
# In serve mode the quant is fixed (the file's); in download mode we pick.
quant = prefer_q if serve_mode else _pick_quant(prefer_q, full_fit)
# Shrink context if even the chosen KV won't fit alongside weights.
# Start from the smaller of the profile's target and the model's limit.
cur_ctx = min(ctx, model_ctx_max)
while cur_ctx >= 8192:
kv = _kv_gb(model, cur_ctx, kv_type)
n_cpu_moe, fits = _cpu_moe_for_budget(model, quant, kv, budget, fixed_gb=serve_weights_gb)
est = _weights_gb(model, quant, serve_weights_gb) + kv + 0.6
# A dense model that doesn't fit at this context retries with half the context (down to 8192).
if model.get("is_moe") or fits or cur_ctx <= 8192:
profiles.append({
"key": key,
"label": label,
"quant": quant,
"n_gpu_layers": 999,
"n_cpu_moe": n_cpu_moe,
"cache_type": kv_type,
"ctx": cur_ctx,
# When experts offload, GPU-resident VRAM tops out at the
# budget (weights beyond it live in system RAM), so cap the
# estimate at `budget`, not the full card — this also leaves
# the vision-encoder headroom visible in the number.
"est_vram_gb": round(min(est, budget), 1),
# For MoE we treat it as fitting via offload; `offloads` below
# reports whether expert layers had to move to CPU.
"fits": fits or bool(model.get("is_moe")),
"offloads": n_cpu_moe > 0,
"note": note,
})
break
cur_ctx //= 2
# De-dupe identical profiles (e.g. tiny model where all three collapse to the
# same all-GPU config) — keep the first/highest-quality label.
seen = set()
deduped = []
for p in profiles:
sig = (p["quant"], p["n_cpu_moe"], p["cache_type"], p["ctx"])
if sig in seen:
continue
seen.add(sig)
deduped.append(p)
return deduped
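
A worked sketch of the fit arithmetic above, under hypothetical inputs: a 20.6 GB MoE file with 3B active parameters and 48 layers, on a 15.6 GB VRAM budget. The function names are illustrative; the formulas mirror `_kv_gb` and `_cpu_moe_for_budget`.

```python
import math

KV_FACTOR = {"q4_0": 0.5, "q8_0": 1.0, "f16": 2.0}


def kv_gb(active_params_b: float, ctx: int, kv_type: str) -> float:
    """KV-cache estimate in GB: 8e-6 * active params (billions) * context * type factor."""
    return 0.000008 * active_params_b * ctx * KV_FACTOR.get(kv_type, 1.0)


def cpu_moe_layers(weights_gb: float, kv: float, budget_gb: float,
                   n_layers: int, overhead_gb: float = 0.6) -> int:
    """MoE layers to move to CPU so weights + KV + buffers fit the VRAM budget."""
    needed = weights_gb + kv + overhead_gb
    if needed <= budget_gb:
        return 0
    per_layer = weights_gb / max(n_layers, 1)   # each offloaded layer frees about this much
    n = math.ceil((needed - budget_gb) / max(per_layer, 1e-6))
    return max(0, min(n, n_layers))


# Hypothetical model and budget, for illustration only.
for kv_type in ("q8_0", "q4_0"):
    kv = kv_gb(3.0, 131072, kv_type)
    print(kv_type, round(kv, 2), "GB KV ->", cpu_moe_layers(20.6, kv, 15.6, 48), "layers on CPU")
```

Halving the KV cache (q8_0 to q4_0) lowers the overflow, so fewer expert layers have to leave the GPU. That is the lever separating the Quality and Balanced profiles in SERVE mode.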


@@ -365,6 +365,17 @@ function _hwfitShowError(list, host, detail) {
if (rb) rb.addEventListener('click', () => { _resetGpuToggleState(); _hwfitFetch(true); });
}
// Client-side "Engine" filter (llama.cpp / vLLM / SGLang). Empty = show all.
// Uses the same _detectBackend() the serve commands use, so what you filter to
// is exactly what would be launched. Pure view filter — no refetch needed.
function _applyEngineFilter(models) {
const want = document.getElementById('hwfit-engine')?.value || '';
if (!want || !Array.isArray(models)) return models || [];
return models.filter(m => {
try { return _detectBackend(m).backend === want; } catch { return true; }
});
}
export async function _hwfitFetch(fresh = false) {
const _tk = ++_hwfitFetchToken;
const useCase = document.getElementById('hwfit-usecase')?.value || '';
@@ -384,7 +395,7 @@ export async function _hwfitFetch(fresh = false) {
if (_cached) {
_hwfitCache = _cached;
_hwfitRenderHw(hw, _cached.system);
-_hwfitRenderList(list, _cached.models);
+_hwfitRenderList(list, _applyEngineFilter(_cached.models));
} else {
// Show spinner while scanning — stack the spinner above a text label
// (the .hwfit-loading class is a centered flex ROW, so force column here).
@@ -530,7 +541,7 @@ export async function _hwfitFetch(fresh = false) {
return asc ? av - bv : bv - av;
});
}
-_hwfitRenderList(list, data.models);
+_hwfitRenderList(list, _applyEngineFilter(data.models));
// Persist this result so the next page load can paint it instantly.
_writeScanCache(_sig, data);
// Render GPU toggles — only on first scan (no override active)
@@ -773,9 +784,10 @@ export function _hwfitRenderList(el, models) {
const hasHw = sys && ((sys.gpu_vram_gb || 0) > 0 || (sys.total_ram_gb || 0) > 8);
const hasFilters = !!(document.getElementById('hwfit-search')?.value?.trim()
|| document.getElementById('hwfit-usecase')?.value
-|| document.getElementById('hwfit-quant')?.value);
+|| document.getElementById('hwfit-quant')?.value
+|| document.getElementById('hwfit-engine')?.value);
let msg;
if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, or quant.';
if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, quant, or engine.';
else if (hasHw) msg = 'No models fit — the hardware probe may have under-reported. Try Rescan.';
else msg = 'No models fit your hardware';
el.innerHTML = `<div class="hwfit-loading">${msg}</div>`;
@@ -1122,6 +1134,17 @@ export function _hwfitInit() {
if (uc) uc.addEventListener('change', () => _hwfitFetch());
if (sort) sort.addEventListener('change', () => _hwfitFetch());
if (qpref) qpref.addEventListener('change', () => _hwfitFetch());
// Engine filter is a pure client-side view filter over the already-fetched
// list, so just re-render from cache instead of re-probing hardware.
const engine = document.getElementById('hwfit-engine');
if (engine) engine.addEventListener('change', () => {
const list = document.getElementById('hwfit-list');
if (list && _hwfitCache && Array.isArray(_hwfitCache.models)) {
_hwfitRenderList(list, _applyEngineFilter(_hwfitCache.models));
} else {
_hwfitFetch();
}
});
// Rescan — force a fresh hardware probe (bypasses the per-host cache).
const rescan = document.getElementById('hwfit-rescan');
if (rescan && !rescan.dataset.bound) {


@@ -417,11 +417,40 @@ export function _buildServeCmd(f, modelName, backend) {
// renders modern GGUF chat templates that the Python bindings' Jinja2
// rejects (do_tojson ensure_ascii). Fall back to llama_cpp.server.
// Don't suppress stderr — surface real errors (missing file, lib, OOM).
-const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}`;
// Optional perf/fit flags from a hardware profile (see services/hwfit/
// profiles.py). n_cpu_moe offloads MoE expert layers to CPU when the model
// is bigger than VRAM; flash-attn + a quantized KV cache cut KV memory and
// speed things up. Only emitted when set, so manual/older flows are unchanged.
const _ncm = (f.n_cpu_moe ?? '').toString().trim();
const _kv = (f.cache_type ?? '').toString().trim();
let _lcExtra = '';
let _lcpExtra = '';
if (_ncm !== '' && Number(_ncm) > 0) {
_lcExtra += ` --n-cpu-moe ${_ncm}`;
_lcpExtra += ` --n_cpu_moe ${_ncm}`; // llama-cpp-python uses underscores
}
if (f.flash_attn) {
_lcExtra += ' --flash-attn on';
_lcpExtra += ' --flash_attn true';
}
if (_kv) {
_lcExtra += ` --cache-type-k ${_kv} --cache-type-v ${_kv}`;
// llama-cpp-python exposes these as type_k/type_v; pass through best-effort.
_lcpExtra += ` --type_k ${_kv} --type_v ${_kv}`;
}
// Vision: serve the multimodal projector so the model can read images. The
// mmproj path is resolved at runtime (find mmproj-*.gguf next to the model);
// only emitted when the Vision toggle is on AND a projector was found.
if (f.vision && f._mmproj_path) {
_lcExtra += ` --mmproj "${f._mmproj_path}" --image-max-tokens 1024`;
// llama-cpp-python takes the projector via --clip_model_path.
_lcpExtra += ` --clip_model_path "${f._mmproj_path}"`;
}
+const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}${_lcpExtra}`;
if (_isWindows()) {
cmd += _lcpServer;
} else {
-cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}`;
+cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}${_lcExtra}`;
cmd += ` || ${_lcpServer}`;
}
} else if (backend === 'ollama') {
@@ -1460,6 +1489,16 @@ function _renderRecipes() {
html += '<option value="Q3_K_M">Q3</option><option value="Q2_K">Q2</option>';
html += '<option value="AWQ-4bit">AWQ</option><option value="FP8">FP8</option>';
html += '<option value="">Native</option></select>';
// Engine filter: show only models whose serve engine matches. "llama.cpp"
// (GGUF) runs everywhere incl. consumer AMD/Apple; vLLM/SGLang are CUDA /
// datacenter-ROCm. Filtering is client-side via _detectBackend() in the
// hwfit renderer, so it composes with the quant/type/search filters.
html += '<select class="cookbook-field-input hwfit-engine" id="hwfit-engine" style="height:28px;" title="Filter by serving engine">';
html += '<option value="">Engine</option>';
html += '<option value="llamacpp">llama.cpp</option>';
html += '<option value="vllm">vLLM</option>';
html += '<option value="sglang">SGLang</option>';
html += '</select>';
html += '</div>';
html += '<div class="hwfit-toolbar" style="margin-top:7px;">';
html += '<select class="cookbook-field-input hwfit-server-select" id="hwfit-server-select" style="height:28px;min-width:88px;position:relative;top:0px;">';
@@ -1469,6 +1508,8 @@ function _renderRecipes() {
// Scan/refresh button (icon-only) where the quant dropdown used to sit.
html += '<button type="button" class="hwfit-gpu-btn" id="hwfit-rescan" title="Re-scan hardware" style="flex-shrink:0;position:relative;top:-3px;left:-1px;">↻ RESCAN</button>';
html += '<button type="button" class="hwfit-gpu-btn hwfit-hw-manual-btn" id="hwfit-hw-manual-btn" title="Set hardware manually" style="flex-shrink:0;position:relative;top:-3px;left:-1px;">EDIT</button>';
// Sort state — the clickable column headers read/write this (pewds' original
// sort paradigm). Newest is reachable by clicking the Model column header.
html += '<select class="cookbook-field-input hwfit-sort" id="hwfit-sort" style="display:none">';
html += '<option value="fit">Fit</option><option value="score">Score</option><option value="vram">VRAM</option>';
html += '<option value="speed">Speed</option><option value="params">Params</option>';
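
The optional-flag logic added to `_buildServeCmd` above, restated as a Python sketch for the native `llama-server` path only. `llama_server_extra_flags` and the `mmproj_path` key are hypothetical names; the flags themselves are the ones this commit emits.

```python
def llama_server_extra_flags(fields: dict) -> list:
    """Build the optional llama-server flags; nothing is emitted for unset fields."""
    flags = []
    ncm = str(fields.get("n_cpu_moe") or "").strip()
    try:
        if ncm and float(ncm) > 0:
            flags += ["--n-cpu-moe", ncm]
    except ValueError:
        pass  # non-numeric input: skip the flag, as Number(...) > 0 would in JS
    if fields.get("flash_attn"):
        flags += ["--flash-attn", "on"]
    kv = str(fields.get("cache_type") or "").strip()
    if kv:
        flags += ["--cache-type-k", kv, "--cache-type-v", kv]
    if fields.get("vision") and fields.get("mmproj_path"):
        flags += ["--mmproj", fields["mmproj_path"], "--image-max-tokens", "1024"]
    return flags
```

Returning a list rather than a joined string keeps the sketch usable with `subprocess.run([...])`; the shipped code builds a shell string instead because it falls back to `llama_cpp.server` with `||`.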


@@ -542,6 +542,27 @@ function _rerenderCachedModels() {
panelHtml += `<label class="hwfit-sf-cb"><input type="checkbox" class="hwfit-sf" data-field="prefix_cache"${sv('prefix_cache',false)?' checked':''} /> Prefix Caching${_h('Cache shared prompt prefixes across requests')}</label>`;
panelHtml += `<label class="hwfit-sf-cb hwfit-backend-vllm"><input type="checkbox" class="hwfit-sf" data-field="auto_tool"${sv('auto_tool',false)?' checked':''} /> Auto Tool Choice${_h('Enable function/tool calling for agent mode')}</label>`;
panelHtml += `</div>`;
// Row 2c: llama.cpp fit/perf flags (set by Auto profiles, editable by hand)
const _kvOpts = ['', 'q4_0', 'q8_0', 'f16'].map(k => `<option value="${k}"${sv('cache_type','')===k?' selected':''}>${k||'default'}</option>`).join('');
panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp">`;
panelHtml += `<label>${_l('CPU MoE','n-cpu-moe: number of MoE expert layers to run on CPU when the model is bigger than VRAM. 0 = all on GPU. Set automatically by the Auto profiles below.')}<input type="text" class="hwfit-sf" data-field="n_cpu_moe" value="${esc(sv('n_cpu_moe',''))}" placeholder="0" style="width:54px;" /></label>`;
panelHtml += `<label>${_l('KV Cache','cache-type-k/v: quantize the KV cache. q4_0 = smallest (more context), q8_0 = sharp long-context, f16 = full. Blank = llama.cpp default.')}<select class="hwfit-sf" data-field="cache_type">${_kvOpts}</select></label>`;
panelHtml += `<label class="hwfit-sf-cb" style="align-self:end;"><input type="checkbox" class="hwfit-sf" data-field="flash_attn"${sv('flash_attn',false)?' checked':''} /> Flash Attn${_h('--flash-attn on: faster attention + needed for quantized KV cache.')}</label>`;
panelHtml += `<label class="hwfit-sf-cb" style="align-self:end;"><input type="checkbox" class="hwfit-sf" data-field="vision"${sv('vision',false)?' checked':''} /> Vision${_h('Serve with the vision encoder so the model can read images. Auto-finds an mmproj-*.gguf next to the model (download one into the model folder). Adds ~1 GB VRAM + a small per-image cost.')}</label>`;
panelHtml += `</div>`;
// Row 2d: Auto profiles — computed from detected hardware (see profiles.py).
// Buttons are injected after the panel mounts (needs an async fetch).
panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp hwfit-serve-profiles" style="align-items:center;gap:8px;">`;
panelHtml += `<span style="opacity:0.7;font-size:11px;">Auto profiles:</span>`;
panelHtml += `<span class="hwfit-profile-btns" style="display:flex;gap:6px;flex-wrap:wrap;"><span style="opacity:0.5;font-size:11px;">computing…</span></span>`;
panelHtml += `</div>`;
// Live VRAM / RAM-spillover monitor for the serve target's GPU. Polls
// /api/cookbook/gpus while the panel is open so you can SEE whether the
// config fits VRAM (fast) or spills to system RAM (slow). Populated after mount.
panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp hwfit-vram-monitor" style="align-items:center;gap:8px;font-size:11px;">`;
panelHtml += `<span style="opacity:0.7;">GPU memory:</span>`;
panelHtml += `<span class="hwfit-vram-readout" style="opacity:0.5;">checking…</span>`;
panelHtml += `</div>`;
// Row 3a: Checkboxes (llama.cpp-only)
panelHtml += `<div class="hwfit-serve-checks hwfit-backend-llamacpp">`;
panelHtml += `<label class="hwfit-sf-cb"><input type="checkbox" class="hwfit-sf" data-field="unified_mem"${sv('unified_mem',false)?' checked':''} /> Unified Memory${_h('For AMD APUs / Strix Halo: exports GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so llama.cpp can address the full BIOS VRAM carveout instead of the default ~28 GB cap. No-op on discrete GPUs.')}</label>`;
@@ -641,6 +662,11 @@ function _rerenderCachedModels() {
: m.is_local_dir && m.path
? `$({ find ${_ldir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${_ldir} -name '*.gguf' 2>/dev/null | sort; } | head -1)`
: `$({ find ${dir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${dir} -name '*.gguf' 2>/dev/null | sort; } | head -1)`;
// Vision: auto-find the mmproj (CLIP/projector) file in the same dir.
// Resolved at runtime so the toggle just works if an mmproj-*.gguf is
// present (downloaded alongside the model). Empty if none → cmd omits it.
const _vsearchdir = (m.is_local_dir && m.path) ? _ldir : dir;
f._mmproj_path = `$(find ${_vsearchdir} -iname 'mmproj*.gguf' 2>/dev/null | sort | head -1)`;
}
if (f.reasoning_parser) {
const _rpEl2 = panel.querySelector('[data-field="reasoning_parser"]');
@@ -655,6 +681,151 @@ function _rerenderCachedModels() {
}
updateCmd();
// Context clamp. Two ceilings:
// - ABSOLUTE_CTX_MAX: a hard sanity cap (very few LLMs train past ~1M tokens),
// so an obvious typo like 16000000 can never reach llama.cpp even when
// we don't know the model's real limit (not in catalog / profiles
// fetch failed). This is what stops the radv ErrorDeviceLost crash.
// - panel._modelCtxMax: the model's actual trained limit (set by the
// profiles fetch below) — a tighter, model-specific cap when known.
const ABSOLUTE_CTX_MAX = 1048576; // 1M tokens, above nearly every real n_ctx_train
const _ctxEl0 = panel.querySelector('[data-field="ctx"]');
function _clampCtx(announce) {
if (!_ctxEl0) return;
const cap = panel._modelCtxMax > 0 ? panel._modelCtxMax : ABSOLUTE_CTX_MAX;
const v = parseInt(_ctxEl0.value, 10);
if (Number.isFinite(v) && v > cap) {
_ctxEl0.value = String(cap);
_ctxEl0.title = `Capped to ${panel._modelCtxMax > 0 ? "this model's trained limit" : "the maximum sane context"} (${cap}).`;
if (announce) uiModule.showToast(`Context capped to ${cap}`);
updateCmd();
}
}
if (_ctxEl0) {
_ctxEl0.addEventListener('change', () => _clampCtx(false));
_ctxEl0.addEventListener('blur', () => _clampCtx(false));
_clampCtx(false); // fix any stale/preset value already present
}
// Auto profiles — fetch hardware-computed llama.cpp profiles and render
// them as clickable chips. Clicking one fills the ctx/CPU-MoE/KV/flash
// fields and rebuilds the command. Computed from detected VRAM (see
// services/hwfit/profiles.py); rough on t/s, accurate on fit.
async function _loadServeProfiles() {
const wrap = panel.querySelector('.hwfit-profile-btns');
if (!wrap) return;
try {
const host = (_es.remoteHost || '').trim();
const params = new URLSearchParams({ model: repo });
if (host) {
params.set('host', host);
const _sp = (_es.servers || []).find(s => s.host === host)?.port;
if (_sp) params.set('ssh_port', _sp);
}
// SERVE mode: this is a specific GGUF file already on disk, so its quant
// is fixed — tell the profiler the file's real size + quant so it varies
// only the serving knobs (KV/ctx/offload), not the quant. Parse the size
// from m.size (e.g. "20.6 GB") and the quant from the file/repo name.
const _sizeMatch = String(m.size || '').match(/([\d.]+)\s*GB/i);
if (_sizeMatch) params.set('serve_weights_gb', _sizeMatch[1]);
const _qMatch = String(repo).match(/(Q\d[\w]*|IQ\d[\w]*|F16|BF16|FP8)/i);
if (_qMatch) params.set('serve_quant', _qMatch[1]);
const res = await fetch(`/api/hwfit/profiles?${params}`);
const data = await res.json();
// Remember the model's trained context limit and clamp the ctx field
// to it — asking llama.cpp for ctx > n_ctx_train overflows and, with a
// quantized KV cache, can crash the GPU (radv ErrorDeviceLost).
const ctxMax = Number(data && data.model_ctx_max) || 0;
if (ctxMax > 0) {
panel._modelCtxMax = ctxMax; // tighten the clamp to the real limit
_clampCtx(false); // re-apply now that we know the model's max
}
const profs = (data && Array.isArray(data.profiles)) ? data.profiles : [];
if (!profs.length) { wrap.innerHTML = `<span style="opacity:0.5;font-size:11px;">no auto profile for this model</span>`; return; }
wrap.innerHTML = '';
for (const p of profs) {
const b = document.createElement('button');
b.type = 'button';
b.className = 'cookbook-btn hwfit-profile-chip';
b.style.cssText = 'height:24px;padding:0 9px;font-size:11px;';
const off = p.offloads ? `, ncm${p.n_cpu_moe}` : ', all-GPU';
b.textContent = `${p.label} · ${p.quant} · ${Math.round(p.ctx/1024)}k${off}`;
b.title = `${p.note}\nKV ${p.cache_type}, ~${p.est_vram_gb} GB VRAM`;
b.addEventListener('click', () => {
const set = (field, val) => {
const el = panel.querySelector(`[data-field="${field}"]`);
if (!el) return;
if (el.type === 'checkbox') el.checked = !!val; else el.value = val;
};
set('ctx', p.ctx);
set('n_cpu_moe', p.n_cpu_moe || '');
set('cache_type', p.cache_type || '');
set('flash_attn', true); // required for a quantized KV cache
wrap.querySelectorAll('.hwfit-profile-chip').forEach(x => x.classList.remove('cookbook-btn-active'));
b.classList.add('cookbook-btn-active');
updateCmd();
});
wrap.appendChild(b);
}
} catch {
wrap.innerHTML = `<span style="opacity:0.5;font-size:11px;">profile compute failed</span>`;
}
}
_loadServeProfiles();
// Live GPU-memory monitor: poll /api/cookbook/gpus and show VRAM usage +
// RAM-spillover, with a plain-language health/speed hint. Lets you tell at
// a glance whether the chosen config fits VRAM (fast) or is paging into
// system RAM over PCIe (slow). AMD sysfs reports gtt_used_mb for spillover.
async function _refreshVramMonitor() {
const el = panel.querySelector('.hwfit-vram-readout');
if (!el || !document.body.contains(el)) return false; // panel closed → stop
try {
const host = (_es.remoteHost || '').trim();
const params = new URLSearchParams();
if (host) {
params.set('host', host);
const _sp = (_es.servers || []).find(s => s.host === host)?.port;
if (_sp) params.set('ssh_port', _sp);
}
const res = await fetch('/api/cookbook/gpus' + (params.toString() ? '?' + params : ''));
const data = await res.json();
const gpus = Array.isArray(data) ? data : (data.gpus || []);
if (!gpus.length) { el.textContent = 'no GPU detected'; el.style.color = ''; return true; }
const g = gpus[0];
const usedG = (g.used_mb / 1024), totG = (g.total_mb / 1024);
const pct = totG ? Math.round((usedG / totG) * 100) : 0;
const freeG = Math.max(0, totG - usedG);
const spillG = (g.gtt_used_mb || 0) / 1024;
// Color: green < 85%, amber 85-96%, red >= 97% or spilling.
const spilling = spillG > 0.5 && !g.unified_memory; // unified APUs always use GTT; not a spill
let color = 'var(--green, #50fa7b)';
if (pct >= 97 || spilling) color = 'var(--red, #ff5555)';
else if (pct >= 85) color = 'var(--orange, #ffb86c)';
let txt = `${usedG.toFixed(1)} / ${totG.toFixed(1)} GB (${pct}%) · ${freeG.toFixed(1)} GB free`;
if (spilling) {
txt += ` · ⚠ ${spillG.toFixed(1)} GB spilled to RAM — slow (raise CPU MoE or lower context)`;
} else if (pct >= 90) {
txt += ` · tight — risk of OOM/spill on long context or images`;
} else {
txt += ` · healthy`;
}
el.textContent = txt;
el.style.color = color;
return true;
} catch {
el.textContent = 'unavailable';
el.style.color = '';
return true;
}
}
_refreshVramMonitor();
// Poll every 4s while the panel is open; stop when it's removed from the DOM.
const _vramTimer = setInterval(async () => {
const ok = await _refreshVramMonitor();
if (ok === false) clearInterval(_vramTimer);
}, 4000);
// Show/hide backend-specific sections
function updateBackendVisibility() {
const b = panel.querySelector('[data-field="backend"]')?.value || 'vllm';
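
The monitor's thresholds are easier to check when separated from the DOM. The following is a sketch only: `classifyVram` is a hypothetical helper, not part of this change. It reads the same fields the panel code reads from `/api/cookbook/gpus` (`used_mb`, `total_mb`, `gtt_used_mb`, `unified_memory`) and, as an added assumption, treats missing numbers as 0.

```javascript
// Hypothetical helper: same thresholds as the monitor, without the DOM.
function classifyVram(gpu) {
  const usedG = (gpu.used_mb || 0) / 1024;
  const totG = (gpu.total_mb || 0) / 1024;
  const pct = totG ? Math.round((usedG / totG) * 100) : 0;
  const spillG = (gpu.gtt_used_mb || 0) / 1024;
  // Unified-memory APUs always use GTT, so that is not counted as a spill.
  const spilling = spillG > 0.5 && !gpu.unified_memory;

  // Colour level: red at >= 97% or when spilling, amber at >= 85%.
  let level = 'green';
  if (pct >= 97 || spilling) level = 'red';
  else if (pct >= 85) level = 'amber';

  // Text hint uses its own threshold (90%), as in the panel code.
  let hint = 'healthy';
  if (spilling) hint = 'spilled';
  else if (pct >= 90) hint = 'tight';

  return { pct, level, hint, spillG };
}

const sampleGpus = [
  { used_mb: 8192, total_mb: 16384 },
  { used_mb: 14336, total_mb: 16384 },
  { used_mb: 8192, total_mb: 16384, gtt_used_mb: 2048 },
  { used_mb: 8192, total_mb: 16384, gtt_used_mb: 2048, unified_memory: true },
];
for (const g of sampleGpus) console.log(classifyVram(g));
```

Because the colour and the hint use different cut-offs, a card between 85% and 89% shows amber while the text still reads "healthy".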
@@ -1313,6 +1484,12 @@ function _rerenderCachedModels() {
// Launch button
panel.querySelector('.hwfit-serve-launch').addEventListener('click', async (ev) => {
const _launchBtn = ev.currentTarget;
// Final safety net: never launch with ctx beyond the model's trained
// limit (or the absolute sanity ceiling when the limit is unknown). A
// stale preset or typo (e.g. 16000000) overflows and, with a quantized
// KV cache, can crash the GPU. Skip only if the user hand-edited the raw
// command (then we respect their literal text).
if (!_cmdManuallyEdited) _clampCtx(true);
if (!_cmdManuallyEdited) updateCmd();
const launchCmd = _cmdTextarea ? _cmdTextarea.value.trim() : panel._cmd;
const serveState = {};
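
The two-ceiling clamp applied on change, blur, profile load, and launch can be isolated as a pure function for testing. This is a minimal sketch under assumptions: `clampContext` and its return shape are hypothetical, and only the 1M ceiling and the "model limit wins when known" rule come from the code above.

```javascript
const ABSOLUTE_CTX_MAX = 1048576; // same 1M sanity ceiling as the panel code

// Hypothetical pure version of the clamp. Returns the value to use and
// which ceiling (if any) was applied. modelCtxMax <= 0 means "unknown".
function clampContext(rawValue, modelCtxMax = 0) {
  const known = Number.isFinite(modelCtxMax) && modelCtxMax > 0;
  const cap = known ? modelCtxMax : ABSOLUTE_CTX_MAX;
  const v = parseInt(rawValue, 10);
  // Non-numeric or in-range input is passed through untouched.
  if (!Number.isFinite(v) || v <= cap) {
    return { value: rawValue, capped: false, reason: null };
  }
  return { value: String(cap), capped: true, reason: known ? 'model' : 'absolute' };
}

console.log(clampContext('16000000'));      // typo, model limit unknown
console.log(clampContext('262144', 32768)); // stale preset, model limit known
console.log(clampContext('8192', 32768));   // already within the limit
```

Returning the reason alongside the value lets a caller choose between the "model's trained limit" and "maximum sane context" wording without recomputing the cap.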


@@ -1744,6 +1744,12 @@ body.bg-pattern-sparkles {
padding-left: max(0px, calc((100% - var(--chat-max)) / 2));
padding-right: max(12px, calc((100% - var(--chat-max)) / 2 + 12px));
}
/* Sortable Cookbook column headers had no visual cue, so users couldn't tell
a header was clickable (the Newest sort on the Model column was invisible).
Show a pointer + dotted-underline hover highlight, and bold the active sort column. */
.hwfit-header .hwfit-sortable { cursor: pointer; transition: color .12s; }
.hwfit-header .hwfit-sortable:hover { color: var(--fg); text-decoration: underline dotted; }
.hwfit-header .hwfit-sort-active { color: var(--fg); font-weight: 600; }
/* Welcome screen — centered in available space above input bar */
#welcome-screen {
position:absolute;


@@ -0,0 +1,110 @@
"""Intelligent llama.cpp serve profiles computed from hardware.
Locks in that compute_serve_profiles() turns detected VRAM + model size into
sane Quality/Balanced/Speed flag sets: a too-big MoE offloads experts to CPU
(n_cpu_moe > 0) instead of failing, a model that fits stays fully on GPU
(n_cpu_moe == 0), context shrinks before giving up, and quant choice tracks the
profile intent.
"""
from services.hwfit.profiles import compute_serve_profiles

_QWEN_35B_MOE = {
    "name": "Qwen3.6-35B-A3B",
    "parameter_count": "35B",
    "is_moe": True,
    "active_parameters": 3_000_000_000,
    "num_hidden_layers": 48,
}

_DENSE_8B = {
    "name": "Qwen3-8B",
    "parameter_count": "8B",
    "is_moe": False,
    "num_hidden_layers": 36,
}


def _sys(vram, family="rdna"):
    return {"backend": "rocm", "gpu_vram_gb": vram, "gpu_family": family}


def test_big_moe_on_small_card_offloads_not_fails():
    """A 35B MoE can't hold its weights on 16 GB, so the Quality profile must
    offload experts to CPU (n_cpu_moe > 0) rather than be dropped."""
    profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE)
    assert profs, "expected at least one profile"
    q = next(p for p in profs if p["key"] == "quality")
    assert q["n_cpu_moe"] > 0
    assert q["offloads"] is True
    assert q["cache_type"] == "q8_0"  # quality uses the sharp KV cache
    assert q["est_vram_gb"] <= 16.0  # never exceeds the card


def test_profiles_never_exceed_vram():
    """Every profile's VRAM estimate must fit the detected card."""
    for vram in (8.0, 12.0, 16.0, 24.0):
        for p in compute_serve_profiles(_sys(vram), _QWEN_35B_MOE):
            assert p["est_vram_gb"] <= vram + 0.05, (vram, p)


def test_small_model_stays_fully_on_gpu():
    """A model whose weights fit must NOT offload — n_cpu_moe == 0 everywhere."""
    for p in compute_serve_profiles(_sys(15.9), _DENSE_8B):
        assert p["n_cpu_moe"] == 0
        assert p["offloads"] is False


def test_speed_profile_is_lighter_than_quality():
    """Speed trades quant/context for less offload than Quality."""
    profs = {p["key"]: p for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE)}
    if "speed" in profs and "quality" in profs:
        assert profs["speed"]["n_cpu_moe"] <= profs["quality"]["n_cpu_moe"]
        assert profs["speed"]["ctx"] <= profs["quality"]["ctx"]


def test_flags_are_launchable():
    """Each profile must carry the concrete llama.cpp flags the cmd builder needs."""
    for p in compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE):
        assert p["n_gpu_layers"] == 999
        assert isinstance(p["n_cpu_moe"], int) and p["n_cpu_moe"] >= 0
        assert p["cache_type"] in ("q4_0", "q8_0", "f16")
        assert p["ctx"] >= 8192
        assert p["quant"]


def test_context_capped_at_model_limit():
    """Profiles must never propose more context than the model was trained for
    — over-asking triggers a training-context overflow and, with a quantized KV
    cache, a GPU OOM/device-lost crash."""
    small_ctx_model = dict(_QWEN_35B_MOE, name="X", context_length=32768)
    for p in compute_serve_profiles(_sys(15.9), small_ctx_model):
        assert p["ctx"] <= 32768, p


def test_no_gpu_returns_empty():
    """No VRAM detected → no GPU profiles (caller falls back to manual flags)."""
    assert compute_serve_profiles({"backend": "cpu_x86", "gpu_vram_gb": 0}, _QWEN_35B_MOE) == []


def test_vision_model_leaves_encoder_headroom():
    """A vision model must budget extra VRAM for the image encoder, so its
    estimate leaves more slack below the card than a text model would."""
    vis = dict(_QWEN_35B_MOE, name="Qwen3-VL-35B", is_multimodal=True)
    for p in compute_serve_profiles(_sys(15.9), vis):
        assert p["est_vram_gb"] <= 15.9 - 1.0 + 0.05  # ~1.1 GB encoder headroom


def test_serve_mode_keeps_fixed_quant():
    """Serving a specific GGUF file: the quant is fixed (the file's), so every
    profile must keep it and vary only the serving knobs (KV/ctx/offload) — not
    propose a different quant (which makes no sense for an on-disk file)."""
    profs = compute_serve_profiles(_sys(15.9), _QWEN_35B_MOE,
                                   serve_weights_gb=20.6, serve_quant="Q4_K_M")
    assert profs
    assert all(p["quant"] == "Q4_K_M" for p in profs), [p["quant"] for p in profs]
    # The knobs should still differ across profiles (KV type and/or context).
    kvs = {p["cache_type"] for p in profs}
    ctxs = {p["ctx"] for p in profs}
    assert len(kvs) > 1 or len(ctxs) > 1, "serve profiles are identical"
    # All must fit the card.
    assert all(p["est_vram_gb"] <= 16.0 for p in profs)