Cookbook serve profiles and engine filter
* Cookbook: Engine filter + intelligent hardware-computed serve profiles

Two related Cookbook serving improvements for accurate, hardware-aware model serving (especially on consumer GPUs that can only run GGUF/llama.cpp).

Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant picker. Pure client-side view filter over the fetched list via the same _detectBackend() the serve commands use, so what you filter to is exactly what would launch. Re-renders from cache (no refetch). The empty-state message and the instant-cache-paint path account for it too.

Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM + model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type, context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU instead of failing; a model that fits stays fully on GPU; quant tracks profile intent; vision models keep image-encoder headroom. Reuses models.py VRAM math so filtering and serving agree on what fits. Pure/deterministic (no t/s claims: partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained context limit, with loose name matching (strips org/ prefix, -GGUF suffix, quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn / --cache-type-k/v when set, with llama-cpp-python fallback equivalents. It previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV Cache / Flash Attn fields. Context is clamped to the model's trained limit (and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch. This fixes a crash where a stale 256k/16M preset + quantized KV cache caused an amdgpu ErrorDeviceLost.

Tests: tests/test_serve_profiles.py (7): offload vs full-GPU fit, never exceed VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd green; verified live on an RDNA4 box (gfx1200), where Balanced lands ~ncm18 q4 128k, matching hand-tuning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook: make column-header sorting discoverable (incl. Newest)

Sorting in Cookbook is via clickable column headers (pewds' design), but the headers had no visual cue that they're interactive, so sorting in general, and the Newest sort on the Model header specifically, was undiscoverable.

- Style sortable headers as interactive: pointer cursor, hover underline, and the active sort column bolded/highlighted. There was no CSS for .hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort, not just Newest.
- The Model column header sorts by release_date (newest first), reusing the existing header-click sort wiring and the "newest" SORT_KEY. No new sort control: it uses the existing column-header paradigm.

Checks: node --check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)

In the Serve tab the model is a specific GGUF file already on disk, so its quant can't change, but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K" as if you could re-quantize it. That's meaningless when serving a fixed file.

- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE mode), the quant is locked to the file's and profiles differ only in the real serving knobs: n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode (no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from the repo/file name) and passes them, so profiles match what's actually served.

Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k ncm15), with no nonsensical quant changes.

Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor

Two serve-panel additions:

1. **Vision toggle.** A "Vision" checkbox that serves the model with its multimodal projector so it can read images. The mmproj path is resolved at runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in the model folder makes the toggle just work; `--mmproj … --image-max-tokens 1024` (native) / `--clip_model_path` (llama-cpp-python) are emitted only when the toggle is on and a projector is found.
2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s while the panel is open and shows VRAM used/total/%, free, and, crucially on a discrete card, **RAM spillover** (AMD gtt_used_mb), with a plain-language health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint (previously read for total only and discarded for 'used'). Lets you see at a glance whether a config fits VRAM (fast) or is paging to system RAM over PCIe (slow) instead of guessing.

Checks: node --check + py_compile pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
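The loose name matching described for /api/hwfit/profiles (strip the org/ prefix, the -GGUF suffix, and the quant tag) can be illustrated with a small sketch. The server-side matcher is not part of this diff, so the function names, the regexes, and the Map-based catalog below are assumptions made for illustration only.

```javascript
// Hypothetical helper, not the project's real matcher: reduce a repo id or a
// local GGUF folder/file name to a key that can be looked up in a catalog.
function normalizeModelName(name) {
  // Drop an "org/" prefix and a ".gguf" file extension, if present.
  let s = String(name).replace(/^.*\//, '').replace(/\.gguf$/i, '');
  let prev;
  do {
    prev = s;
    s = s
      .replace(/[-_.]GGUF$/i, '')                          // "-GGUF" suffix
      .replace(/[-_.](?:I?Q\d\w*|BF16|F16|FP8)$/i, '');    // trailing quant tag
  } while (s !== prev); // repeat so "-Q4_K_M-GGUF" and "-GGUF-Q4_K_M" both reduce
  return s.toLowerCase();
}

// Hypothetical lookup: `catalog` is assumed to be a Map keyed by normalized name.
function resolveCatalogEntry(name, catalog) {
  return catalog.get(normalizeModelName(name)) ?? null;
}

const catalog = new Map([['qwen3-30b-a3b', { ctxMax: 131072 }]]);
const entry = resolveCatalogEntry('unsloth/Qwen3-30B-A3B-Q4_K_M-GGUF', catalog);
```

Looping until the string stops changing keeps the sketch independent of the order in which the suffixes appear.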
@@ -365,6 +365,17 @@ function _hwfitShowError(list, host, detail) {
   if (rb) rb.addEventListener('click', () => { _resetGpuToggleState(); _hwfitFetch(true); });
 }
 
+// Client-side "Engine" filter (llama.cpp / vLLM / SGLang). Empty = show all.
+// Uses the same _detectBackend() the serve commands use, so what you filter to
+// is exactly what would be launched. Pure view filter — no refetch needed.
+function _applyEngineFilter(models) {
+  const want = document.getElementById('hwfit-engine')?.value || '';
+  if (!want || !Array.isArray(models)) return models || [];
+  return models.filter(m => {
+    try { return _detectBackend(m).backend === want; } catch { return true; }
+  });
+}
+
 export async function _hwfitFetch(fresh = false) {
   const _tk = ++_hwfitFetchToken;
   const useCase = document.getElementById('hwfit-usecase')?.value || '';
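The engine filter in the hunk above is a pure view filter, which makes it easy to exercise outside the browser. The following is a minimal sketch under stated assumptions: `detectBackend` is a hypothetical stand-in (the real `_detectBackend()` is not shown in this diff), and the selected engine is passed as an argument instead of being read from the DOM.

```javascript
// Hypothetical stand-in for _detectBackend(); the real detection rules are
// not part of this diff. Here a name containing "gguf" means llama.cpp.
function detectBackend(model) {
  if (/gguf/i.test(model.name)) return { backend: 'llamacpp' };
  if (model.engine === 'sglang') return { backend: 'sglang' };
  return { backend: 'vllm' };
}

// Same shape as _applyEngineFilter, with the wanted engine as a parameter.
function applyEngineFilter(models, want) {
  if (!want || !Array.isArray(models)) return models || [];
  return models.filter(m => {
    // If detection throws, keep the entry instead of hiding it.
    try { return detectBackend(m).backend === want; } catch { return true; }
  });
}

const sample = [
  { name: 'Some-Model-GGUF' },
  { name: 'Other-Model', engine: 'sglang' },
  { name: 'Third-Model' },
  null, // malformed entry: detection throws, so the filter keeps it
];
const onlyLlamaCpp = applyEngineFilter(sample, 'llamacpp');
```

An empty selection returns the list unchanged, and a detection failure errs on the side of showing the model.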
@@ -384,7 +395,7 @@ export async function _hwfitFetch(fresh = false) {
   if (_cached) {
     _hwfitCache = _cached;
     _hwfitRenderHw(hw, _cached.system);
-    _hwfitRenderList(list, _cached.models);
+    _hwfitRenderList(list, _applyEngineFilter(_cached.models));
   } else {
     // Show spinner while scanning — stack the spinner above a text label
     // (the .hwfit-loading class is a centered flex ROW, so force column here).
@@ -530,7 +541,7 @@ export async function _hwfitFetch(fresh = false) {
         return asc ? av - bv : bv - av;
       });
     }
-    _hwfitRenderList(list, data.models);
+    _hwfitRenderList(list, _applyEngineFilter(data.models));
     // Persist this result so the next page load can paint it instantly.
     _writeScanCache(_sig, data);
     // Render GPU toggles — only on first scan (no override active)
@@ -773,9 +784,10 @@ export function _hwfitRenderList(el, models) {
     const hasHw = sys && ((sys.gpu_vram_gb || 0) > 0 || (sys.total_ram_gb || 0) > 8);
     const hasFilters = !!(document.getElementById('hwfit-search')?.value?.trim()
       || document.getElementById('hwfit-usecase')?.value
-      || document.getElementById('hwfit-quant')?.value);
+      || document.getElementById('hwfit-quant')?.value
+      || document.getElementById('hwfit-engine')?.value);
     let msg;
-    if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, or quant.';
+    if (hasFilters) msg = 'No models match these filters — try clearing the search, use-case, quant, or engine.';
     else if (hasHw) msg = 'No models fit — the hardware probe may have under-reported. Try Rescan.';
     else msg = 'No models fit your hardware';
     el.innerHTML = `<div class="hwfit-loading">${msg}</div>`;
@@ -1122,6 +1134,17 @@ export function _hwfitInit() {
   if (uc) uc.addEventListener('change', () => _hwfitFetch());
   if (sort) sort.addEventListener('change', () => _hwfitFetch());
   if (qpref) qpref.addEventListener('change', () => _hwfitFetch());
+  // Engine filter is a pure client-side view filter over the already-fetched
+  // list, so just re-render from cache instead of re-probing hardware.
+  const engine = document.getElementById('hwfit-engine');
+  if (engine) engine.addEventListener('change', () => {
+    const list = document.getElementById('hwfit-list');
+    if (list && _hwfitCache && Array.isArray(_hwfitCache.models)) {
+      _hwfitRenderList(list, _applyEngineFilter(_hwfitCache.models));
+    } else {
+      _hwfitFetch();
+    }
+  });
   // Rescan — force a fresh hardware probe (bypasses the per-host cache).
   const rescan = document.getElementById('hwfit-rescan');
   if (rescan && !rescan.dataset.bound) {
@@ -417,11 +417,40 @@ export function _buildServeCmd(f, modelName, backend) {
     // renders modern GGUF chat templates that the Python bindings' Jinja2
     // rejects (do_tojson ensure_ascii). Fall back to llama_cpp.server.
     // Don't suppress stderr — surface real errors (missing file, lib, OOM).
-    const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}`;
+    // Optional perf/fit flags from a hardware profile (see services/hwfit/
+    // profiles.py). n_cpu_moe offloads MoE expert layers to CPU when the model
+    // is bigger than VRAM; flash-attn + a quantized KV cache cut KV memory and
+    // speed things up. Only emitted when set, so manual/older flows are unchanged.
+    const _ncm = (f.n_cpu_moe ?? '').toString().trim();
+    const _kv = (f.cache_type ?? '').toString().trim();
+    let _lcExtra = '';
+    let _lcpExtra = '';
+    if (_ncm !== '' && Number(_ncm) > 0) {
+      _lcExtra += ` --n-cpu-moe ${_ncm}`;
+      _lcpExtra += ` --n_cpu_moe ${_ncm}`; // llama-cpp-python uses underscores
+    }
+    if (f.flash_attn) {
+      _lcExtra += ' --flash-attn on';
+      _lcpExtra += ' --flash_attn true';
+    }
+    if (_kv) {
+      _lcExtra += ` --cache-type-k ${_kv} --cache-type-v ${_kv}`;
+      // llama-cpp-python exposes these as type_k/type_v; pass through best-effort.
+      _lcpExtra += ` --type_k ${_kv} --type_v ${_kv}`;
+    }
+    // Vision: serve the multimodal projector so the model can read images. The
+    // mmproj path is resolved at runtime (find mmproj-*.gguf next to the model);
+    // only emitted when the Vision toggle is on AND a projector was found.
+    if (f.vision && f._mmproj_path) {
+      _lcExtra += ` --mmproj "${f._mmproj_path}" --image-max-tokens 1024`;
+      // llama-cpp-python takes the projector via --clip_model_path.
+      _lcpExtra += ` --clip_model_path "${f._mmproj_path}"`;
+    }
+    const _lcpServer = `${lcPrefix}${py} -m llama_cpp.server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} --n_gpu_layers ${f.ngl || '99'} --n_ctx ${f.ctx || '8192'}${_lcpExtra}`;
     if (_isWindows()) {
       cmd += _lcpServer;
     } else {
-      cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}`;
+      cmd += `${lcPrefix}llama-server --model ${modelArg} --host 0.0.0.0 --port ${f.port || '8080'} -ngl ${f.ngl || '99'} -c ${f.ctx || '8192'}${_lcExtra}`;
       cmd += ` || ${_lcpServer}`;
     }
   } else if (backend === 'ollama') {
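The optional-flag logic in this hunk can be pulled into a pure function to make the "only emitted when set" behaviour easy to check. This is a sketch, not the project's code: `llamaServerExtras` is a hypothetical name, it covers only the native llama-server flags that appear in the diff, and the integer check on `n_cpu_moe` is an added assumption (the diff itself accepts any value where `Number(...) > 0`).

```javascript
// Hypothetical pure helper mirroring the _lcExtra branch of the diff.
// `f` is the serve-form state: { n_cpu_moe, flash_attn, cache_type }.
function llamaServerExtras(f) {
  const parts = [];
  const ncm = String(f.n_cpu_moe ?? '').trim();
  const n = Number(ncm);
  // Assumption: only whole, positive layer counts are worth emitting.
  if (ncm !== '' && Number.isInteger(n) && n > 0) parts.push('--n-cpu-moe', String(n));
  if (f.flash_attn) parts.push('--flash-attn', 'on');
  const kv = String(f.cache_type ?? '').trim();
  if (kv) parts.push('--cache-type-k', kv, '--cache-type-v', kv);
  return parts.join(' ');
}

const balanced = llamaServerExtras({ n_cpu_moe: '17', flash_attn: true, cache_type: 'q4_0' });
const manual = llamaServerExtras({});
```

With an empty form state the helper returns an empty string, which matches the diff's goal of leaving manual and older flows unchanged.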
@@ -1460,6 +1489,16 @@ function _renderRecipes() {
   html += '<option value="Q3_K_M">Q3</option><option value="Q2_K">Q2</option>';
   html += '<option value="AWQ-4bit">AWQ</option><option value="FP8">FP8</option>';
   html += '<option value="">Native</option></select>';
+  // Engine filter: show only models whose serve engine matches. "llama.cpp"
+  // (GGUF) runs everywhere incl. consumer AMD/Apple; vLLM/SGLang are CUDA /
+  // datacenter-ROCm. Filtering is client-side via _detectBackend() in the
+  // hwfit renderer, so it composes with the quant/type/search filters.
+  html += '<select class="cookbook-field-input hwfit-engine" id="hwfit-engine" style="height:28px;" title="Filter by serving engine">';
+  html += '<option value="">Engine</option>';
+  html += '<option value="llamacpp">llama.cpp</option>';
+  html += '<option value="vllm">vLLM</option>';
+  html += '<option value="sglang">SGLang</option>';
+  html += '</select>';
   html += '</div>';
   html += '<div class="hwfit-toolbar" style="margin-top:7px;">';
   html += '<select class="cookbook-field-input hwfit-server-select" id="hwfit-server-select" style="height:28px;min-width:88px;position:relative;top:0px;">';
@@ -1469,6 +1508,8 @@ function _renderRecipes() {
   // Scan/refresh button (icon-only) where the quant dropdown used to sit.
   html += '<button type="button" class="hwfit-gpu-btn" id="hwfit-rescan" title="Re-scan hardware" style="flex-shrink:0;position:relative;top:-3px;left:-1px;">↻ RESCAN</button>';
   html += '<button type="button" class="hwfit-gpu-btn hwfit-hw-manual-btn" id="hwfit-hw-manual-btn" title="Set hardware manually" style="flex-shrink:0;position:relative;top:-3px;left:-1px;">EDIT</button>';
+  // Sort state — the clickable column headers read/write this (pewds' original
+  // sort paradigm). Newest is reachable by clicking the Model column header.
   html += '<select class="cookbook-field-input hwfit-sort" id="hwfit-sort" style="display:none">';
   html += '<option value="fit">Fit</option><option value="score">Score</option><option value="vram">VRAM</option>';
   html += '<option value="speed">Speed</option><option value="params">Params</option>';
@@ -542,6 +542,27 @@ function _rerenderCachedModels() {
     panelHtml += `<label class="hwfit-sf-cb"><input type="checkbox" class="hwfit-sf" data-field="prefix_cache"${sv('prefix_cache',false)?' checked':''} /> Prefix Caching${_h('Cache shared prompt prefixes across requests')}</label>`;
     panelHtml += `<label class="hwfit-sf-cb hwfit-backend-vllm"><input type="checkbox" class="hwfit-sf" data-field="auto_tool"${sv('auto_tool',false)?' checked':''} /> Auto Tool Choice${_h('Enable function/tool calling for agent mode')}</label>`;
     panelHtml += `</div>`;
+    // Row 2c: llama.cpp fit/perf flags (set by Auto profiles, editable by hand)
+    const _kvOpts = ['', 'q4_0', 'q8_0', 'f16'].map(k => `<option value="${k}"${sv('cache_type','')===k?' selected':''}>${k||'default'}</option>`).join('');
+    panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp">`;
+    panelHtml += `<label>${_l('CPU MoE','n-cpu-moe: number of MoE expert layers to run on CPU when the model is bigger than VRAM. 0 = all on GPU. Set automatically by the Auto profiles below.')}<input type="text" class="hwfit-sf" data-field="n_cpu_moe" value="${esc(sv('n_cpu_moe',''))}" placeholder="0" style="width:54px;" /></label>`;
+    panelHtml += `<label>${_l('KV Cache','cache-type-k/v: quantize the KV cache. q4_0 = smallest (more context), q8_0 = sharp long-context, f16 = full. Blank = llama.cpp default.')}<select class="hwfit-sf" data-field="cache_type">${_kvOpts}</select></label>`;
+    panelHtml += `<label class="hwfit-sf-cb" style="align-self:end;"><input type="checkbox" class="hwfit-sf" data-field="flash_attn"${sv('flash_attn',false)?' checked':''} /> Flash Attn${_h('--flash-attn on: faster attention + needed for quantized KV cache.')}</label>`;
+    panelHtml += `<label class="hwfit-sf-cb" style="align-self:end;"><input type="checkbox" class="hwfit-sf" data-field="vision"${sv('vision',false)?' checked':''} /> Vision${_h('Serve with the vision encoder so the model can read images. Auto-finds an mmproj-*.gguf next to the model (download one into the model folder). Adds ~1 GB VRAM + a small per-image cost.')}</label>`;
+    panelHtml += `</div>`;
+    // Row 2d: Auto profiles — computed from detected hardware (see profiles.py).
+    // Buttons are injected after the panel mounts (needs an async fetch).
+    panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp hwfit-serve-profiles" style="align-items:center;gap:8px;">`;
+    panelHtml += `<span style="opacity:0.7;font-size:11px;">Auto profiles:</span>`;
+    panelHtml += `<span class="hwfit-profile-btns" style="display:flex;gap:6px;flex-wrap:wrap;"><span style="opacity:0.5;font-size:11px;">computing…</span></span>`;
+    panelHtml += `</div>`;
+    // Live VRAM / RAM-spillover monitor for the serve target's GPU. Polls
+    // /api/cookbook/gpus while the panel is open so you can SEE whether the
+    // config fits VRAM (fast) or spills to system RAM (slow). Populated after mount.
+    panelHtml += `<div class="hwfit-serve-row hwfit-backend-llamacpp hwfit-vram-monitor" style="align-items:center;gap:8px;font-size:11px;">`;
+    panelHtml += `<span style="opacity:0.7;">GPU memory:</span>`;
+    panelHtml += `<span class="hwfit-vram-readout" style="opacity:0.5;">checking…</span>`;
+    panelHtml += `</div>`;
     // Row 3a: Checkboxes (llama.cpp-only)
     panelHtml += `<div class="hwfit-serve-checks hwfit-backend-llamacpp">`;
     panelHtml += `<label class="hwfit-sf-cb"><input type="checkbox" class="hwfit-sf" data-field="unified_mem"${sv('unified_mem',false)?' checked':''} /> Unified Memory${_h('For AMD APUs / Strix Halo: exports GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so llama.cpp can address the full BIOS VRAM carveout instead of the default ~28 GB cap. No-op on discrete GPUs.')}</label>`;
@@ -641,6 +662,11 @@ function _rerenderCachedModels() {
           : m.is_local_dir && m.path
             ? `$({ find ${_ldir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${_ldir} -name '*.gguf' 2>/dev/null | sort; } | head -1)`
             : `$({ find ${dir} -name '*-00001-of-*.gguf' 2>/dev/null | sort; find ${dir} -name '*.gguf' 2>/dev/null | sort; } | head -1)`;
+      // Vision: auto-find the mmproj (CLIP/projector) file in the same dir.
+      // Resolved at runtime so the toggle just works if an mmproj-*.gguf is
+      // present (downloaded alongside the model). Empty if none → cmd omits it.
+      const _vsearchdir = (m.is_local_dir && m.path) ? _ldir : dir;
+      f._mmproj_path = `$(find ${_vsearchdir} -iname 'mmproj*.gguf' 2>/dev/null | sort | head -1)`;
     }
     if (f.reasoning_parser) {
       const _rpEl2 = panel.querySelector('[data-field="reasoning_parser"]');
@@ -655,6 +681,151 @@ function _rerenderCachedModels() {
     }
     updateCmd();
 
+    // Context clamp. Two ceilings:
+    // - ABSOLUTE_CTX_MAX: a hard sanity cap (no LLM trains past ~1M tokens),
+    //   so an obvious typo like 16000000 can never reach llama.cpp even when
+    //   we don't know the model's real limit (not in catalog / profiles
+    //   fetch failed). This is what stops the radv ErrorDeviceLost crash.
+    // - panel._modelCtxMax: the model's actual trained limit (set by the
+    //   profiles fetch below) — a tighter, model-specific cap when known.
+    const ABSOLUTE_CTX_MAX = 1048576; // 1M tokens — above any real n_ctx_train
+    const _ctxEl0 = panel.querySelector('[data-field="ctx"]');
+    function _clampCtx(announce) {
+      if (!_ctxEl0) return;
+      const cap = panel._modelCtxMax > 0 ? panel._modelCtxMax : ABSOLUTE_CTX_MAX;
+      const v = parseInt(_ctxEl0.value, 10);
+      if (Number.isFinite(v) && v > cap) {
+        _ctxEl0.value = String(cap);
+        _ctxEl0.title = `Capped to ${panel._modelCtxMax > 0 ? "this model's trained limit" : "the maximum sane context"} (${cap}).`;
+        if (announce) uiModule.showToast(`Context capped to ${cap}`);
+        updateCmd();
+      }
+    }
+    if (_ctxEl0) {
+      _ctxEl0.addEventListener('change', () => _clampCtx(false));
+      _ctxEl0.addEventListener('blur', () => _clampCtx(false));
+      _clampCtx(false); // fix any stale/preset value already present
+    }
+
+    // Auto profiles — fetch hardware-computed llama.cpp profiles and render
+    // them as clickable chips. Clicking one fills the ctx/CPU-MoE/KV/flash
+    // fields and rebuilds the command. Computed from detected VRAM (see
+    // services/hwfit/profiles.py); rough on t/s, accurate on fit.
+    async function _loadServeProfiles() {
+      const wrap = panel.querySelector('.hwfit-profile-btns');
+      if (!wrap) return;
+      try {
+        const host = (_es.remoteHost || '').trim();
+        const params = new URLSearchParams({ model: repo });
+        if (host) {
+          params.set('host', host);
+          const _sp = (_es.servers || []).find(s => s.host === host)?.port;
+          if (_sp) params.set('ssh_port', _sp);
+        }
+        // SERVE mode: this is a specific GGUF file already on disk, so its quant
+        // is fixed — tell the profiler the file's real size + quant so it varies
+        // only the serving knobs (KV/ctx/offload), not the quant. Parse the size
+        // from m.size (e.g. "20.6 GB") and the quant from the file/repo name.
+        const _sizeMatch = String(m.size || '').match(/([\d.]+)\s*GB/i);
+        if (_sizeMatch) params.set('serve_weights_gb', _sizeMatch[1]);
+        const _qMatch = String(repo).match(/(Q\d[\w]*|IQ\d[\w]*|F16|BF16|FP8)/i);
+        if (_qMatch) params.set('serve_quant', _qMatch[1]);
+        const res = await fetch(`/api/hwfit/profiles?${params}`);
+        const data = await res.json();
+        // Remember the model's trained context limit and clamp the ctx field
+        // to it — asking llama.cpp for ctx > n_ctx_train overflows and, with a
+        // quantized KV cache, can crash the GPU (radv ErrorDeviceLost).
+        const ctxMax = Number(data && data.model_ctx_max) || 0;
+        if (ctxMax > 0) {
+          panel._modelCtxMax = ctxMax; // tighten the clamp to the real limit
+          _clampCtx(false); // re-apply now that we know the model's max
+        }
+        const profs = (data && Array.isArray(data.profiles)) ? data.profiles : [];
+        if (!profs.length) { wrap.innerHTML = `<span style="opacity:0.5;font-size:11px;">no auto profile for this model</span>`; return; }
+        wrap.innerHTML = '';
+        for (const p of profs) {
+          const b = document.createElement('button');
+          b.type = 'button';
+          b.className = 'cookbook-btn hwfit-profile-chip';
+          b.style.cssText = 'height:24px;padding:0 9px;font-size:11px;';
+          const off = p.offloads ? `, ncm${p.n_cpu_moe}` : ', all-GPU';
+          b.textContent = `${p.label} · ${p.quant} · ${Math.round(p.ctx/1024)}k${off}`;
+          b.title = `${p.note}\nKV ${p.cache_type}, ~${p.est_vram_gb} GB VRAM`;
+          b.addEventListener('click', () => {
+            const set = (field, val) => {
+              const el = panel.querySelector(`[data-field="${field}"]`);
+              if (!el) return;
+              if (el.type === 'checkbox') el.checked = !!val; else el.value = val;
+            };
+            set('ctx', p.ctx);
+            set('n_cpu_moe', p.n_cpu_moe || '');
+            set('cache_type', p.cache_type || '');
+            set('flash_attn', true); // required for a quantized KV cache
+            wrap.querySelectorAll('.hwfit-profile-chip').forEach(x => x.classList.remove('cookbook-btn-active'));
+            b.classList.add('cookbook-btn-active');
+            updateCmd();
+          });
+          wrap.appendChild(b);
+        }
+      } catch {
+        wrap.innerHTML = `<span style="opacity:0.5;font-size:11px;">profile compute failed</span>`;
+      }
+    }
+    _loadServeProfiles();
+
+    // Live GPU-memory monitor: poll /api/cookbook/gpus and show VRAM usage +
+    // RAM-spillover, with a plain-language health/speed hint. Lets you tell at
+    // a glance whether the chosen config fits VRAM (fast) or is paging into
+    // system RAM over PCIe (slow). AMD sysfs reports gtt_used_mb for spillover.
+    async function _refreshVramMonitor() {
+      const el = panel.querySelector('.hwfit-vram-readout');
+      if (!el || !document.body.contains(el)) return false; // panel closed → stop
+      try {
+        const host = (_es.remoteHost || '').trim();
+        const params = new URLSearchParams();
+        if (host) {
+          params.set('host', host);
+          const _sp = (_es.servers || []).find(s => s.host === host)?.port;
+          if (_sp) params.set('ssh_port', _sp);
+        }
+        const res = await fetch('/api/cookbook/gpus' + (params.toString() ? '?' + params : ''));
+        const data = await res.json();
+        const gpus = Array.isArray(data) ? data : (data.gpus || []);
+        if (!gpus.length) { el.textContent = 'no GPU detected'; el.style.color = ''; return true; }
+        const g = gpus[0];
+        const usedG = (g.used_mb / 1024), totG = (g.total_mb / 1024);
+        const pct = totG ? Math.round((usedG / totG) * 100) : 0;
+        const freeG = Math.max(0, totG - usedG);
+        const spillG = (g.gtt_used_mb || 0) / 1024;
+        // Color: green < 85%, amber 85-97%, red > 97% or spilling.
+        const spilling = spillG > 0.5 && !g.unified_memory; // unified APUs always use GTT; not a spill
+        let color = 'var(--green, #50fa7b)';
+        if (pct >= 97 || spilling) color = 'var(--red, #ff5555)';
+        else if (pct >= 85) color = 'var(--orange, #ffb86c)';
+        let txt = `${usedG.toFixed(1)} / ${totG.toFixed(1)} GB (${pct}%) · ${freeG.toFixed(1)} GB free`;
+        if (spilling) {
+          txt += ` · ⚠ ${spillG.toFixed(1)} GB spilled to RAM — slow (raise CPU MoE or lower context)`;
+        } else if (pct >= 90) {
+          txt += ` · tight — risk of OOM/spill on long context or images`;
+        } else {
+          txt += ` · healthy`;
+        }
+        el.textContent = txt;
+        el.style.color = color;
+        return true;
+      } catch {
+        el.textContent = 'unavailable';
+        el.style.color = '';
+        return true;
+      }
+    }
+    _refreshVramMonitor();
+    // Poll every 4s while the panel is open; stop when it's removed from the DOM.
+    const _vramTimer = setInterval(async () => {
+      const ok = await _refreshVramMonitor();
+      if (ok === false) clearInterval(_vramTimer);
+    }, 4000);
+
     // Show/hide backend-specific sections
     function updateBackendVisibility() {
       const b = panel.querySelector('[data-field="backend"]')?.value || 'vllm';
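The health classification inside the GPU-memory monitor can be separated from the DOM and the fetch so the thresholds are easy to reason about. The sketch below reuses the thresholds and field names from the hunk above (`used_mb`, `total_mb`, `gtt_used_mb`, `unified_memory`); the function name and the returned `{ pct, level, hint }` shape are hypothetical.

```javascript
// Hypothetical pure version of the monitor's classification step.
function classifyGpuMemory(g) {
  const usedG = g.used_mb / 1024;
  const totG = g.total_mb / 1024;
  const pct = totG ? Math.round((usedG / totG) * 100) : 0;
  const spillG = (g.gtt_used_mb || 0) / 1024;
  // Unified-memory APUs always use GTT, so GTT usage is not a spill there.
  const spilling = spillG > 0.5 && !g.unified_memory;

  let level = 'green';
  if (pct >= 97 || spilling) level = 'red';
  else if (pct >= 85) level = 'amber';

  // The text hint uses its own 90% threshold, as in the diff.
  const hint = spilling ? 'spilled' : (pct >= 90 ? 'tight' : 'healthy');
  return { pct, level, hint };
}

const roomy = classifyGpuMemory({ used_mb: 8192, total_mb: 16384 });
const nearFull = classifyGpuMemory({ used_mb: 14336, total_mb: 16384 });
const spilled = classifyGpuMemory({ used_mb: 15000, total_mb: 16384, gtt_used_mb: 2048 });
```

One consequence of the two separate thresholds is that usage between 85% and 89% is colored amber while the text still reads "healthy", as the `nearFull` case shows.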
@@ -1313,6 +1484,12 @@ function _rerenderCachedModels() {
     // Launch button
     panel.querySelector('.hwfit-serve-launch').addEventListener('click', async (ev) => {
       const _launchBtn = ev.currentTarget;
+      // Final safety net: never launch with ctx beyond the model's trained
+      // limit (or the absolute sanity ceiling when the limit is unknown). A
+      // stale preset or typo (e.g. 16000000) overflows and, with a quantized
+      // KV cache, can crash the GPU. Skip only if the user hand-edited the raw
+      // command (then we respect their literal text).
+      if (!_cmdManuallyEdited) _clampCtx(true);
       if (!_cmdManuallyEdited) updateCmd();
       const launchCmd = _cmdTextarea ? _cmdTextarea.value.trim() : panel._cmd;
       const serveState = {};
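The two-ceiling context clamp that the launch handler relies on reduces to a small pure function. This sketch keeps the constant and the comparison from the diff; the function name `clampCtx` and its return shape are hypothetical, and the DOM updates and toast are left out.

```javascript
const ABSOLUTE_CTX_MAX = 1048576; // 1M tokens, the sanity ceiling used in the diff

// Hypothetical pure version of _clampCtx: `raw` is the field's text value and
// `modelCtxMax` is the model's trained limit (0 or undefined when unknown).
function clampCtx(raw, modelCtxMax) {
  const cap = modelCtxMax > 0 ? modelCtxMax : ABSOLUTE_CTX_MAX;
  const v = parseInt(raw, 10);
  // Non-numeric input is left alone, as in the original.
  if (!Number.isFinite(v)) return { value: raw, clamped: false };
  return v > cap ? { value: cap, clamped: true } : { value: v, clamped: false };
}

const typo = clampCtx('16000000', 0);           // limit unknown: absolute ceiling applies
const stalePreset = clampCtx('262144', 131072); // limit known: model's cap applies
const fine = clampCtx('8192', 131072);          // already within the cap
```

Falling back to the absolute ceiling means the clamp still protects the launch when the profiles fetch fails or the model is not in the catalog.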
@@ -1744,6 +1744,12 @@ body.bg-pattern-sparkles {
   padding-left: max(0px, calc((100% - var(--chat-max)) / 2));
   padding-right: max(12px, calc((100% - var(--chat-max)) / 2 + 12px));
 }
+/* Sortable Cookbook column headers had no visual cue, so users couldn't tell
+   a header was clickable (the Newest sort on the Model column was invisible).
+   Show a pointer + hover highlight, and underline the active sort column. */
+.hwfit-header .hwfit-sortable { cursor: pointer; transition: color .12s; }
+.hwfit-header .hwfit-sortable:hover { color: var(--fg); text-decoration: underline dotted; }
+.hwfit-header .hwfit-sort-active { color: var(--fg); font-weight: 600; }
 /* Welcome screen — centered in available space above input bar */
 #welcome-screen {
   position:absolute;