Cookbook scheduler: calendar events drive model serve windows (experimental, feature-flagged)

Add a calendar-driven scheduler so a user can pick a model in Cookbook, click "Schedule…" instead of "Launch", choose time windows + days of the week + (optional) end date, and have Odysseus auto-launch the serve when the window starts and hard-kill it when the window ends. The calendar IS the source of truth — events on a designated calendar are interpreted as serve schedules, so editing the event in the calendar UI immediately changes the schedule.

Whole feature is gated by setting `cookbook_scheduler_enabled` (default False). With the setting off, the reconciler does nothing on each tick and the API refuses requests. The setting plus three new files make up almost the entire surface (edits to existing files are minimal), so the feature is easy to revert.

New files:
  - src/cookbook_scheduler.py — background reconciler: ticks every 60s, reads next ±90s of calendar events on the designated calendar, launches/kills serves to match. Honors "refuse if GPUs busy" (skips with reason, no retry). Adopts pre-existing manual serves matching the event's model so window-end cleanup still applies. Tags scheduler-owned tasks with `_scheduledBy: <event_uid>` so it never kills serves it doesn't own.
  - routes/cookbook_schedule_routes.py — POST /api/cookbook/schedule/from-cookbook builds RRULE+ICS events from the modal's input (model, slots[], days[], until). GET /upcoming returns the next 24h with per-event status (scheduled / running / adopted / skipped / failed / ended) for the UI. POST /reconcile-now manually kicks the reconciler.
  - static/js/cookbookSchedule.js — Schedule button click handler + modal. Daily/hourly time slot picker, multi-slot ("+ add another time slot"), weekday chips with Weekdays/Weekend/Every-day quicksets, optional Until date. Calls /from-cookbook on save. Whole module is a single IIFE; deleting the file plus its <script> tag removes the UI surface.
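
As a rough illustration of the RRULE step in /from-cookbook, the sketch below builds a weekly RFC 5545 recurrence rule from the modal's weekday chips and optional Until date. The route's real implementation is not shown in this view, so the helper name `build_rrule`, the weekday-code input, and the end-of-day UTC convention for UNTIL are all assumptions made for illustration.

```python
from datetime import date
from typing import List, Optional

# RFC 5545 weekday codes in calendar order.
WEEKDAYS = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]


def build_rrule(days: List[str], until: Optional[date] = None) -> str:
    """Build a weekly RRULE value (hypothetical helper, not from the commit)."""
    unknown = [d for d in days if d not in WEEKDAYS]
    if unknown or not days:
        raise ValueError(f"need at least one valid weekday code, got: {days}")
    # Keep BYDAY in calendar order regardless of the order the chips were clicked.
    ordered = [d for d in WEEKDAYS if d in days]
    parts = ["FREQ=WEEKLY", "BYDAY=" + ",".join(ordered)]
    if until is not None:
        # Assumption: the schedule stays active through the end of the Until date (UTC).
        parts.append(f"UNTIL={until:%Y%m%d}T235959Z")
    return ";".join(parts)


weekday_rule = build_rrule(["FR", "MO", "TU", "WE", "TH"], date(2026, 6, 30))
open_ended_rule = build_rrule(["SA", "SU"])
```

With multiple time slots, the same rule string would presumably be reused for each slot's event, with only the start and end times differing.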

Existing files touched (minimal):
  - app.py: register the new router + add the reconcile loop as a startup task (~10 lines, all in one block). Reconcile loop checks the feature flag on every tick, so leaving it running with the flag off costs ~one settings lookup per minute.
  - static/index.html: one new <script> tag for cookbookSchedule.js.
  - static/js/cookbookServe.js: add a "Schedule…" button next to the existing Launch button. Hidden by default; cookbookSchedule.js reveals it after confirming the feature flag is on.
  - static/style.css: ~80 lines for the modal styles (mobile-aware via @media).
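
The per-tick flag check described for app.py can be pictured with a minimal asyncio sketch. The `SETTINGS` dict stands in for the real settings store and the loop is bounded by a hypothetical `max_ticks` parameter so the example terminates; the real loop runs forever and sleeps 60s between ticks.

```python
import asyncio
from typing import Any, Dict, List

# Stand-in for the real settings store (assumption for this sketch).
SETTINGS: Dict[str, Any] = {"cookbook_scheduler_enabled": False}


async def reconcile_once() -> Dict[str, Any]:
    # The flag is read on every tick, so toggling it takes effect
    # without restarting the loop.
    if not SETTINGS.get("cookbook_scheduler_enabled", False):
        return {"skipped": "disabled"}
    return {"events": []}  # real reconciliation work would go here


async def reconcile_loop(interval: float, max_ticks: int) -> List[Dict[str, Any]]:
    """Bounded variant of a forever-loop, so the example terminates."""
    results: List[Dict[str, Any]] = []
    for _ in range(max_ticks):
        try:
            results.append(await reconcile_once())
        except Exception as exc:  # one bad tick must not kill the loop
            results.append({"error": str(exc)})
        await asyncio.sleep(interval)
    return results


disabled_ticks = asyncio.run(reconcile_loop(interval=0, max_ticks=2))
SETTINGS["cookbook_scheduler_enabled"] = True
enabled_ticks = asyncio.run(reconcile_loop(interval=0, max_ticks=1))
```

With the flag off, each tick returns immediately after a single dictionary lookup, which is the "one settings lookup per minute" cost mentioned above.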

User choices baked in:
  - Calendar events are the source of truth.
  - Refuse to launch if GPUs busy (skip + log reason in scheduler.events[uid].reason).
  - Hard kill at event end.
  - No retry on a skipped event within the window.
  - Multi-slot per day supported (one calendar event per slot, shared RRULE).
  - Pre-existing manual serves get adopted at window start so they're killed at end.
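
The choices above amount to a small decision table. The sketch below restates them as a pure function, under the assumption that the reconciler knows the event window, the tracked status for the event, whether a matching manual serve is running, and whether the GPUs are busy. The name `decide` and the returned action labels are hypothetical; the 90s look-back mirrors the reconciler's window.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE = timedelta(seconds=90)  # look-back used to catch windows that just closed


def decide(now: datetime, start: datetime, end: datetime,
           status: Optional[str], manual_serve_running: bool,
           gpus_busy: bool) -> str:
    """Return one of: 'stop', 'noop', 'adopt', 'skip', 'launch'."""
    just_ended = end <= now < end + GRACE
    if just_ended and status in {"running", "adopted"}:
        return "stop"      # hard kill at event end
    if not (start <= now < end):
        return "noop"      # outside the window, nothing to do
    if status in {"running", "skipped"}:
        return "noop"      # already handled; a skipped event is never retried
    if manual_serve_running:
        return "adopt"     # take ownership so it is killed at window end
    if gpus_busy:
        return "skip"      # refuse to launch, record the reason
    return "launch"


start = datetime(2026, 6, 5, 9, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=2)
inside = start + timedelta(minutes=1)
after = end + timedelta(seconds=30)
```

Note that adoption is checked before the GPU-busy test, so a matching manual serve is adopted even though it is itself holding GPU memory.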

Known follow-ups (not in this commit):
  - Settings UI to pick the schedule calendar + toggle the feature flag.
  - Calendar event color/badge for status (running/skipped/failed).
  - "Lazy launch on first request" — currently launches at event start. Replacing _launch_serve with a proxy that defers vllm until the first chat request is a contained future change.

Author: pewdiepie-archdaemon
Date: 2026-06-05 02:35:23 +09:00
parent 9112861d8e
commit a19b6d2d4d
7 changed files with 1023 additions and 0 deletions

src/cookbook_scheduler.py Normal file
@@ -0,0 +1,456 @@
"""Cookbook scheduler — calendar-driven model launches.

Calendar events on a designated calendar (configurable via setting
`cookbook_schedule_calendar_href`) are interpreted as serve schedules.
The reconciler ticks every ~60s, reads events whose window contains
"now", and reconciles the running serves against them:

  - Event starts in window AND no matching serve running → launch via
    existing /api/model/serve. If GPU is busy, mark event "skipped"
    with reason. No retry.
  - Event ends in window AND a scheduled serve is running → hard-kill.
  - Pre-existing manual serve matching the event's model → adopt it
    (mark as owned by the event so it gets stopped at window end).

Everything in this module is gated by setting `cookbook_scheduler_enabled`.
Setting that to False fully disables the feature without touching code.

Event description format (YAML-ish, single nested key):

    cookbook:
      preset: Qwen3.5-397B-A17B-AWQ     # or repo_id + cmd + host
      repo_id: deepseek-ai/DeepSeek-V4-Flash
      cmd: vllm serve /mnt/HADES/models/...
      host: pewds@192.168.1.12
      port: 8003

If only the title is given, the title is matched against saved preset
names (case-insensitive substring match).
"""
from __future__ import annotations
import asyncio
import json
import logging
import re
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import httpx
logger = logging.getLogger(__name__)
# Schedule-owned tasks are tagged with this so we can tell them apart
# from manual launches when deciding whether to hard-kill at window end.
SCHEDULE_OWNER_KEY = "_scheduledBy"
COOKBOOK_BASE_URL = "http://localhost:7000"
def _internal_headers() -> Dict[str, str]:
    """Match the in-process loopback auth path used by chat-agent tools."""
    from core.middleware import INTERNAL_TOOL_HEADER, INTERNAL_TOOL_TOKEN
    return {INTERNAL_TOOL_HEADER: INTERNAL_TOOL_TOKEN}
def _parse_event_yaml(description: str) -> Dict[str, Any]:
    """Pull the `cookbook:` block out of an event description.

    Deliberately tolerant: we don't want a calendar-edit typo (a stray
    `>`, a tab, etc.) to silently drop the event. Returns {} on any
    error so the caller falls back to title-match against presets.
    """
    if not isinstance(description, str) or "cookbook:" not in description:
        return {}
    try:
        block_start = description.index("cookbook:")
        block = description[block_start:].split("\n")
        out: Dict[str, Any] = {}
        for line in block[1:]:
            if not line.startswith((" ", "\t")):
                # First non-indented line ends the block.
                if line.strip() == "" and not out:
                    continue
                break
            k, _, v = line.strip().partition(":")
            v = v.strip().strip("'").strip('"')
            if k and v:
                out[k] = v
        return out
    except Exception as e:
        logger.debug(f"event yaml parse failed (ignored): {e}")
        return {}
def _now_utc() -> datetime:
    return datetime.now(timezone.utc)


def _parse_iso(s: str) -> Optional[datetime]:
    if not s:
        return None
    try:
        # Accept both ISO with and without timezone; assume UTC if naive.
        s2 = s.replace("Z", "+00:00")
        dt = datetime.fromisoformat(s2)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt
    except Exception:
        return None
async def _fetch_calendar_events(calendar_href: str, start: datetime, end: datetime) -> List[Dict[str, Any]]:
    """List events on a single calendar in [start, end].

    Reuses /api/calendar/events. RRULE expansion happens server-side so
    we get concrete occurrences, not the master recurring event.
    """
    headers = _internal_headers()
    params = {
        "start": start.isoformat(),
        "end": end.isoformat(),
        "calendar": calendar_href,
    }
    try:
        async with httpx.AsyncClient(timeout=15) as client:
            r = await client.get(
                f"{COOKBOOK_BASE_URL}/api/calendar/events",
                params=params, headers=headers,
            )
            if r.status_code >= 400:
                logger.debug(f"calendar/events returned {r.status_code}: {r.text[:200]}")
                return []
            data = r.json()
            return data.get("events", []) if isinstance(data, dict) else []
    except Exception as e:
        logger.warning(f"reconciler: failed to fetch calendar events: {e}")
        return []
async def _gpus_busy(host: str) -> bool:
    """Best-effort: are any GPUs on `host` already under non-trivial load?

    Used to honor "refuse to launch if GPUs busy" semantics. We don't
    block on a vllm process that's currently loading our OWN target;
    that's handled separately (idempotent registration). The check is
    "is there a foreign process holding GPU memory".
    """
    headers = _internal_headers()
    try:
        async with httpx.AsyncClient(timeout=10) as client:
            params = {"host": host} if host else {}
            r = await client.get(
                f"{COOKBOOK_BASE_URL}/api/cookbook/gpus",
                params=params, headers=headers,
            )
            if r.status_code >= 400:
                return False
            data = r.json() or {}
    except Exception:
        return False
    for gpu in data.get("gpus") or []:
        used_mb = int(gpu.get("used_mb") or 0)
        # 500 MB threshold: enough to exclude an idle display driver
        # (usually <300 MB) but catch any real allocation.
        if used_mb > 500:
            return True
    return False
def _resolve_event_payload(event: Dict[str, Any], presets: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Turn a calendar event into a serve payload (or None if unschedulable).

    Tries event description's `cookbook:` block first; falls back to a
    case-insensitive preset-name match against the event title.
    """
    parsed = _parse_event_yaml(event.get("description") or "")
    if parsed.get("repo_id") or parsed.get("cmd"):
        return {
            "repo_id": parsed.get("repo_id") or parsed.get("model") or (event.get("summary") or ""),
            "cmd": parsed.get("cmd") or "",
            "remote_host": parsed.get("host") or parsed.get("remote_host") or "",
            "port": parsed.get("port"),
        }
    # Title-based preset lookup.
    title = (event.get("summary") or "").strip()
    if not title:
        return None
    preset_name = parsed.get("preset") or title
    lname = preset_name.lower()
    # Prefer an exact name match, then fall back to a substring match.
    chosen = next(
        (p for p in presets if isinstance(p, dict) and (p.get("name") or "").lower() == lname),
        None,
    )
    if chosen is None:
        chosen = next(
            (p for p in presets if isinstance(p, dict) and lname in (p.get("name") or "").lower()),
            None,
        )
    if chosen is None:
        return None
    cmd = (chosen.get("cmd") or "").strip()
    # Adopted presets have no usable cmd, so they can't be relaunched
    # from the scheduler.
    if not cmd or cmd.startswith("(adopted"):
        logger.info(f"scheduler: preset {preset_name!r} has no cmd; cannot schedule")
        return None
    return {
        "repo_id": chosen.get("model") or chosen.get("modelId") or "",
        "cmd": cmd,
        "remote_host": chosen.get("host") or chosen.get("remoteHost") or "",
        "port": chosen.get("port"),
    }
def _state_path() -> Path:
    return Path("/app/data/cookbook_state.json")


def _read_state() -> Dict[str, Any]:
    p = _state_path()
    if not p.exists():
        return {}
    try:
        return json.loads(p.read_text(encoding="utf-8"))
    except Exception:
        return {}


def _write_state(state: Dict[str, Any]) -> None:
    try:
        from core.atomic_io import atomic_write_json
        atomic_write_json(_state_path(), state)
    except Exception as e:
        logger.warning(f"scheduler: state write failed: {e}")
async def _launch_serve(payload: Dict[str, Any], event_uid: str) -> Optional[str]:
    """Hit /api/model/serve. Returns session_id on success, None on failure."""
    headers = _internal_headers()
    body = {"repo_id": payload["repo_id"], "cmd": payload["cmd"]}
    if payload.get("remote_host"):
        body["remote_host"] = payload["remote_host"]
    # Pull env/gpu/hf_token from the host's saved server entry, same as
    # the chat agent's serve_model does. Without this, vllm can't find
    # its venv binaries.
    try:
        async with httpx.AsyncClient(timeout=10) as c:
            r = await c.get(f"{COOKBOOK_BASE_URL}/api/cookbook/state", headers=headers)
            st = r.json() if r.headers.get("content-type", "").startswith("application/json") else {}
    except Exception:
        st = {}
    env = (st.get("env") or {}) if isinstance(st, dict) else {}
    servers = env.get("servers") or []
    target_host = payload.get("remote_host") or ""
    srv = next(
        (s for s in servers if isinstance(s, dict)
         and (s.get("host") == target_host or s.get("name") == target_host)),
        {},
    )
    if srv.get("env") in ("venv", "conda") and srv.get("envPath"):
        if srv["env"] == "venv":
            body["env_prefix"] = f"source {srv['envPath']}/bin/activate"
        else:
            body["env_prefix"] = f"conda activate {srv['envPath']}"
    if srv.get("hfToken"):
        body["hf_token"] = srv["hfToken"]
    if srv.get("port"):
        body["ssh_port"] = str(srv["port"])
    if srv.get("platform"):
        body["platform"] = srv["platform"]
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            r = await client.post(f"{COOKBOOK_BASE_URL}/api/model/serve", json=body, headers=headers)
            data = r.json() if r.content else {}
    except Exception as e:
        logger.warning(f"scheduler: launch failed for event {event_uid}: {e}")
        return None
    if not data.get("ok"):
        err = data.get("error") or data.get("detail") or "unknown"
        logger.warning(f"scheduler: launch rejected for event {event_uid}: {err}")
        return None
    return data.get("session_id")
async def _stop_serve(session_id: str, host: str) -> None:
    headers = _internal_headers()
    try:
        async with httpx.AsyncClient(timeout=15) as client:
            await client.post(f"{COOKBOOK_BASE_URL}/api/model/stop",
                              json={"session_id": session_id, "remote_host": host},
                              headers=headers)
    except Exception as e:
        logger.warning(f"scheduler: stop failed for {session_id}: {e}")
def _mark_event_status(state: Dict[str, Any], event_uid: str, status: str,
                       reason: str = "", session_id: str = "") -> None:
    """Track per-event reconciliation status in cookbook_state.scheduler.

    Schema:
        state.scheduler.events = {
            "<event_uid>": {
                "status": "running" | "adopted" | "skipped" | "ended" | "failed",
                "reason": "<short string>",
                "session_id": "...",
                "ts": <ms epoch>,
            },
            ...
        }
    """
    sched = state.setdefault("scheduler", {})
    events = sched.setdefault("events", {})
    events[event_uid] = {
        "status": status,
        "reason": reason,
        "session_id": session_id,
        "ts": int(time.time() * 1000),
    }
async def _reconcile_once() -> Dict[str, Any]:
    """One reconciliation pass. Returns a dict for diagnostics + UI.

    Idempotent: running this twice in a row with no event changes
    should produce the same state without double-launching or
    double-killing.
    """
    from src.settings import get_setting
    if not get_setting("cookbook_scheduler_enabled", False):
        return {"skipped": "disabled"}
    calendar_href = get_setting("cookbook_schedule_calendar_href", "") or ""
    if not calendar_href:
        return {"skipped": "no_calendar_configured"}
    now = _now_utc()
    # Look ±90s around now so a 60s tick still picks up events that
    # started 30s ago but haven't been reconciled.
    window_start = now - timedelta(seconds=90)
    window_end = now + timedelta(seconds=90)
    events = await _fetch_calendar_events(calendar_href, window_start, window_end)
    state = _read_state()
    presets = state.get("presets") or []
    sched = state.get("scheduler") or {}
    tracked = sched.get("events") or {}
    out: Dict[str, Any] = {"events": []}
    state_dirty = False
    # Classify each event by where `now` falls relative to its window.
    for ev in events:
        uid = ev.get("uid") or ev.get("id") or ""
        if not uid:
            continue
        ev_start = _parse_iso(ev.get("dtstart") or ev.get("start") or "")
        ev_end = _parse_iso(ev.get("dtend") or ev.get("end") or "")
        if ev_start is None or ev_end is None:
            continue
        in_window = ev_start <= now < ev_end
        just_ended = (ev_end <= now) and (now - ev_end) < timedelta(seconds=90)
        ev_status = (tracked.get(uid) or {}).get("status")
        ev_session = (tracked.get(uid) or {}).get("session_id")
        if just_ended and ev_session and ev_status in {"running", "adopted"}:
            # Window closed → hard-kill (per user choice).
            payload = _resolve_event_payload(ev, presets) or {}
            host = payload.get("remote_host") or ""
            await _stop_serve(ev_session, host)
            _mark_event_status(state, uid, "ended", session_id=ev_session)
            state_dirty = True
            out["events"].append({"uid": uid, "status": "ended", "session_id": ev_session})
            continue
        if not in_window:
            continue
        # In window. Determine whether a serve already exists for this event.
        if ev_status == "running" and ev_session:
            out["events"].append({"uid": uid, "status": "running", "session_id": ev_session})
            continue
        if ev_status == "skipped":
            # User chose: no retry within the window.
            out["events"].append({"uid": uid, "status": "skipped",
                                  "reason": (tracked.get(uid) or {}).get("reason", "")})
            continue
        payload = _resolve_event_payload(ev, presets)
        if payload is None:
            _mark_event_status(state, uid, "failed",
                               reason="no preset or cmd resolvable from event")
            state_dirty = True
            out["events"].append({"uid": uid, "status": "failed", "reason": "no preset"})
            continue
        # Adoption pass: is a non-scheduled serve already running this model?
        target_host = payload.get("remote_host") or ""
        for t in state.get("tasks") or []:
            if not isinstance(t, dict):
                continue
            if t.get("type") != "serve":
                continue
            if (t.get("status") or "").lower() not in {"running", "ready", "loading", "warming"}:
                continue
            if t.get("remoteHost") != target_host:
                continue
            t_model = (t.get("payload") or {}).get("repo_id") or t.get("name") or ""
            if t_model.split("/")[-1] == (payload["repo_id"] or "").split("/")[-1]:
                # Report the same session id that is stored in the state.
                adopted_sid = t.get("sessionId") or t.get("id") or ""
                t[SCHEDULE_OWNER_KEY] = uid
                _mark_event_status(state, uid, "adopted",
                                   reason="pre-existing serve adopted",
                                   session_id=adopted_sid)
                state_dirty = True
                out["events"].append({"uid": uid, "status": "adopted",
                                      "session_id": adopted_sid})
                break
        else:
            # No matching pre-existing serve → fresh launch path.
            if await _gpus_busy(target_host):
                _mark_event_status(state, uid, "skipped",
                                   reason="GPUs busy at launch time")
                state_dirty = True
                out["events"].append({"uid": uid, "status": "skipped",
                                      "reason": "GPUs busy"})
                continue
            sid = await _launch_serve(payload, uid)
            if sid:
                # The serve endpoint registers the new task in the state
                # file, so re-read it, carry over this tick's scheduler
                # bookkeeping, and keep working on the fresh copy.
                # Marking the old copy and writing the fresh one would
                # drop the "running" status.
                fresh_state = _read_state()
                fresh_state["scheduler"] = state.get("scheduler") or {}
                state = fresh_state
                _mark_event_status(state, uid, "running",
                                   reason="launched by scheduler",
                                   session_id=sid)
                # Tag the new task with the schedule owner so window-end
                # cleanup knows this is ours, not a manual launch.
                for t in state.get("tasks") or []:
                    if isinstance(t, dict) and t.get("sessionId") == sid:
                        t[SCHEDULE_OWNER_KEY] = uid
                        break
                _write_state(state)
                state_dirty = False  # we just wrote
                out["events"].append({"uid": uid, "status": "running",
                                      "session_id": sid})
            else:
                _mark_event_status(state, uid, "skipped",
                                   reason="serve_model rejected launch")
                state_dirty = True
                out["events"].append({"uid": uid, "status": "skipped",
                                      "reason": "launch rejected"})
    if state_dirty:
        _write_state(state)
    out["tick_at"] = now.isoformat()
    return out
async def reconcile_loop() -> None:
    """Forever-loop reconciler. Registered as a startup task in app.py."""
    # Stagger the first tick so we don't fight the rest of startup for
    # CPU + I/O.
    await asyncio.sleep(15)
    while True:
        try:
            result = await _reconcile_once()
            if result.get("events"):
                logger.info(f"scheduler tick: {result}")
        except Exception as e:
            logger.warning(f"scheduler tick failed: {e}")
        await asyncio.sleep(60)
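
The event description format from the module docstring can be tried out in isolation with a standalone sketch. The function below follows the same tolerant line-by-line approach as `_parse_event_yaml`, but `parse_cookbook_block` and the sample description are illustrative stand-ins, not part of the commit.

```python
from typing import Dict


def parse_cookbook_block(description: str) -> Dict[str, str]:
    """Collect `key: value` pairs indented under a `cookbook:` line."""
    if "cookbook:" not in description:
        return {}
    lines = description[description.index("cookbook:"):].split("\n")
    out: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.startswith((" ", "\t")):
            # Blank lines before the first key are skipped; any other
            # non-indented line ends the block.
            if line.strip() == "" and not out:
                continue
            break
        key, _, value = line.strip().partition(":")
        value = value.strip().strip("'").strip('"')
        if key and value:
            out[key] = value
    return out


description = (
    "Nightly eval run\n"
    "cookbook:\n"
    "  repo_id: deepseek-ai/DeepSeek-V4-Flash\n"
    "  host: pewds@192.168.1.12\n"
    "  port: 8003\n"
    "notes: unrelated\n"
)
parsed = parse_cookbook_block(description)
```

Every value comes back as a string, including `port`, so a caller that needs an integer has to convert it. Because the split happens on the first colon only, values that themselves contain colons survive intact.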