Cookbook scheduler: calendar events drive model serve windows (experimental, feature-flagged)

Add a calendar-driven scheduler so a user can pick a model in Cookbook, click "Schedule…" instead of "Launch", choose time windows + days of the week + (optional) end date, and have Odysseus auto-launch the serve when the window starts and hard-kill it when the window ends. The calendar IS the source of truth: events on a designated calendar are interpreted as serve schedules, so editing the event in the calendar UI changes the schedule.

Whole feature is gated by setting `cookbook_scheduler_enabled` (default False). Disabling the setting silences the reconciler and the API refuses requests; setting + three new files = entire surface, easy to revert.

New files:
- src/cookbook_scheduler.py: background reconciler. Ticks every 60s, reads calendar events within ±90s of now on the designated calendar, launches/kills serves to match. Honors "refuse if GPUs busy" (skips with reason, no retry). Adopts pre-existing manual serves matching the event's model so window-end cleanup still applies. Tags scheduler-owned tasks with `_scheduledBy: <event_uid>` so it never kills serves it doesn't own.
- routes/cookbook_schedule_routes.py: POST /api/cookbook/schedule/from-cookbook builds RRULE+ICS events from the modal's input (model, slots[], days[], until). GET /upcoming returns the next 24h with per-event status (scheduled / running / adopted / skipped / failed / ended) for the UI. POST /reconcile-now manually kicks the reconciler.
- static/js/cookbookSchedule.js: Schedule button click handler + modal. Daily/hourly time slot picker, multi-slot ("+ add another time slot"), weekday chips with Weekdays/Weekend/Every-day quicksets, optional Until date. Calls /from-cookbook on save. Whole module is a single IIFE; deleting the file plus its <script> tag removes the UI surface.

Existing files touched (minimal):
- app.py: register the new router + add the reconcile loop as a startup task (~10 lines, all in one block). Reconcile loop checks the feature flag on every tick, so leaving it running with the flag off costs ~one settings lookup per minute.
- static/index.html: one new <script> tag for cookbookSchedule.js.
- static/js/cookbookServe.js: add a "Schedule…" button next to the existing Launch button. Hidden by default; cookbookSchedule.js reveals it after confirming the feature flag is on.
- static/style.css: ~80 lines for the modal styles (mobile-aware via @media).

User choices baked in:
- Calendar events are the source of truth.
- Refuse to launch if GPUs busy (skip + log reason in scheduler.events[uid].reason).
- Hard kill at event end.
- No retry on a skipped event within the window.
- Multi-slot per day supported (one calendar event per slot, shared RRULE).
- Pre-existing manual serves get adopted at window start so they're killed at end.

Known follow-ups (not in this commit):
- Settings UI to pick the schedule calendar + toggle the feature flag.
- Calendar event color/badge for status (running/skipped/failed).
- "Lazy launch on first request": currently launches at event start. Replacing _launch_serve with a proxy that defers vllm until the first chat request is a contained future change.
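
For readers unfamiliar with the "shared RRULE" mentioned above, here is a minimal sketch of how weekday chips and an optional Until date could be turned into an RFC 5545 recurrence rule. The route code is not part of this file's diff, so `build_rrule` and the `_BYDAY` mapping are hypothetical names for illustration, not the actual implementation.

```python
from datetime import date
from typing import List, Optional

# Hypothetical mapping from weekday chip labels to RFC 5545 BYDAY codes.
_BYDAY = {"mon": "MO", "tue": "TU", "wed": "WE", "thu": "TH",
          "fri": "FR", "sat": "SA", "sun": "SU"}


def build_rrule(days: List[str], until: Optional[date] = None) -> str:
    """Build a weekly RRULE value (without the 'RRULE:' prefix)."""
    if not days:
        raise ValueError("at least one weekday is required")
    # Normalise labels such as "Mon" or "monday" to their BYDAY code.
    codes = {_BYDAY[d.lower()[:3]] for d in days}
    # Emit codes in calendar order so the rule is stable across calls.
    order = list(_BYDAY.values())
    parts = ["FREQ=WEEKLY", "BYDAY=" + ",".join(sorted(codes, key=order.index))]
    if until is not None:
        # Assumption: the schedule runs through the end of the Until date (UTC).
        parts.append(f"UNTIL={until:%Y%m%d}T235959Z")
    return ";".join(parts)


if __name__ == "__main__":
    print(build_rrule(["Mon", "Wed", "Fri"], date(2026, 6, 30)))
```

With multiple time slots, each slot would become its own event carrying the same rule string, differing only in start and end time.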
2026-06-05 02:35:23 +09:00
parent 9112861d8e
commit a19b6d2d4d
7 changed files with 1023 additions and 0 deletions
--- a/src/cookbook_scheduler.py
+++ b/src/cookbook_scheduler.py
@@ -0,0 +1,456 @@
+"""Cookbook scheduler — calendar-driven model launches.
+
+Calendar events on a designated calendar (configurable via setting
+`cookbook_schedule_calendar_href`) are interpreted as serve schedules.
+The reconciler ticks every ~60s, reads events within ±90s of "now",
+and reconciles the running serves against them:
+
+  - Event starts in window AND no matching serve running → launch via
+    existing /api/model/serve. If GPU is busy, mark event "skipped"
+    with reason. No retry.
+  - Event ends in window AND a scheduled serve is running → hard-kill.
+  - Pre-existing manual serve matching the event's model → adopt it
+    (mark as owned by the event so it gets stopped at window end).
+
+Everything in this module is gated by setting `cookbook_scheduler_enabled`.
+Setting that to False fully disables the feature without touching code.
+
+Event description format (YAML-ish, single nested key):
+  cookbook:
+    preset: Qwen3.5-397B-A17B-AWQ            # or repo_id + cmd + host
+    repo_id: deepseek-ai/DeepSeek-V4-Flash
+    cmd: vllm serve /mnt/HADES/models/...
+    host: pewds@192.168.1.12
+    port: 8003
+
+If only the title is given, the title is matched against saved preset
+names (case-insensitive substring match).
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import logging
+import re
+import time
+from datetime import datetime, timedelta, timezone
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+import httpx
+
+logger = logging.getLogger(__name__)
+
+
+# Schedule-owned tasks are tagged with this so we can tell them apart
+# from manual launches when deciding whether to hard-kill at window end.
+SCHEDULE_OWNER_KEY = "_scheduledBy"
+COOKBOOK_BASE_URL = "http://localhost:7000"
+
+
+def _internal_headers() -> Dict[str, str]:
+    """Match the in-process loopback auth path used by chat-agent tools."""
+    from core.middleware import INTERNAL_TOOL_HEADER, INTERNAL_TOOL_TOKEN
+    return {INTERNAL_TOOL_HEADER: INTERNAL_TOOL_TOKEN}
+
+
+def _parse_event_yaml(description: str) -> Dict[str, Any]:
+    """Pull the `cookbook:` block out of an event description.
+
+    Deliberately tolerant: we don't want a calendar-edit typo (a stray
+    `>`, a tab, etc.) to silently drop the event. Returns {} on any
+    error so the caller falls back to title-match against presets.
+    """
+    if not isinstance(description, str) or "cookbook:" not in description:
+        return {}
+    try:
+        block_start = description.index("cookbook:")
+        block = description[block_start:].split("\n")
+        out: Dict[str, Any] = {}
+        for line in block[1:]:
+            if not line.startswith(("  ", "\t")):
+                # First non-indented line ends the block.
+                if line.strip() == "" and not out:
+                    continue
+                break
+            k, _, v = line.strip().partition(":")
+            v = v.strip().strip("'").strip('"')
+            if k and v:
+                out[k] = v
+        return out
+    except Exception as e:
+        logger.debug(f"event yaml parse failed (ignored): {e}")
+        return {}
+
+
+def _now_utc() -> datetime:
+    return datetime.now(timezone.utc)
+
+
+def _parse_iso(s: str) -> Optional[datetime]:
+    if not s:
+        return None
+    try:
+        # Accept both ISO with and without timezone; assume UTC if naive.
+        s2 = s.replace("Z", "+00:00")
+        dt = datetime.fromisoformat(s2)
+        if dt.tzinfo is None:
+            dt = dt.replace(tzinfo=timezone.utc)
+        return dt
+    except Exception:
+        return None
+
+
+async def _fetch_calendar_events(calendar_href: str, start: datetime, end: datetime) -> List[Dict[str, Any]]:
+    """List events on a single calendar in [start, end].
+
+    Reuses /api/calendar/events. RRULE expansion happens server-side so
+    we get concrete occurrences, not the master recurring event.
+    """
+    headers = _internal_headers()
+    params = {
+        "start": start.isoformat(),
+        "end": end.isoformat(),
+        "calendar": calendar_href,
+    }
+    try:
+        async with httpx.AsyncClient(timeout=15) as client:
+            r = await client.get(
+                f"{COOKBOOK_BASE_URL}/api/calendar/events",
+                params=params, headers=headers,
+            )
+            if r.status_code >= 400:
+                logger.debug(f"calendar/events returned {r.status_code}: {r.text[:200]}")
+                return []
+            data = r.json()
+        return data.get("events", []) if isinstance(data, dict) else []
+    except Exception as e:
+        logger.warning(f"reconciler: failed to fetch calendar events: {e}")
+        return []
+
+
+async def _gpus_busy(host: str) -> bool:
+    """Best-effort: are any GPUs on `host` already under non-trivial load?
+
+    Used to honor "refuse to launch if GPUs busy" semantics. A serve
+    already running our OWN target is adopted before this is called.
+    The check itself is coarse: any GPU with >500 MB in use counts as
+    busy. Fails open (returns False) if the GPU endpoint errors.
+    """
+    headers = _internal_headers()
+    try:
+        async with httpx.AsyncClient(timeout=10) as client:
+            params = {"host": host} if host else {}
+            r = await client.get(
+                f"{COOKBOOK_BASE_URL}/api/cookbook/gpus",
+                params=params, headers=headers,
+            )
+            if r.status_code >= 400:
+                return False
+            data = r.json() or {}
+    except Exception:
+        return False
+    for gpu in data.get("gpus") or []:
+        used_mb = int(gpu.get("used_mb") or 0)
+        # 500 MB threshold: enough to exclude an idle display driver
+        # (usually <300 MB) but catch any real allocation.
+        if used_mb > 500:
+            return True
+    return False
+
+
+def _resolve_event_payload(event: Dict[str, Any], presets: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
+    """Turn a calendar event into a serve payload (or None if unschedulable).
+
+    Tries event description's `cookbook:` block first; falls back to a
+    case-insensitive preset-name match against the event title.
+    """
+    parsed = _parse_event_yaml(event.get("description") or "")
+    if parsed.get("repo_id") or parsed.get("cmd"):
+        return {
+            "repo_id": parsed.get("repo_id") or parsed.get("model") or (event.get("summary") or ""),
+            "cmd": parsed.get("cmd") or "",
+            "remote_host": parsed.get("host") or parsed.get("remote_host") or "",
+            "port": parsed.get("port"),
+        }
+    # Title-based preset lookup.
+    title = (event.get("summary") or "").strip()
+    if not title:
+        return None
+    preset_name = parsed.get("preset") or title
+    lname = preset_name.lower()
+    chosen = next(
+        (p for p in presets if isinstance(p, dict) and (p.get("name") or "").lower() == lname),
+        None,
+    )
+    if chosen is None:
+        chosen = next(
+            (p for p in presets if isinstance(p, dict) and lname in (p.get("name") or "").lower()),
+            None,
+        )
+    if chosen is None:
+        return None
+    cmd = (chosen.get("cmd") or "").strip()
+    # Adopted presets have no usable cmd — they can't be relaunched
+    # from the scheduler.
+    if not cmd or cmd.startswith("(adopted"):
+        logger.info(f"scheduler: preset {preset_name!r} has no cmd; cannot schedule")
+        return None
+    return {
+        "repo_id": chosen.get("model") or chosen.get("modelId") or "",
+        "cmd": cmd,
+        "remote_host": chosen.get("host") or chosen.get("remoteHost") or "",
+        "port": chosen.get("port"),
+    }
+
+
+def _state_path() -> Path:
+    return Path("/app/data/cookbook_state.json")
+
+
+def _read_state() -> Dict[str, Any]:
+    p = _state_path()
+    if not p.exists():
+        return {}
+    try:
+        return json.loads(p.read_text(encoding="utf-8"))
+    except Exception:
+        return {}
+
+
+def _write_state(state: Dict[str, Any]) -> None:
+    try:
+        from core.atomic_io import atomic_write_json
+        atomic_write_json(_state_path(), state)
+    except Exception as e:
+        logger.warning(f"scheduler: state write failed: {e}")
+
+
+async def _launch_serve(payload: Dict[str, Any], event_uid: str) -> Optional[str]:
+    """Hit /api/model/serve. Returns session_id on success, None on failure."""
+    headers = _internal_headers()
+    body = {"repo_id": payload["repo_id"], "cmd": payload["cmd"]}
+    if payload.get("remote_host"):
+        body["remote_host"] = payload["remote_host"]
+    # Pull env/gpu/hf_token from the host's saved server entry, same as
+    # the chat agent's serve_model does. Without this, vllm can't find
+    # its venv binaries.
+    try:
+        async with httpx.AsyncClient(timeout=10) as c:
+            r = await c.get(f"{COOKBOOK_BASE_URL}/api/cookbook/state", headers=headers)
+            st = r.json() if r.headers.get("content-type", "").startswith("application/json") else {}
+    except Exception:
+        st = {}
+    env = (st.get("env") or {}) if isinstance(st, dict) else {}
+    servers = env.get("servers") or []
+    target_host = payload.get("remote_host") or ""
+    srv = next(
+        (s for s in servers if isinstance(s, dict)
+         and (s.get("host") == target_host or s.get("name") == target_host)),
+        {},
+    )
+    if srv.get("env") in ("venv", "conda") and srv.get("envPath"):
+        body["env_prefix"] = f"source {srv['envPath']}/bin/activate" if srv["env"] == "venv" else f"conda activate {srv['envPath']}"
+    if srv.get("hfToken"):
+        body["hf_token"] = srv["hfToken"]
+    if srv.get("port"):
+        body["ssh_port"] = str(srv["port"])
+    if srv.get("platform"):
+        body["platform"] = srv["platform"]
+    try:
+        async with httpx.AsyncClient(timeout=30) as client:
+            r = await client.post(f"{COOKBOOK_BASE_URL}/api/model/serve", json=body, headers=headers)
+            data = r.json() if r.content else {}
+    except Exception as e:
+        logger.warning(f"scheduler: launch failed for event {event_uid}: {e}")
+        return None
+    if not data.get("ok"):
+        err = data.get("error") or data.get("detail") or "unknown"
+        logger.warning(f"scheduler: launch rejected for event {event_uid}: {err}")
+        return None
+    return data.get("session_id")
+
+
+async def _stop_serve(session_id: str, host: str) -> None:
+    headers = _internal_headers()
+    try:
+        async with httpx.AsyncClient(timeout=15) as client:
+            await client.post(f"{COOKBOOK_BASE_URL}/api/model/stop",
+                              json={"session_id": session_id, "remote_host": host},
+                              headers=headers)
+    except Exception as e:
+        logger.warning(f"scheduler: stop failed for {session_id}: {e}")
+
+
+def _mark_event_status(state: Dict[str, Any], event_uid: str, status: str,
+                       reason: str = "", session_id: str = "") -> None:
+    """Track per-event reconciliation status in cookbook_state.scheduler.
+
+    Schema:
+      state.scheduler.events = {
+        "<event_uid>": {
+          "status": "running" | "adopted" | "skipped" | "ended" | "failed",
+          "reason": "<short string>",
+          "session_id": "...",
+          "ts": <ms epoch>,
+        },
+        ...
+      }
+    """
+    sched = state.setdefault("scheduler", {})
+    events = sched.setdefault("events", {})
+    events[event_uid] = {
+        "status": status,
+        "reason": reason,
+        "session_id": session_id,
+        "ts": int(time.time() * 1000),
+    }
+
+
+async def _reconcile_once() -> Dict[str, Any]:
+    """One reconciliation pass. Returns a dict for diagnostics + UI.
+
+    Idempotent: running this twice in a row with no event changes
+    should produce the same state without double-launching or
+    double-killing.
+    """
+    from src.settings import get_setting
+    if not get_setting("cookbook_scheduler_enabled", False):
+        return {"skipped": "disabled"}
+    calendar_href = get_setting("cookbook_schedule_calendar_href", "") or ""
+    if not calendar_href:
+        return {"skipped": "no_calendar_configured"}
+
+    now = _now_utc()
+    # Look ±90s around now so a 60s tick still picks up events that
+    # started 30s ago but haven't been reconciled.
+    window_start = now - timedelta(seconds=90)
+    window_end = now + timedelta(seconds=90)
+    events = await _fetch_calendar_events(calendar_href, window_start, window_end)
+    state = _read_state()
+    presets = state.get("presets") or []
+    sched = state.get("scheduler") or {}
+    tracked = sched.get("events") or {}
+
+    out: Dict[str, Any] = {"events": []}
+    state_dirty = False
+
+    # Classify each event by where `now` falls relative to its window.
+    for ev in events:
+        uid = ev.get("uid") or ev.get("id") or ""
+        if not uid:
+            continue
+        ev_start = _parse_iso(ev.get("dtstart") or ev.get("start") or "")
+        ev_end = _parse_iso(ev.get("dtend") or ev.get("end") or "")
+        if ev_start is None or ev_end is None:
+            continue
+        in_window = ev_start <= now < ev_end
+        just_ended = (ev_end <= now) and (now - ev_end) < timedelta(seconds=90)
+        ev_status = (tracked.get(uid) or {}).get("status")
+        ev_session = (tracked.get(uid) or {}).get("session_id")
+
+        if just_ended and ev_session and ev_status in {"running", "adopted"}:
+            # Window closed → hard-kill (per user choice).
+            payload = _resolve_event_payload(ev, presets) or {}
+            host = payload.get("remote_host") or ""
+            await _stop_serve(ev_session, host)
+            _mark_event_status(state, uid, "ended", session_id=ev_session)
+            state_dirty = True
+            out["events"].append({"uid": uid, "status": "ended", "session_id": ev_session})
+            continue
+
+        if not in_window:
+            continue
+
+        # In window. Determine whether a serve already exists for this event.
+        if ev_status in {"running", "adopted"} and ev_session:
+            out["events"].append({"uid": uid, "status": ev_status, "session_id": ev_session})
+            continue
+        if ev_status == "skipped":
+            # User chose: no retry within the window.
+            out["events"].append({"uid": uid, "status": "skipped",
+                                  "reason": (tracked.get(uid) or {}).get("reason", "")})
+            continue
+
+        payload = _resolve_event_payload(ev, presets)
+        if payload is None:
+            _mark_event_status(state, uid, "failed",
+                               reason="no preset or cmd resolvable from event")
+            state_dirty = True
+            out["events"].append({"uid": uid, "status": "failed", "reason": "no preset"})
+            continue
+
+        # Adoption pass: is a non-scheduled serve already running this model?
+        target_host = payload.get("remote_host") or ""
+        for t in state.get("tasks") or []:
+            if not isinstance(t, dict):
+                continue
+            if t.get("type") != "serve":
+                continue
+            if (t.get("status") or "").lower() not in {"running", "ready", "loading", "warming"}:
+                continue
+            if (t.get("remoteHost") or "") != target_host:
+                continue
+            t_model = (t.get("payload") or {}).get("repo_id") or t.get("name") or ""
+            if t_model.split("/")[-1] == (payload["repo_id"] or "").split("/")[-1]:
+                t[SCHEDULE_OWNER_KEY] = uid
+                _mark_event_status(state, uid, "adopted",
+                                   reason="pre-existing serve adopted",
+                                   session_id=t.get("sessionId") or t.get("id") or "")
+                state_dirty = True
+                out["events"].append({"uid": uid, "status": "adopted",
+                                      "session_id": t.get("sessionId") or t.get("id") or ""})
+                break
+        else:
+            # No matching pre-existing serve → fresh launch path.
+            if await _gpus_busy(target_host):
+                _mark_event_status(state, uid, "skipped",
+                                   reason="GPUs busy at launch time")
+                state_dirty = True
+                out["events"].append({"uid": uid, "status": "skipped",
+                                      "reason": "GPUs busy"})
+                continue
+            sid = await _launch_serve(payload, uid)
+            if sid:
+                _mark_event_status(state, uid, "running",
+                                   reason="launched by scheduler",
+                                   session_id=sid)
+                # Re-read state (the new task should be there now), carry this
+                # pass's event statuses over, and tag the task as schedule-owned.
+                fresh_state = _read_state()
+                fresh_state["scheduler"] = state.get("scheduler") or {}
+                for t in fresh_state.get("tasks") or []:
+                    if isinstance(t, dict) and t.get("sessionId") == sid:
+                        t[SCHEDULE_OWNER_KEY] = uid
+                        break
+                _write_state(fresh_state)
+                state, state_dirty = fresh_state, False  # written; later marks build on it
+                out["events"].append({"uid": uid, "status": "running",
+                                      "session_id": sid})
+            else:
+                _mark_event_status(state, uid, "skipped",
+                                   reason="serve_model rejected launch")
+                state_dirty = True
+                out["events"].append({"uid": uid, "status": "skipped",
+                                      "reason": "launch rejected"})
+
+    if state_dirty:
+        _write_state(state)
+    out["tick_at"] = now.isoformat()
+    return out
+
+
+async def reconcile_loop() -> None:
+    """Forever-loop reconciler. Registered as a startup task in app.py."""
+    # Stagger the first tick so we don't fight the rest of startup for
+    # CPU + I/O.
+    await asyncio.sleep(15)
+    while True:
+        try:
+            result = await _reconcile_once()
+            if result.get("events"):
+                logger.info(f"scheduler tick: {result}")
+        except Exception as e:
+            logger.warning(f"scheduler tick failed: {e}")
+        await asyncio.sleep(60)
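
The window test inside `_reconcile_once` is easiest to reason about as a pure function. The sketch below isolates it under the same rules the diff uses: a half-open `[start, end)` interval for "in window" and a 90-second grace period after the end for "just ended". `classify_window` and `GRACE` are hypothetical names introduced here, not part of the commit.

```python
from datetime import datetime, timedelta, timezone

# Same 90s look-back the reconciler uses so a 60s tick cannot miss an end.
GRACE = timedelta(seconds=90)


def classify_window(now: datetime, ev_start: datetime, ev_end: datetime,
                    grace: timedelta = GRACE) -> str:
    """Return 'in_window', 'just_ended', or 'outside' for one occurrence."""
    if ev_start <= now < ev_end:
        # Start is inclusive, end is exclusive.
        return "in_window"
    if ev_end <= now and (now - ev_end) < grace:
        # The window closed within the grace period: time to stop the serve.
        return "just_ended"
    return "outside"


if __name__ == "__main__":
    start = datetime(2026, 6, 5, 9, 0, tzinfo=timezone.utc)
    end = start + timedelta(hours=1)
    for offset in (-1, 0, 3600, 3689, 3690):
        now = start + timedelta(seconds=offset)
        print(offset, classify_window(now, start, end))
```

Because the end is exclusive, an occurrence that ends exactly at `now` is treated as just ended rather than still running, which is what lets back-to-back slots hand over without overlapping.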