fix(history): scope topic analysis to authenticated owner only (#744)

Two changes close the cross-tenant topic leak in /api/conversations/topics.

The route at routes/history_routes.py:478 used get_current_user, which
returns None when no auth middleware has set request.state.current_user
(loopback-bypass, AUTH_ENABLED=false, or any path that short-circuits the
middleware). It then forwarded owner=None to analyze_topics.

The helper at src/topic_analyzer.py:21 used an 'if owner:' short-circuit
in its owner filter, so the None owner took the no-filter path and the
helper silently aggregated topic frequencies and per-snippet session_id,
session_name, role, and snippet text across every user's sessions.
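
A minimal sketch of the failure mode described above, using a hypothetical
in-memory session table (the session ids, owners, and the function name
visible_sessions_old are illustrative and not taken from the codebase):

```python
from typing import Dict, List, Optional

# Hypothetical sessions; ids and owners are made up for illustration.
SESSIONS: Dict[str, dict] = {
    "s1": {"owner": "alice"},
    "s2": {"owner": "bob"},
    "s3": {"owner": None},
}


def visible_sessions_old(owner: Optional[str]) -> List[str]:
    """Old shape: the ownership check only runs when owner is truthy."""
    visible = []
    for session_id, data in SESSIONS.items():
        if owner:
            if data.get("owner") != owner:
                continue
        # With owner=None the guard above is skipped, so every session lands here.
        visible.append(session_id)
    return visible


print(visible_sessions_old("alice"))  # ['s1']
print(visible_sessions_old(None))     # ['s1', 's2', 's3']
```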

analyze_topics now returns an empty result when owner is falsy. The
inner short-circuit is removed because the filter is now strict by
construction. The route is switched to require_user, which raises 401
when auth_manager.is_configured is True and the caller is anonymous,
matching the pattern used by calendar_routes, skills_routes, and other
authenticated routes.
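
The helper-side change can be sketched roughly as follows. This is a
simplified stand-in and not the real analyze_topics: the keyword table, the
plain-dict sessions argument, and the shape of each entry in "topics" are
assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical keyword table; the real TOPIC_KEYWORDS is not shown in this commit.
TOPIC_KEYWORDS = {"billing": ["invoice"], "travel": ["flight"]}


def analyze_topics_sketch(sessions: Dict[str, dict], owner: Optional[str] = None) -> Dict[str, Any]:
    # Early return: a None or empty owner never reaches the aggregation loop.
    if not owner:
        return {"topics": [], "total_topics": 0}
    counts = {topic: 0 for topic in TOPIC_KEYWORDS}
    for data in sessions.values():
        if data.get("archived", False):
            continue
        # Strict filter: no 'if owner:' guard, so a mismatch is always skipped.
        if data.get("owner") != owner:
            continue
        for msg in data.get("history", []):
            text = str(msg.get("content", "")).lower()
            for topic, words in TOPIC_KEYWORDS.items():
                if any(word in text for word in words):
                    counts[topic] += 1
    topics = [{"topic": t, "count": c} for t, c in counts.items() if c > 0]
    return {"topics": topics, "total_topics": len(topics)}
```

Under these assumptions, calling the sketch with owner=None or owner=""
returns the empty result even when other users' sessions contain matching
messages.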

The test test_history_topics_owner_scope.py was rewritten to drive the
real route through FastAPI's TestClient with a stub AuthMiddleware that
mirrors the loopback-bypass branch, and now asserts a strict 401 from
the route and an empty result from the helper. The previous version of
the test accepted either a 200-with-empty-topics or a 401; the strict
assertion means a future regression that drops the require_user wrapper
or re-adds the inner short-circuit is caught immediately.
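
To illustrate why the strict assertion matters, here is a small sketch that
compares the two assertion shapes against a hypothetical regression. The
function names and the regressed response are assumptions for illustration
and do not reproduce the actual test file.

```python
from typing import Any, Dict


def permissive_assertion(status: int, body: Dict[str, Any]) -> bool:
    """Previous shape: a 401 or a 200 with empty topics both pass."""
    return status == 401 or (status == 200 and body.get("topics") == [])


def strict_assertion(status: int, body: Dict[str, Any]) -> bool:
    """New shape: only a 401 passes for an anonymous caller."""
    return status == 401


# Hypothetical regression: the route loses require_user, but the helper's
# early return still yields an empty result, so the route answers 200.
regressed_status, regressed_body = 200, {"topics": [], "total_topics": 0}

print(permissive_assertion(regressed_status, regressed_body))  # True
print(strict_assertion(regressed_status, regressed_body))      # False
```

In this scenario the permissive check would still pass, hiding the missing
wrapper, while the strict check fails.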
Ernest Hysa
2026-06-02 03:36:01 +01:00
committed by GitHub
parent 1cc2e90ac0
commit 360bc83a66
3 changed files with 300 additions and 9 deletions


@@ -23,20 +23,31 @@ def analyze_topics(session_manager, owner: str = None) -> Dict[str, Any]:
     Scan non-archived sessions and return topic frequency data.
     If owner is set, only include sessions belonging to that user.
+    When `owner` is None or empty the helper returns an empty result. The
+    unauthenticated-loopback path in `app.py` produces a None owner, and
+    silently aggregating topic frequencies in that case is a cross-tenant
+    data leak. Callers that want a system-wide aggregate must pass an
+    explicit `owner` string (e.g. a documented "admin" pseudo-owner) or
+    the route must reject the request with 401.
     Returns dict with "topics" list and "total_topics" count.
     """
+    if not owner:
+        return {"topics": [], "total_topics": 0}
     topic_counts: Dict[str, int] = {t: 0 for t in TOPIC_KEYWORDS}
     topic_matches: Dict[str, list] = {t: [] for t in TOPIC_KEYWORDS}
     for session_id, session_data in session_manager.sessions.items():
         if session_data.get("archived", False):
             continue
-        # SECURITY: strict ownership — the previous predicate let any
-        # null-owner session feed into another user's topic analysis.
-        if owner:
-            sess_owner = session_data.get("owner") or getattr(session_data, "owner", None)
-            if sess_owner != owner:
-                continue
+        # Strict ownership: any session whose owner does not match the
+        # caller is excluded. Ownerless sessions are never included
+        # unless the caller is itself ownerless (which the early return
+        # above already prevents).
+        sess_owner = session_data.get("owner") or getattr(session_data, "owner", None)
+        if sess_owner != owner:
+            continue
         for msg in session_data.get("history", []):
             content_raw = msg.get("content") if isinstance(msg, dict) else getattr(msg, "content", None)