security: sanitize rendered research-report HTML (#364)

The visual research report is assembled from LLM output over crawled web
pages (untrusted content) and served under a relaxed `script-src
'unsafe-inline'` CSP. Two values reached that HTML without sanitization:

- `_md_to_html` rendered the report markdown via python-markdown, which
  passes raw HTML through verbatim, so `<script>` / `<img onerror>` /
  `<svg onload>` / `javascript:` links carried in crawled content ran in
  the app origin.
- `category` (from the /api/research/start request body, no enum check) was
  interpolated raw into `<body class="category-{category}">`.
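
A minimal sketch of the `category` problem, assuming the body tag is built with a plain f-string. The helper names `body_tag_unsafe` and `body_tag_safe` and the payload are hypothetical, chosen only for illustration; they are not taken from `src/visual_report.py`.

```python
import html


def body_tag_unsafe(category: str) -> str:
    # Hypothetical helper: interpolates the request value verbatim.
    return f'<body class="category-{category}">'


def body_tag_safe(category: str) -> str:
    # Hypothetical helper: escapes &, <, > and both quote characters first.
    return f'<body class="category-{html.escape(category, quote=True)}">'


# Illustrative payload: the double quote closes the class attribute early.
payload = 'x" onload="alert(1)'

print(body_tag_unsafe(payload))  # the payload becomes a separate onload attribute
print(body_tag_safe(payload))    # the payload stays inside the class value
```

With the quote escaped to `&quot;`, the value can no longer terminate the `class` attribute, so no event-handler attribute is created.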

Allowlist-sanitize the rendered markdown with nh3, keeping the formatting
the report emits (tables, code, details/summary, toc anchors, codehilite
classes, external-link target/rel) while dropping active content, and
`html.escape` the category. Add regression tests.
Joeseph Grey
2026-06-04 06:42:49 -06:00
committed by GitHub
parent 594775dc4b
commit fa1fe7f866
3 changed files with 106 additions and 2 deletions

@@ -21,6 +21,10 @@ youtube-transcript-api
# Markdown rendering for research reports (src/visual_report.py).
# Imported at module-top so it's a hard core dep, not optional.
markdown
# HTML sanitizer for rendered research reports (src/visual_report.py). Report
# content is untrusted (LLM output over crawled pages) and report pages run
# under a relaxed CSP, so the rendered HTML is allowlist-sanitized.
nh3
# Calendar .ics import/export (routes/calendar_routes.py).
icalendar
# Recurrence rule expansion for calendar events (routes/calendar_routes.py).