Commit Graph

3 Commits

Author SHA1 Message Date
Alexandre Teixeira
ea1079e1df docs: fix stale documentation references (#1769) 2026-06-03 13:23:21 +09:00
Marius Oppedal Ringsby
f58fbc8b85 Add optional markitdown extraction for Office/EPUB documents (#766)
Office documents were dropped server-side: .docx fell through to
"[Attached document file]", .xlsx/.pptx weren't recognized at all, and
the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the
chat-attachment path (build_user_content) and the RAG indexer
(personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown.
It is lazy-imported with graceful fallback (mirrors src/pdf_runtime.py):
without it those formats show an "install to extract" banner and the
MIT core is unaffected. pypdf stays the default PDF path.

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 11:28:52 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00