Office documents were dropped server-side: .docx fell through to "[Attached document file]", .xlsx/.pptx weren't recognized at all, and the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the chat-attachment path (build_user_content) and the RAG indexer (personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown. It is lazy-imported with graceful fallback (mirrors src/pdf_runtime.py): without it those formats show an "install to extract" banner and the MIT core is unaffected. pypdf stays the default PDF path.

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
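The lazy-import-with-fallback behaviour described in the commit message can be sketched roughly as follows. This is not the contents of src/markitdown_runtime.py: the `convert_to_markdown` signature, the `load` parameter, the `_load_converter` and `describe_attachment` helpers, and the banner wording are all assumptions made for illustration. Only `MarkItDown().convert(...).text_content` is taken from the markitdown package's documented Python API.

```python
from functools import lru_cache
from pathlib import Path
from typing import Callable, Optional, Union

# Hypothetical banner text; the commit only says an "install to extract"
# banner is shown when markitdown is missing.
INSTALL_BANNER = "[Office document: install markitdown to extract text]"


@lru_cache(maxsize=1)
def _load_converter():
    """Import markitdown on first use; return None when it is not installed."""
    try:
        from markitdown import MarkItDown  # optional dependency
    except ImportError:
        return None
    return MarkItDown()


def convert_to_markdown(
    path: Union[str, Path],
    load: Callable[[], object] = _load_converter,
) -> Optional[str]:
    """Return Markdown for an Office/EPUB file, or None if it cannot be extracted.

    `load` is an assumed injection point so the fallback can be exercised
    without installing or uninstalling the optional package.
    """
    converter = load()
    if converter is None:
        return None
    try:
        return converter.convert(str(path)).text_content
    except Exception:
        # Treat a conversion failure like a missing dependency:
        # the caller falls back instead of crashing.
        return None


def describe_attachment(
    path: Union[str, Path],
    load: Callable[[], object] = _load_converter,
) -> str:
    """Return extracted Markdown, or the install banner when none is available."""
    text = convert_to_markdown(path, load=load)
    return text if text else INSTALL_BANNER
```

Keeping the import inside the loader means the core application never fails at import time when the optional package is absent; callers only see `None` (or the banner) for the affected formats.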
26 lines
827 B
Python
from pathlib import Path

from src import personal_docs


def test_personal_index_includes_office_uploads(tmp_path, monkeypatch):
    docx_path = tmp_path / "report.docx"
    docx_path.write_bytes(b"PK fake docx bytes")

    monkeypatch.setattr(
        personal_docs,
        "extract_office_text",
        lambda path: "# Report\n\nreadable office text" if Path(path) == docx_path else "",
    )

    files = personal_docs.load_personal_index(str(tmp_path))

    assert [item["name"] for item in files] == ["report.docx"]
    assert files[0]["path"] == str(docx_path)
    assert files[0]["chunks"] == ["# Report\n\nreadable office text"]


def test_personal_index_default_extensions_advertise_office_support():
    for ext in (".docx", ".pptx", ".xlsx", ".xls"):
        assert ext in personal_docs.config.DEFAULT_EXTENSIONS