Commit Graph

10 Commits

Author SHA1 Message Date
Afonso Coutinho
a880b17624 Skip malformed personal keyword index rows
Make personal keyword retrieval tolerate corrupted non-dict index entries and missing chunk lists, with regression coverage.
2026-06-03 13:42:05 +09:00
Ethan
0e538ecd29 Fix RAG remove_directory wiping the entire shared collection (#1660) (#1734)
Removing one RAG directory destroyed the whole shared ChromaDB collection
(all owners + base index) instead of just that directory's chunks. Shared
root cause: PersonalDocsManager.remove_directory called rebuild_index()
(delete_collection + recreate) then re-indexed only the remaining tracked
dirs (ownerless, never personal_dir). The targeted VectorRAG.remove_directory
that should have been used was itself broken (where={"source":{"$contains":dir}}
selects nothing on scalar metadata and would over-delete siblings), and the
dead do_manage_rag path fired a second unconditional rebuild.

- VectorRAG.remove_directory: select chunks in Python by a path-boundary match
  on the stored absolute `source` (dir or dir+os.sep), abspath-normalized.
  Keys on `source` (always written), never `owner` -- no migration.
- PersonalDocsManager.remove_directory: call the targeted remove instead of
  rebuild_index() + partial reindex.
- do_manage_rag (dead code): drop the second rebuild_index() (hygiene).
- rag_server.py add path: abspath so indexed `source` matches the remove.

No schema change. Prevents future wipes (does not recover already-wiped
vectors). Adds hermetic regression tests at three layers.

Fixes #1660

Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
2026-06-03 13:29:51 +09:00
Afonso Coutinho
82c09dd768 fix: split_chunks emits a duplicate trailing chunk for text over size-overlap (#1573) 2026-06-03 08:57:54 +09:00
red person
f39c87561b Save only string personal doc paths (#1566) 2026-06-03 08:37:29 +09:00
red person
fd37ccebae Ignore invalid personal docs state (#1401) 2026-06-03 04:02:16 +09:00
SurprisedDuck
62f06ab740 Docs: respect path boundary when clearing exclusions
add_directory cleared exclusions with a raw path.startswith(directory)
test, which also matched sibling directories sharing a name prefix —
adding /docs would silently un-exclude files under /docs2. Match the
directory itself or paths under it (directory + os.sep) instead.
2026-06-02 20:35:44 +09:00
Marius Oppedal Ringsby
f58fbc8b85 Add optional markitdown extraction for Office/EPUB documents (#766)
Office documents were dropped server-side: .docx fell through to
"[Attached document file]", .xlsx/.pptx weren't recognized at all, and
the personal-docs RAG index only covered txt/md/json/pdf.

Wire the optional markitdown dependency (MIT, Microsoft) into both the
chat-attachment path (build_user_content) and the RAG indexer
(personal_docs), converting .docx/.xlsx/.pptx/.xls/.epub to Markdown.
It is lazy-imported with graceful fallback (mirrors src/pdf_runtime.py):
without it those formats show an "install to extract" banner and the
MIT core is unaffected. pypdf stays the default PDF path.

- src/markitdown_runtime.py: optional-dep loader + convert_to_markdown
- upload_handler: recognize Office/EPUB extensions + MIME types
- document_processor: extract Office docs in the chat else-branch
- personal_docs: index Office docs (DEFAULT_EXTENSIONS + dispatch)
- requirements-optional.txt + ACKNOWLEDGMENTS.md: pinned markitdown 0.1.5
- tests: markitdown_runtime + office index coverage

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 11:28:52 +09:00
red person
e1102585bf Fix chat stream recovery and PDF library indexing (#468) 2026-06-01 22:33:35 +09:00
pewdiepie-archdaemon
0888a3b3e6 Add native Windows compatibility layer 2026-06-01 15:09:47 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00