Commit Graph

3 Commits

Author SHA1 Message Date
pewdiepie-archdaemon
ed7956cbd3 Owner-scope RAG doc ids so identical chunks across users don't collide (#1738, #1760)
_generate_doc_id hashed only text. add_document / add_documents_batch
early-return when the id exists, so the second owner indexing a
byte-identical chunk hit the first owner's id, was silently dropped,
and never stored under their owner — their owner-filtered search then
quietly omitted it. Hash owner + text; empty owner reproduces the
legacy id, so the unowned/base index keeps existing ids and isn't
re-churned. Same-owner identical chunks still dedupe.

Caught by #1738 and #1760 (independent reports of the same bug).
2026-06-03 11:36:31 +09:00
Tatlatat
dc8a882f1f fix(rag): use a stable hash for document IDs so dedup survives restarts (#1098)
add_document() and add_documents_batch() derive the persistent ChromaDB
document id from Python's built-in hash():

    doc_id = f"doc_{hash(text) % 10**16}"

str hashing is randomized per process (PYTHONHASHSEED is on by default), so
the same document text gets a different doc_id on every restart. The dedup
check right after — self._collection.get(ids=[doc_id]) — therefore misses
on restart, and identical documents are re-embedded and re-added as
duplicates each time the app restarts, bloating the vector store and
skewing retrieval.

Derive the id from a stable hashlib.sha256 of the text via a shared
_generate_doc_id() helper, used by both add paths so they agree.

tests/test_rag_vector_id_stability.py runs _generate_doc_id in subprocesses
under PYTHONHASHSEED=0/1/random and asserts the id is identical across all
of them (and differs for different text). Fails before this change.
2026-06-02 22:42:23 +09:00
pewdiepie-archdaemon
e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00