odysseus/tests/test_document_pdf_marker.py at dev

SurprisedDuck 78747b56ca Documents: strip PDF marker without corrupting text

_process_pdf prepends "\n\n[PDF content]:" to extracted text, and two
call sites in document_routes.py stripped it with .lstrip("\n[PDF content]:").
str.lstrip(chars) treats its argument as a *set of characters*, so it keeps
eating into the page text that follows the marker. For example, a body
starting with "to the board" is left as "he board", because 't', 'o', and
the space are all in the marker's character set. Replace both sites with a
shared strip_pdf_content_marker() helper that uses str.removeprefix.
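
The difference between the two calls can be reproduced in isolation. The snippet below is a minimal sketch, not the code from document_routes.py: the marker string is the one quoted in the message above, while the body of strip_pdf_content_marker() is an assumption about what such a helper might look like. The real helper may handle more cases, and str.removeprefix requires Python 3.9 or later.

```python
# Marker text as quoted in the commit message.
PDF_MARKER = "\n\n[PDF content]:"


def strip_pdf_content_marker(text: str) -> str:
    """Hypothetical helper body: drop the marker only when it is an exact prefix."""
    return text.removeprefix(PDF_MARKER)


extracted = PDF_MARKER + "to the board"

# Old behaviour: the argument is a set of characters, so stripping continues
# past the marker for as long as the page text starts with characters from it.
buggy = extracted.lstrip("\n[PDF content]:")

# New behaviour: only the literal prefix is removed.
fixed = strip_pdf_content_marker(extracted)

print(repr(buggy))  # 'he board'
print(repr(fixed))  # 'to the board'
```

Text without the marker passes through removeprefix unchanged, so the helper can be called whether or not the marker is present.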

2026-06-02 20:35:27 +09:00

1.2 KiB
