Commit Graph

1 Commits

Author SHA1 Message Date
Ernest Hysa
064c1ace91 Uploads: write uploads index atomically
* fix(upload): atomic-rename writes for uploads.json + .bak recovery

UploadHandler.save_upload does a read-modify-write of uploads.json via
two open(..., 'w') + json.dump blocks, with no lock, no temp+rename, and
no recovery. N concurrent inserts lost N-1 entries (last writer wins
after the read snapshot is taken); a SIGKILL/SIGTERM mid-json.dump
truncated the file and the bare 'except Exception: logger.warning(...)'
recovery path returned {}, silently dropping every prior upload.

The handler now serialises the RMW under a per-instance threading.Lock
and writes through _atomic_write_json, which writes to a tempfile in
the same directory, fsyncs, snapshots the previous live to .bak, and
renames the temp onto the target via os.replace. os.replace is atomic
on POSIX, so a reader sees either the old or the new state, never a
half-written file. _load_upload_index tries the live file first, then
falls back to the .bak sibling if the live is corrupt.

Cross-process safety is still on the deployer: gunicorn workers on
the same uploads dir will race the lock, and the atomic-rename is the
kernel-level guarantee that prevents torn reads. If multi-worker
writes are expected, fcntl.flock around the rename is a follow-up;
single-worker and async deployments are correct as-is.

* fix(upload): reload uploads.json inside _index_lock on dedupe path

The duplicate-detection branch in save_upload() was reading uploads.json
*before* taking _index_lock, then writing that stale snapshot under the
lock. A duplicate upload racing with a new-entry insert could clobber
the new entry because the duplicate's snapshot predated the insert.

The new-entry branch already reloaded inside the lock; the duplicate
branch now does the same. It also re-resolves the storage key inside
the lock, because a concurrent insert can have changed the dict's keys.

If the entry has been cleaned up between the outer read and the inner
write, the function falls through to the fresh-insert path instead of
silently writing a stale row.

Boundary note: the _index_lock serialises writers within a single
Python process. Cross-process / multi-worker deployments still need
flock or a database; the inline comment is updated to make this
explicit. The atomic-rename write keeps the on-disk state consistent
but does not serialise writers across processes.

Tests:
- Existing concurrent-insert and partial-write-recovery tests still pass.
- New test_atomic_write_primitives_present_in_production_code asserts
  the production module has at least two 'with self._index_lock:' blocks
  (regression net for this fix).
- New smoke tests: normal upload, duplicate detection, info lookup
  after a backup-recovery scenario.
2026-06-02 20:51:39 +09:00