Context
Follow-up to #1442 (data-loss bug in `dj.gc.scan` / `collect`, fixed in #1444). After that fix, `scan` correctly identifies referenced files; this issue tracks the broader concern that GC remains not transaction-safe even with the type mismatch resolved.
There is a TOCTOU window between when `scan_*_references` enumerates references and when `collect` deletes orphans. A concurrent transaction that inserts a row referencing previously-orphaned content during that window will have its file deleted out from under it. This was raised by @dimitri-yatsenko during review of the #1442 root-cause analysis:

> In general, garbage collection is not 100% safe even after fixing since it's not transaction-safe. We need to implement a process of garbage retrieval for a bit and then a second step for complete removal.
Proposed direction: two-phase API
Replace the single-call collect() with a quarantine → grace window → purge state machine. The single-call form can stay as a deprecated convenience, but production usage should be the two-phase form.
Phase 1: `quarantine()`
- Runs the same orphan scan that `scan` does today (now correct after #1442).

Two options for state persistence — both worth a design pass:
- Storage prefix (`_trash/`). Move objects into a `_trash/` prefix under the same store. Atomic per object. Restore is a move-back. Storage layout encodes the quarantine state directly; no separate metadata table to keep in sync.
- Database table (`_gc_quarantine`). Record candidates in a project-level (cross-schema) table: `(path, store, quarantined_at, source_schema, source_table, source_attribute)`. Storage layout unchanged; restore is leave-in-place + delete the row.
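To make the table option concrete, a sketch of the record shape and the dedup key it implies (field names are the ones proposed above; everything else here is an assumption, not an existing schema):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class QuarantineRecord:
    """One row of the proposed _gc_quarantine table (sketch only)."""
    path: str              # object key within the store
    store: str             # store name; one project may use several stores
    quarantined_at: datetime
    source_schema: str     # where the dangling reference was last seen
    source_table: str
    source_attribute: str


def quarantine_key(rec: QuarantineRecord) -> tuple[str, str]:
    # (store, path) is the natural primary key: the same object must not be
    # quarantined twice, regardless of which schema referenced it.
    return (rec.store, rec.path)


rec = QuarantineRecord(
    path="blobs/ab/cd/abcd1234",
    store="raw",
    quarantined_at=datetime.now(timezone.utc),
    source_schema="lab_ephys",
    source_table="Recording",
    source_attribute="raw_file",
)
```

Keying on `(store, path)` rather than the source columns also makes re-running `quarantine()` idempotent, which matters for the interrupted-run question below.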
Phase 2: `purge()`
- Only deletes candidates whose `quarantined_at` is older than `grace_seconds` (sensible default >= 24h, configurable via `dj.config["gc.grace_seconds"]` or similar).
- Re-checks references before deleting (the same scan fixed in #1442). If a row inserted during the grace window now references the candidate, unquarantine it instead of deleting.
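A sketch of one purge sweep under these rules; the function shape and the `is_referenced` callback are assumptions for illustration, not existing API:

```python
import time

GRACE_SECONDS = 24 * 3600  # stand-in for dj.config["gc.grace_seconds"]


def purge_pass(quarantined, is_referenced, now=None, grace_seconds=GRACE_SECONDS):
    """One purge sweep over quarantine records.

    quarantined: iterable of (path, quarantined_at) pairs
    is_referenced: callback re-running the reference check at purge time
    Returns (paths_to_delete, paths_to_restore, paths_left_alone).
    """
    now = time.time() if now is None else now
    delete, restore, keep = [], [], []
    for path, quarantined_at in quarantined:
        if now - quarantined_at < grace_seconds:
            keep.append(path)        # still inside the grace window
        elif is_referenced(path):
            restore.append(path)     # a concurrent insert picked it up
        else:
            delete.append(path)      # orphaned past the grace window
    return delete, restore, keep
```

Returning the three buckets instead of deleting inline keeps the sweep dry-runnable, matching the existing `dry_run=True` default.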
Phase 0 / convenience: `restore(path_or_filter)`
- Explicit unquarantine for operator use: pulled-too-soon recovery, debugging, manual reclaim.
Open design questions
- Cross-schema scope. Quarantine state spans every schema sharing the store. The `_gc_quarantine` table needs a project-level home (not per-schema), or each scan must look up quarantine state across schemas. The `_trash/` prefix variant sidesteps this naturally.
- Concurrent-insert handling. What happens if a row is inserted referencing a quarantined path during the grace window? Phase 2's re-check covers it, but should we also block the insert at write time? Probably not — the re-check is cheaper than coordination — but worth deciding explicitly.
- Recovery from interrupted runs. The state machine must be resumable: a `quarantine()` killed mid-run should leave the system in a defined state, and the next call should pick up where the previous one stopped.
- Storage backend uniformity. The `_trash/` prefix needs atomic move semantics on every supported backend (local, S3, UC Volumes). Most fsspec backends provide this; it should be verified per backend.
- CLI ergonomics. `dj.gc.quarantine(...)` / `dj.gc.purge(...)` / `dj.gc.restore(...)` / `dj.gc.format_quarantine_stats(...)`. Same `*schemas` + `store_name` shape as today.
- Backwards compatibility. Keep `collect(dry_run=False)` as a single-call shorthand that does quarantine + purge in sequence (with `grace_seconds=0`), but emit a `DeprecationWarning` recommending the two-phase form for production. Default `dry_run=True` already protects against accidental runs.
- Operational visibility. Quarantine listing / stats should be queryable: how much is currently quarantined, oldest item, stores touched, expected purge time.
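On the backend-uniformity question, the `_trash/` variant is a single atomic rename on a local filesystem; a sketch (paths and helper names are illustrative, and S3-style backends would need a copy-then-delete fallback, which is exactly the per-backend verification called for above):

```python
import os
from pathlib import Path

TRASH_PREFIX = "_trash"  # assumed quarantine prefix under the store root


def quarantine_local(store_root: str, rel_path: str) -> str:
    """Atomically move one object into _trash/ on a local filesystem."""
    src = Path(store_root) / rel_path
    dst = Path(store_root) / TRASH_PREFIX / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.replace(src, dst)  # atomic rename within one filesystem (POSIX)
    return str(dst)


def restore_local(store_root: str, rel_path: str) -> str:
    """Move an object back out of _trash/ (the inverse rename)."""
    src = Path(store_root) / TRASH_PREFIX / rel_path
    dst = Path(store_root) / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.replace(src, dst)
    return str(dst)
```

Mirroring the object's relative path under `_trash/` keeps restore a pure inverse operation with no extra bookkeeping.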
Industry references
The pattern is well-established for non-transactional GC across an external store + a transactional DB:
- Cassandra tombstones with `gc_grace_seconds` (default 10 days)
- Databricks `VACUUM` with a retention period (default 7 days)
- S3 lifecycle soft-delete + permanent-delete
- POSIX deferred unlink when a file has open handles
In each case the grace window absorbs in-flight transactions that the GC can't see at scan time.
Scope
This is a design + implementation request. The right next step is a written spec covering:
- State persistence choice (`_trash/` prefix vs. `_gc_quarantine` table).
- Public API surface (`quarantine` / `purge` / `restore` / config keys).
- Concurrency model (re-check before purge, behavior on interrupted runs).
- Migration / compatibility (what `collect()` does going forward).
- Test plan including concurrent-insert race coverage.
Happy to draft that spec; flagging here so it doesn't get lost.