
Two-phase, transaction-safe garbage collection (quarantine -> grace -> purge) #1445

@kushalbakshi

Context

Follow-up to #1442 (data-loss bug in dj.gc.scan / collect, fixed in #1444). After that fix, scan correctly identifies referenced files. This issue covers the broader concern: GC is still not transaction-safe, even with the type mismatch resolved.

There is a time-of-check to time-of-use (TOCTOU) window between when scan_*_references enumerates references and when collect deletes orphans. If a concurrent transaction inserts a row referencing previously orphaned content during that window, collect deletes the file out from under it. This was raised by @dimitri-yatsenko during review of the #1442 root-cause analysis:

"In general, garbage collection is not 100% safe even after fixing since it's not transaction-safe. We need to implement a process of garbage retrieval for a bit and then a second step for complete removal."
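The race can be reproduced in a toy model by replaying the interleaving by hand. This is only an illustration: the sets and function names below are hypothetical stand-ins, not DataJoint APIs, and the "concurrent transaction" is simulated by a plain statement placed between the scan and the delete.

```python
# Toy model only: a set of stored paths and a set of referenced paths.
# None of these names are DataJoint APIs.

def scan_orphans(stored, referenced):
    """Time of check: snapshot the paths that no row references right now."""
    return set(stored) - set(referenced)


def collect(stored, orphans):
    """Time of use: delete whatever the earlier scan flagged."""
    for path in orphans:
        stored.discard(path)


stored = {"a.bin", "b.bin"}
referenced = {"a.bin"}

orphans = scan_orphans(stored, referenced)  # b.bin is an orphan at scan time

# A concurrent transaction commits between scan and collect and starts
# referencing content that was an orphan a moment ago.
referenced.add("b.bin")

collect(stored, orphans)

# The new row now points at a file that no longer exists.
dangling = referenced - stored
print(dangling)  # {'b.bin'}
```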

Proposed direction: two-phase API

Replace the single-call collect() with a quarantine → grace window → purge state machine. The single-call form can stay as a deprecated convenience, but production usage should be the two-phase form.
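A minimal in-memory sketch of that state machine, under stated assumptions: the class and method names are hypothetical, references are supplied by a caller-provided callable, and the clock is injectable so the grace window can be exercised without sleeping. It is meant to show the shape of the flow, not a proposed implementation.

```python
import time


class TwoPhaseGC:
    """In-memory sketch of quarantine -> grace -> purge (hypothetical names)."""

    def __init__(self, stored, get_references, clock=time.time):
        self.stored = stored                  # set of paths present in the store
        self.get_references = get_references  # callable returning current references
        self.clock = clock                    # injectable for testing
        self.quarantined = {}                 # path -> quarantined_at timestamp

    def quarantine(self):
        """Phase 1: mark current orphans. Nothing is deleted yet."""
        refs = self.get_references()
        now = self.clock()
        for path in self.stored - refs:
            # setdefault keeps the original timestamp if the run is repeated.
            self.quarantined.setdefault(path, now)
        return set(self.quarantined)

    def restore(self, path):
        """Explicit unquarantine."""
        self.quarantined.pop(path, None)

    def purge(self, grace_seconds):
        """Phase 2: re-check references, then delete items past the grace window."""
        refs = self.get_references()
        now = self.clock()
        purged = set()
        for path, since in list(self.quarantined.items()):
            if path in refs:
                self.restore(path)  # referenced during the window: rescue it
            elif now - since >= grace_seconds:
                self.stored.discard(path)
                del self.quarantined[path]
                purged.add(path)
        return purged


# Usage: a fake clock stands in for wall time.
now = [0.0]
stored = {"a.bin", "b.bin", "c.bin"}
referenced = {"a.bin"}
gc = TwoPhaseGC(stored, lambda: set(referenced), clock=lambda: now[0])

gc.quarantine()                     # b.bin and c.bin quarantined at t=0
referenced.add("b.bin")             # concurrent insert during the grace window
now[0] = 10.0
early = gc.purge(grace_seconds=60)  # b.bin rescued, c.bin still inside the window
now[0] = 100.0
late = gc.purge(grace_seconds=60)   # c.bin is past the window and unreferenced
```

With these inputs the early purge deletes nothing, and only c.bin is removed by the later purge.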

Phase 1: quarantine()

Two options for state persistence — both worth a design pass:

  • Storage prefix (_trash/). Move objects into a _trash/ prefix under the same store. Atomic per object. Restore is a move-back. Storage layout encodes the quarantine state directly; no separate metadata table to keep in sync.
  • Database table (_gc_quarantine). Record candidates in a project-level (cross-schema) table: (path, store, quarantined_at, source_schema, source_table, source_attribute). Storage layout is unchanged; restore leaves the object in place and deletes the row.
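As a rough illustration of the storage-prefix option, the sketch below uses the local filesystem only, where os.replace is an atomic rename within a single filesystem. The helper names are hypothetical, and nothing here speaks to how object stores or other backends would behave.

```python
import os
import tempfile
from pathlib import Path

TRASH = "_trash"  # prefix name taken from the proposal


def quarantine_object(root: Path, rel_path: str) -> Path:
    """Move root/rel_path to root/_trash/rel_path."""
    src = root / rel_path
    dst = root / TRASH / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.replace(src, dst)  # atomic rename on a single local filesystem
    return dst


def restore_object(root: Path, rel_path: str) -> Path:
    """Move root/_trash/rel_path back to root/rel_path."""
    src = root / TRASH / rel_path
    dst = root / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.replace(src, dst)
    return dst


def list_quarantined(root: Path):
    """The storage layout is the quarantine state: just list the prefix."""
    trash = root / TRASH
    if not trash.exists():
        return []
    return sorted(
        p.relative_to(trash).as_posix() for p in trash.rglob("*") if p.is_file()
    )


# Usage in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ab").mkdir()
    (root / "ab" / "blob.bin").write_bytes(b"payload")

    quarantine_object(root, "ab/blob.bin")
    after_quarantine = list_quarantined(root)

    restore_object(root, "ab/blob.bin")
    after_restore = list_quarantined(root)
    restored_ok = (root / "ab" / "blob.bin").read_bytes() == b"payload"
```

The table-based option would replace the two moves with an insert and a delete on _gc_quarantine and leave the files untouched.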

Phase 2: purge()

Phase 0 / convenience: restore(path_or_filter)

  • Explicit unquarantine for operator use: pulled-too-soon recovery, debugging, manual reclaim.

Open design questions

  • Cross-schema scope. Quarantine state spans every schema sharing the store. The _gc_quarantine table needs a project-level home (not per-schema), or each scan must look up quarantine state across schemas. The _trash/ prefix variant sidesteps this naturally.
  • Concurrent-insert handling. What happens if a row is inserted referencing a quarantined path during the grace window? Phase 2's re-check covers it, but should we also block the insert at write time? Probably not, since the re-check is cheaper than coordination, but it is worth deciding explicitly.
  • Recovery from interrupted runs. The state machine must be resumable: a quarantine() killed mid-run should leave the system in a defined state, and the next call should pick up where the previous one stopped.
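  • Storage backend uniformity. The _trash/ prefix needs atomic move semantics on every supported backend (local, S3, UC Volumes). fsspec exposes a move operation on most backends, but object stores such as S3 have no native rename and implement a move as copy-then-delete, so atomicity should be verified per backend.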
  • CLI ergonomics. dj.gc.quarantine(...) / dj.gc.purge(...) / dj.gc.restore(...) / dj.gc.format_quarantine_stats(...). Same *schemas + store_name shape as today.
  • Backwards compatibility. Keep collect(dry_run=False) as a single-call shorthand that does quarantine + purge in sequence (with grace_seconds=0), but emit a DeprecationWarning recommending the two-phase form for production. Default dry_run=True already protects against accidental runs.
  • Operational visibility. Quarantine listing / stats should be queryable: how much is currently quarantined, oldest item, stores touched, expected purge time.
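For the operational-visibility point, a summary over quarantine entries might look like the sketch below. The entry fields path, store and quarantined_at follow the table columns proposed above; size_bytes, the function name and the returned keys are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class QuarantineEntry:
    path: str
    store: str
    quarantined_at: datetime
    size_bytes: int  # assumed field, not part of the proposed table


def quarantine_stats(entries, grace_seconds):
    """Summarize what is quarantined and when the oldest item becomes purgeable."""
    if not entries:
        return {"count": 0, "total_bytes": 0, "stores": [],
                "oldest": None, "next_purge_at": None}
    oldest = min(e.quarantined_at for e in entries)
    return {
        "count": len(entries),
        "total_bytes": sum(e.size_bytes for e in entries),
        "stores": sorted({e.store for e in entries}),
        "oldest": oldest,
        "next_purge_at": oldest + timedelta(seconds=grace_seconds),
    }


# Usage with two entries quarantined a day apart and a 7-day grace window.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
entries = [
    QuarantineEntry("ab/one.bin", "main", t0, 1024),
    QuarantineEntry("cd/two.bin", "archive", t0 + timedelta(days=1), 2048),
]
stats = quarantine_stats(entries, grace_seconds=7 * 24 * 3600)
```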

Industry references

The pattern is well-established for non-transactional GC across an external store + a transactional DB:

  • Cassandra tombstones with gc_grace_seconds (default 10 days)
  • Databricks VACUUM with retention period (default 7 days)
  • S3 lifecycle soft-delete + permanent-delete
  • POSIX deferred unlink when a file has open handles

In each case the grace window absorbs in-flight transactions that the GC can't see at scan time.

Scope

This is a design + implementation request. The right next step is a written spec covering:

  1. State persistence choice (_trash/ prefix vs. _gc_quarantine table).
  2. Public API surface (quarantine / purge / restore / config keys).
  3. Concurrency model (re-check before purge, behavior on interrupted runs).
  4. Migration / compatibility (what collect() does going forward).
  5. Test plan including concurrent-insert race coverage.
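For item 4, one possible shape for the compatibility shim is sketched below. It is not the current dj.gc.collect signature: the scan, quarantine and purge callables are injected purely so the example is self-contained, and their names and arguments are hypothetical.

```python
import warnings


def collect(scan, quarantine, purge, dry_run=True):
    """Single-call shorthand: quarantine + purge with no grace window."""
    candidates = scan()
    if dry_run:
        # Default path: report only, touch nothing, no warning.
        return {"candidates": candidates, "purged": []}
    warnings.warn(
        "collect(dry_run=False) skips the grace window; "
        "call quarantine() and purge() separately in production.",
        DeprecationWarning,
        stacklevel=2,
    )
    quarantine(candidates)
    return {"candidates": candidates, "purged": purge(grace_seconds=0)}


# Usage with stub phases that only record how they were called.
calls = []


def fake_scan():
    return ["x.bin"]


def fake_quarantine(paths):
    calls.append(("quarantine", list(paths)))


def fake_purge(grace_seconds):
    calls.append(("purge", grace_seconds))
    return ["x.bin"]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    preview = collect(fake_scan, fake_quarantine, fake_purge)
    result = collect(fake_scan, fake_quarantine, fake_purge, dry_run=False)
```

The dry run returns the candidates without calling either phase; the destructive call emits one DeprecationWarning and runs both phases back to back.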

Happy to draft that spec; flagging here so it doesn't get lost.
