
fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output #1444

Open

kushalbakshi wants to merge 1 commit into datajoint:master from kushalbakshi:fix/1442-gc-scan-uses-decoded-values

Conversation

@kushalbakshi (Contributor)

Summary

Fixes #1442. gc.scan_hash_references and gc.scan_schema_references were calling table.to_arrays(attr_name), which routes through Expression.to_dicts → decode_attribute → the codec's decode(). For the storage codecs (<blob@>, <hash@>, <attach@>, <npy@>, <object@>) decode() returns the deserialized payload: a numpy.ndarray, an NpyRef, an ObjectRef, raw bytes, or a local file-path str. None of those satisfy _extract_*_refs's isinstance(value, dict) and "path" in value check, so both helpers silently returned empty reference sets and collect() would have classified live data as orphaned.
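
To make the silent type mismatch concrete, here is a minimal sketch of the shape check described above (illustrative only; looks_like_reference is a hypothetical name, and the real helpers live in gc.py):

```python
import numpy as np

def looks_like_reference(value):
    # _extract_*_refs only recognizes raw JSON metadata:
    # a dict carrying a "path" key.
    return isinstance(value, dict) and "path" in value

print(looks_like_reference(np.zeros(3)))  # False: decoded <blob@> payload
print(looks_like_reference({"path": "ab/cd.dat"}))  # True: raw stored metadata
```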

This PR replaces the call with table.proj(attr_name).cursor(as_dict=True) in both helpers. The cursor yields the raw JSON column value: a Python dict on PostgreSQL (JSONB auto-decoded) or a JSON string on MySQL — both already handled by _extract_*_refs (gc.py:138 string branch, gc.py:145 dict branch). Backend-agnostic, custom-codec-safe, and turns scan into a metadata-only operation (no more downloading every external blob just to discard the deserialized result).
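
As a hedged sketch (names approximate the change, not the verbatim gc.py diff), the metadata-only iteration looks roughly like this:

```python
import json

def iter_raw_metadata(table, attr_name):
    # Project down to the one column and read the raw JSON value,
    # bypassing codec decode() entirely.
    for row in table.proj(attr_name).cursor(as_dict=True):
        value = row[attr_name]
        if isinstance(value, str):  # MySQL hands back the JSON column as text
            value = json.loads(value)
        if isinstance(value, dict) and "path" in value:  # PostgreSQL JSONB arrives pre-decoded
            yield value["path"]
```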

Also registers gc in _lazy_modules (src/datajoint/__init__.py) so that dj.gc.scan(...) works; both the gc.py module docstring and the user docs at how-to/garbage-collection.md invoke it that way. The pattern matches the existing "diagram": (".diagram", None) entry (sketch below).
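
The registration itself is a one-line entry; a sketch, with the dict shape assumed from the existing diagram entry (other entries elided):

```python
# src/datajoint/__init__.py
_lazy_modules = {
    "diagram": (".diagram", None),
    "gc": (".gc", None),  # new: makes dj.gc.scan(...) resolve lazily
}
```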

Test plan

  • pytest tests/integration/test_gc.py -k "not TestScanWithLiveData" — 26 existing mocked tests pass unchanged (orchestration coverage stays intact).
  • pytest tests/integration/test_gc.py::TestScanWithLiveData -v — 3 new non-mocked e2e tests pass with the fix (a sketch of one follows this list):
    • test_scan_finds_active_blob_reference — <blob@> (decode → numpy.ndarray)
    • test_scan_finds_active_npy_reference — <npy@> (decode → NpyRef)
    • test_scan_finds_active_object_reference — <object@> (decode → ObjectRef)
  • Regression-proven: the same 3 e2e tests fail when gc.py is reverted to the buggy version. The failure mode reproduces the issue's exact symptom — hash_referenced: 0, schema_paths_referenced: 0, every file classified as orphaned.
  • pre-commit run (ruff, ruff-format, codespell) clean.
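
For orientation, a hedged sketch of what one of the new e2e tests looks like (the schema fixture and store wiring are assumptions, not the verbatim test code):

```python
import numpy as np
import datajoint as dj

def test_scan_finds_active_blob_reference(schema):  # `schema` fixture assumed
    @schema
    class Sample(dj.Manual):
        definition = """
        sample_id : int
        ---
        payload : <blob@local>
        """

    Sample.insert1({"sample_id": 1, "payload": np.arange(10)})
    stats = dj.gc.scan(schema, store_name="local")
    assert stats["hash_referenced"] > 0  # was 0 on the pre-fix code
```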

Test gap closed

tests/integration/test_gc.py:163-166 patches scan_hash_references, scan_schema_references, and list_*_paths directly, so the production path through to_arrays → decode_attribute was never exercised end-to-end. That's why this slipped through. The mocked tests stay (orchestration: composition with list_*_paths, dry-run vs real, stat-key shape, format strings). The new TestScanWithLiveData class fills the gap with one e2e test per structurally distinct decoded-value type.

Performance side effect

Before: scan downloaded every external blob just to deserialize and discard it via the silent type mismatch — a 1 TB schema produced 1 TB of egress to report referenced: 0. After: scan touches only the JSON column on the database during reference enumeration.

Custom-codec authors

Reference discovery now operates on raw stored metadata regardless of what a codec's decode() returns. Third-party codecs following the documented encode/decode contract get correct GC for free — no contract change required.

Out of scope

GC remains non-transaction-safe even after this fix — there is a TOCTOU window between scan and delete during which a concurrent transaction could insert a row referencing what looks like an orphan. A two-phase retrieval/removal API (quarantine → grace window → purge) is the right remedy and will be filed as a separate enhancement issue right after this PR opens.

🤖 Generated with Claude Code

fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output

`gc.scan_hash_references` and `gc.scan_schema_references` were calling
`table.to_arrays(attr_name)`, which routes through `Expression.to_dicts`
(`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for
each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`,
`<npy@>`, `<object@>`) this means downloading the bytes from external
storage and deserializing them into the codec's runtime type — a
`numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local
file-path string — none of which satisfy `_extract_*_refs`'s
`isinstance(value, dict) and "path" in value` check.

Both helpers therefore silently returned empty reference sets. Every
populated schema reported `hash_referenced: 0` and
`schema_paths_referenced: 0`, so every external file looked orphaned to
`scan()` and a subsequent `collect()` would have deleted live data. The
broad `try/except Exception` around the loop never fired because no
exception was raised — `_extract_*_refs` returns `[]` for unrecognized
shapes by design.

The intended design (per `reference/specs/type-system.md`) is for GC to
operate on the raw stored JSON metadata, not the decoded payload.
Replace `table.to_arrays(attr_name)` with
`table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor
yields the JSON column value directly: a Python dict on PostgreSQL
(JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs`
already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict
branch), so this is backend-agnostic with no adapter dispatch.

Side effect — `scan` is now a metadata-only operation. Previously it
downloaded every external blob just to deserialize and discard the
result via the silent type mismatch; on a 1 TB schema that meant 1 TB
of egress to produce `referenced: 0`. After this change, scan touches
only the JSON column on the database.

Custom-codec authors are unaffected: reference discovery operates on
the raw stored metadata regardless of what the codec's `decode()`
returns, so third-party codecs following the documented
`encode`/`decode` contract get correct GC for free.

Tests
-----

The existing `tests/integration/test_gc.py` mocks `scan_hash_references`,
`scan_schema_references`, and `list_*_paths` directly, so the production
code path through `to_arrays` → `decode_attribute` was never exercised
end-to-end. The mocked tests stay (they cover orchestration: composition
with `list_*_paths`, dry-run vs real, stat-key shape, format strings).

Add a `TestScanWithLiveData` class with three non-mocked end-to-end
tests, one per structurally distinct decoded-value type:

- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference`  — `<npy@>`  (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)

Each declares a small manual table, inserts one row, and asserts
`scan(schema, store_name='local')` reports the expected `*_referenced`
count > 0. Verified to fail on the pre-fix code:
`{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.

Adjacent
--------

Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The
`gc.py` module docstring and the user docs at
`how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`,
which previously raised `AttributeError` because `gc` wasn't lazily
exposed at the package level. Pattern matches the existing
`"diagram": (".diagram", None)` entry.

Out of scope
------------

GC remains non-transaction-safe even after this fix — there is a TOCTOU
window between scan and delete during which a concurrent transaction
could insert a row referencing what looks like an orphan. A two-phase
retrieval/removal API (quarantine → grace window → purge) is the right
remedy and will be tracked as a separate enhancement issue.

Fixes datajoint#1442

Development

Successfully merging this pull request may close these issues:

  • Critical: dj_gc.scan reports referenced: 0 for every storage column; collect would delete live data
