fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output (#1444)
kushalbakshi wants to merge 1 commit into datajoint:master
`gc.scan_hash_references` and `gc.scan_schema_references` were calling
`table.to_arrays(attr_name)`, which routes through `Expression.to_dicts`
(`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for
each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`,
`<npy@>`, `<object@>`) this means downloading the bytes from external
storage and deserializing them into the codec's runtime type — a
`numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local
file-path string — none of which satisfy `_extract_*_refs`'s
`isinstance(value, dict) and "path" in value` check.
Both helpers therefore silently returned empty reference sets. Every
populated schema reported `hash_referenced: 0` and
`schema_paths_referenced: 0`, so every external file looked orphaned to
`scan()` and a subsequent `collect()` would have deleted live data. The
broad `try/except Exception` around the loop never fired because no
exception was raised — `_extract_*_refs` returns `[]` for unrecognized
shapes by design.
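The silent failure can be sketched outside DataJoint. The helper below is a simplified stand-in for the `_extract_*_refs` shape check, not the library's actual code, and the metadata shape is an assumption for illustration:

```python
# Minimal sketch of why decoded codec output defeated the reference check.
# The real _extract_*_refs expects the raw stored JSON metadata dict.

def extract_refs(value):
    """Simplified stand-in for the dict-shape check in _extract_*_refs."""
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []  # unrecognized shape: silently empty, no exception raised

# What the column actually stores (shape assumed for illustration):
raw_metadata = {"path": "store/ab/cd/abcd1234", "size": 128}
# What decode() hands back for a storage codec, e.g. raw bytes:
decoded_payload = b"\x93NUMPY..."

assert extract_refs(raw_metadata) == ["store/ab/cd/abcd1234"]
assert extract_refs(decoded_payload) == []  # the silent empty result
```

Because the unrecognized-shape branch returns `[]` rather than raising, the surrounding `try/except` never had anything to catch.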
The intended design (per `reference/specs/type-system.md`) is for GC to
operate on the raw stored JSON metadata, not the decoded payload.
Replace `table.to_arrays(attr_name)` with
`table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor
yields the JSON column value directly: a Python dict on PostgreSQL
(JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs`
already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict
branch), so this is backend-agnostic with no adapter dispatch.
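The dual-shape handling can be sketched as follows. This is a simplified stand-in for `_extract_*_refs`, assuming a metadata shape with a `"path"` key, not the actual implementation:

```python
import json

def extract_refs(value):
    """Simplified sketch of backend-agnostic reference extraction:
    MySQL cursors yield the JSON column as a string, PostgreSQL JSONB
    arrives already decoded as a dict."""
    if isinstance(value, str):      # MySQL: raw JSON text
        value = json.loads(value)
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []

# The same stored metadata, as each backend's cursor would yield it:
mysql_row = '{"path": "store/ab/cd/abcd1234", "size": 128}'
postgres_row = {"path": "store/ab/cd/abcd1234", "size": 128}

assert extract_refs(mysql_row) == extract_refs(postgres_row)
```

Since both branches already existed in `gc.py`, the fix needs no per-backend adapter dispatch; the cursor's native output is accepted as-is.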
Side effect — `scan` is now a metadata-only operation. Previously it
downloaded every external blob just to deserialize and discard the
result via the silent type mismatch; on a 1 TB schema that meant 1 TB
of egress to produce `referenced: 0`. After this change, scan touches
only the JSON column on the database.
Custom-codec authors are unaffected: reference discovery operates on
the raw stored metadata regardless of what the codec's `decode()`
returns, so third-party codecs following the documented
`encode`/`decode` contract get correct GC for free.
Tests
-----
The existing `tests/integration/test_gc.py` mocks `scan_hash_references`,
`scan_schema_references`, and `list_*_paths` directly, so the production
code path through `to_arrays` → `decode_attribute` was never exercised
end-to-end. The mocked tests stay (they cover orchestration: composition
with `list_*_paths`, dry-run vs real, stat-key shape, format strings).
Add a `TestScanWithLiveData` class with three non-mocked end-to-end
tests, one per structurally distinct decoded-value type:
- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference` — `<npy@>` (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)
Each declares a small manual table, inserts one row, and asserts
`scan(schema, store_name='local')` reports the expected `*_referenced`
count > 0. Verified to fail on the pre-fix code:
`{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.
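One of the three tests might look like the following pseudocode sketch. The codec syntax, store name, and stat key are taken from this PR's description; the fixture name and table definition are assumptions, not the merged test code:

```python
# Pseudocode sketch, not the merged test (fixture and store names assumed).
import numpy as np
import datajoint as dj

def test_scan_finds_active_blob_reference(schema):  # `schema` fixture assumed
    @schema
    class BlobTable(dj.Manual):
        definition = """
        id : int
        ---
        data : <blob@local>
        """

    BlobTable.insert1({"id": 1, "data": np.arange(10)})
    stats = dj.gc.scan(schema, store_name="local")
    # Pre-fix this reported hash_referenced: 0 and the row's external
    # blob looked orphaned; post-fix the live reference is counted.
    assert stats["hash_referenced"] > 0
```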
Adjacent
--------
Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The
`gc.py` module docstring and the user docs at
`how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`,
which previously raised `AttributeError` because `gc` wasn't lazily
exposed at the package level. Pattern matches the existing
`"diagram": (".diagram", None)` entry.
Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU
window between scan and delete during which a concurrent transaction
could insert a row referencing what looks like an orphan. A two-phase
retrieval/removal API (quarantine → grace window → purge) is the right
remedy and will be tracked as a separate enhancement issue.
Fixes datajoint#1442
Summary
-------
Fixes #1442.
`gc.scan_hash_references` and `gc.scan_schema_references` were calling `table.to_arrays(attr_name)`, which routes through `Expression.to_dicts` → `decode_attribute` → the codec's `decode()`. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`, `<npy@>`, `<object@>`) that returns the deserialized payload: a `numpy.ndarray`, an `NpyRef`, an `ObjectRef`, `bytes`, or a local file-path `str`. None of those satisfy `_extract_*_refs`'s `isinstance(value, dict) and "path" in value` check, so both helpers silently returned empty reference sets and `collect()` would have classified live data as orphaned.

This PR replaces the call with `table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor yields the raw JSON column value: a Python dict on PostgreSQL (JSONB auto-decoded) or a JSON string on MySQL, both already handled by `_extract_*_refs` (`gc.py:138` string branch, `gc.py:145` dict branch). The fix is backend-agnostic, custom-codec-safe, and turns scan into a metadata-only operation (no more downloading every external blob just to discard the deserialized result).

It also registers `gc` in `_lazy_modules` (`src/datajoint/__init__.py`) so that `dj.gc.scan(...)` works as both the module docstring and the user docs at `how-to/garbage-collection.md` invoke it. The pattern matches the existing `"diagram": (".diagram", None)` entry.

Test plan
---------
- `pytest tests/integration/test_gc.py -k "not TestScanWithLiveData"`: 26 existing mocked tests pass unchanged (orchestration coverage stays intact).
- `pytest tests/integration/test_gc.py::TestScanWithLiveData -v`: 3 new non-mocked e2e tests pass with the fix:
  - `test_scan_finds_active_blob_reference`: `<blob@>` (decode → `numpy.ndarray`)
  - `test_scan_finds_active_npy_reference`: `<npy@>` (decode → `NpyRef`)
  - `test_scan_finds_active_object_reference`: `<object@>` (decode → `ObjectRef`)
- All three fail when `gc.py` is reverted to the buggy version. The failure mode reproduces the issue's exact symptom: `hash_referenced: 0`, `schema_paths_referenced: 0`, every file classified as orphaned.
- `pre-commit run` (ruff, ruff-format, codespell) clean.

Test gap closed
---------------
`tests/integration/test_gc.py:163-166` patches `scan_hash_references`, `scan_schema_references`, and `list_*_paths` directly, so the production path through `to_arrays` → `decode_attribute` was never exercised end-to-end. That's why this slipped through. The mocked tests stay (orchestration: composition with `list_*_paths`, dry-run vs real, stat-key shape, format strings). The new `TestScanWithLiveData` class fills the gap with one e2e test per structurally distinct decoded-value type.

Performance side-effect (free byproduct)
----------------------------------------
Before: scan downloaded every external blob just to deserialize and discard it via the silent type mismatch; a 1 TB schema produced 1 TB of egress to report `referenced: 0`. After: scan touches only the JSON column on the database during reference enumeration.

Custom-codec authors
--------------------
Reference discovery now operates on raw stored metadata regardless of what a codec's `decode()` returns. Third-party codecs following the documented `encode`/`decode` contract get correct GC for free; no contract change required.

Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU window between scan and delete during which a concurrent transaction could insert a row referencing what looks like an orphan. A two-phase retrieval/removal API (quarantine → grace window → purge) is the right remedy and will be filed as a separate enhancement issue right after this PR opens.