feat(metadata): sidecar CSVs for nested data + Firestore-only metadata transaction#152
Open
htsukamoto5 wants to merge 10 commits into
Open
feat(metadata): sidecar CSVs for nested data + Firestore-only metadata transaction#152htsukamoto5 wants to merge 10 commits into
htsukamoto5 wants to merge 10 commits into
Conversation
fix: improve error logging and handling for OSF uploads
fix: fall back to valid PAT when OAuth token refresh fails
Upload queue retry with automatic recovery and dashboard UI
Merge test into main
docs: add FAQ entry about the 32 MB request size limit
…tooling
DataPipe's vendored @jspsych/metadata was a June-2024 fork frozen at v0.0.1
that silently discarded nested object/array trial data. The fix lives on the
upstream main branch but is NOT in the published npm 0.0.3 (which has the same
data-loss bug and would silently drop all data given DataPipe's pre-parsed
input). Rather than depend on the stale npm release or live-track a moving
branch, vendor a PINNED upstream commit and rebuild from it, with tooling to
make future re-syncs a one-command, reviewable step.
- functions/scripts/sync-metadata.mjs (+ npm run sync:metadata): clone upstream
at a ref, build packages/metadata, copy the built dist + sanitized
package.json + LICENSE into functions/metadata/, and record provenance in
VENDORED_FROM.json. Strips the package's scripts (upstream's
prepare:"npm run build" would break `npm install` of the file: dep, since we
ship dist-only) while keeping the csv-parse runtime dep.
- .github/workflows/metadata-drift-check.yml: weekly non-blocking job that
opens/updates a tracking issue when upstream main moves past the pinned commit.
- functions/metadata/: now dist-only, pinned to upstream main 224d336. dist is
committed (deploys need no metadata build); .gitignore updated to un-ignore it.
- functions/package.json: dep stays file:metadata; add explicit typescript
devDep (the build had relied on it transitively via the removed fork).
- functions/src/metadata-production.ts: generate()'s 3rd arg is now a string
ext ('json'|'csv'), not the old boolean csv flag.
- metadata-production.test.js: fixture updated to real output (type -> @type,
numeric -> number); data-derived levels/min-max double as a silent-drop guard;
fixed a pre-existing aliasing bug in the options test.
Verified: metadata-production, metadata-update, metadata-process suites pass;
functions build (tsc) and npm install are clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
functions/metadata is now a pre-built vendored dist with no lockfile or build script, so the "npm ci && npm run build" step in that directory fails with EUSAGE. The dist is committed and installed as a file: dependency by the existing functions npm ci step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both suites used logs/testlog and each deletes it at test start; jest runs them in parallel workers, so the base64 suite's delete could wipe the data suite's saveData counter between write and read (doc exists via the base64 increment, saveData undefined). Rename the base64 suite's doc to base64-testlog, matching its other doc IDs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The entire metadata block (OSF list/download/upload calls included) ran inside db.runTransaction. Firestore retries the transaction callback on contention, which would re-run those OSF network calls — a latent duplicate-write risk. The transaction now only reads, merges, and writes the Firestore metadata doc; OSF I/O happens before (existence check) and after (mirror upload) the transaction. When the merge needs the OSF copy as its base (Firestore empty, OSF populated), the transaction aborts via a sentinel, the copy is downloaded outside it, and the transaction re-runs. Also fixes two pre-existing bugs in the process: - The Firestore-only branch checked putFileOSF's result with `errorCode !== 210`, which threw even on success (success returns errorCode null); now checks `!response.success`. - The OSF-only branch uploaded the unmerged incoming metadata to OSF while Firestore got the merged version; both now get the merged one. - The create branch's putFileOSF result was silently ignored; it is now checked like the other branches. blockMetadata now receives the OSF token api-data already resolved via resolveToken, instead of re-deriving it with duplicated logic that could trigger a second token refresh per submission. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The re-vendored @jspsych/metadata expands nested object/array trial fields (survey responses, mouse-tracking samples, ...) into dotted sub-variables and exposes the per-row data behind them. Until now DataPipe only used the variable descriptions, so the described data was not actually retrievable in tabular form. This mirrors what the standalone metadata CLI writes per data file: one sidecar CSV per extracted array column (rows keyed by the join keys + element_index) and per object column (one row per trial), named with the library's own Psych-DS helpers (deriveFallbackBase/deriveArrayFilename), placed in the same subfolder as the data file. Because the CLI's sidecars are per source file, per-submission sidecars are the exact incremental equivalent — no cross-session accumulation or dedup state is needed. Flow: produceMetadata() now returns the extraction results alongside the metadata; blockMetadata() builds the sidecar payloads and returns them; api-data uploads them only after the participant's data file itself lands in OSF (no orphan sidecars on 409), best-effort — a sidecar failure is queued in the existing uploadQueue (the retry worker already handles arbitrary filenames and treats 409 as done) and logged, never failing the submission. When the main file is queued because OSF is down, the sidecars are queued alongside it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This branch includes the three commits of #151 (
fix/metadata-version-drift) — review only the last two commits here (refactor(metadata)+feat(metadata)), or view the diff against the PR A branch. Base istestso CI runs; merge #151 first, then this.Part 1: Firestore-only metadata transaction (refactor)
blockMetadata's entire body — including every OSF network call — ran insidedb.runTransaction. Firestore retries the transaction callback on contention (e.g. two participants submitting concurrently), which would re-run OSF uploads: a latent duplicate-write risk. The transaction is now strictly read → merge → write on the Firestore metadata doc; OSF I/O happens before it (existence check) and after it (mirror upload). When the merge needs the OSF copy as its base (Firestore empty, OSF populated — first submission after a wipe), the transaction aborts via a sentinel, the copy is downloaded outside it, and the transaction re-runs.Fixes three pre-existing bugs found in the process:
putFileOSF's result witherrorCode !== 210, which throws even on success (success returnserrorCode: null), so that branch could never complete.putFileOSF's result entirely (silent metadata loss); it is now checked like the other branches.blockMetadataalso now receives the tokenapi-dataalready resolved viaresolveToken, instead of re-deriving it with ~25 lines of duplicated logic that could trigger a second token refresh per submission.Part 2: sidecar CSVs for nested array/object columns
The re-vendored library (from #151) expands nested trial fields — survey
responseobjects,mouse_tracking_dataarrays — into dotted sub-variables (response.Q0,mouse_tracking_data.x) and exposes the per-row data behind them. Previously the metadata described these variables while the data itself remained locked inside JSON blobs.DataPipe now writes what the standalone metadata CLI writes per data file: one sidecar CSV per extracted array column (rows keyed by join keys +
element_index) and per plain-object column (one row per trial). Since the CLI's sidecars are per source file, per-submission sidecars are the exact incremental equivalent — no cross-session accumulation or dedup state is needed.deriveFallbackBase+deriveArrayFilename+disambiguateArrayFilename), so DataPipe's sidecar names match the CLI's for the same data (e.g.abc123.jsonwith mouse tracking →subject-abc123_measure-mouseTrackingData_data.csv), placed in the same one-level subfolder as the data file. Zero custom naming code to maintain across re-vendors.uploadQueue(whose retry worker already handles arbitrary filenames and treats 409 as done) and logged — it never fails the submission. If the main file gets queued because OSF is down, the sidecars are queued alongside it.Testing
metadata-sidecars.test.jsunit suite (7 tests: naming convention, subfolder placement, CSV shape for arrays/objects, disambiguation, no-extension filenames, empty case).metadata-production.test.jsextended for the new return shape + extraction contents (dotted per-row data verified against the real library).tscbuild clean. Emulator suites (which pin each metadata-state branch's response message — preserved by the refactor) run in this PR's CI.🤖 Generated with Claude Code