feat(metadata): sidecar CSVs for nested data + Firestore-only metadata transaction by htsukamoto5 · Pull Request #152 · jspsych/datapipe

htsukamoto5 · 2026-07-02T23:30:58Z

⚠️ Stacked on #151

This branch includes the three commits of #151 (fix/metadata-version-drift) — review only the last two commits here (refactor(metadata) + feat(metadata)), or view the diff against the PR A branch. Base is test so CI runs; merge #151 first, then this.

Part 1: Firestore-only metadata transaction (refactor)

blockMetadata's entire body — including every OSF network call — ran inside db.runTransaction. Firestore retries the transaction callback on contention (e.g. two participants submitting concurrently), which would re-run OSF uploads: a latent duplicate-write risk. The transaction is now strictly read → merge → write on the Firestore metadata doc; OSF I/O happens before it (existence check) and after it (mirror upload). When the merge needs the OSF copy as its base (Firestore empty, OSF populated — first submission after a wipe), the transaction aborts via a sentinel, the copy is downloaded outside it, and the transaction re-runs.
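The retry hazard and the sentinel flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `runTransaction` here is a synchronous in-memory stand-in for `db.runTransaction` that simply re-runs its callback to imitate contention, the sentinel is modeled as a return value, and every name (`mergeMetadata`, `firestoreDoc`, `osfCopy`, the counters) is hypothetical.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from DataPipe or Firestore.
type Metadata = Record<string, unknown>;
type TxResult = { kind: "merged"; metadata: Metadata } | { kind: "needs-osf-base" };

const firestoreDoc: { metadata: Metadata | null } = { metadata: null }; // empty after a wipe
const osfCopy: Metadata = { name: "study", variables: 1 };              // OSF is populated
let osfDownloads = 0;
let osfUploads = 0;
let callbackRuns = 0;

// Synchronous stand-in for a transaction runner: it re-runs the callback
// `simulatedRetries` extra times to imitate retries under contention.
function runTransaction<T>(callback: () => T, simulatedRetries: number): T {
  let result = callback();
  for (let i = 0; i < simulatedRetries; i++) result = callback();
  return result;
}

function mergeMetadata(incoming: Metadata, osfExists: boolean): Metadata {
  // The OSF existence check is assumed to have happened already (the osfExists argument).
  let base: Metadata | null = null;
  for (;;) {
    const outcome = runTransaction<TxResult>(() => {
      callbackRuns++;
      // Only document reads, the merge, and the document write happen in here.
      const current = firestoreDoc.metadata ?? base;
      if (current === null && osfExists) return { kind: "needs-osf-base" }; // sentinel
      const merged = { ...(current ?? {}), ...incoming };
      firestoreDoc.metadata = merged;
      return { kind: "merged", metadata: merged };
    }, 1);

    if (outcome.kind === "merged") {
      osfUploads++; // mirror upload happens once, after the transaction
      return outcome.metadata;
    }
    osfDownloads++; // download the OSF copy outside the transaction, then re-run
    base = { ...osfCopy };
  }
}

const mergedMetadata = mergeMetadata({ variables: 2 }, true);
```

Because the callback only touches the document, re-running it is harmless: in this sketch the callback runs several times, while the simulated download and mirror upload each happen once.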

Fixes three pre-existing bugs found in the process:

- The Firestore-only branch checked putFileOSF's result with errorCode !== 210, which throws even on success (success returns errorCode: null), so that branch could never complete.
- The OSF-only branch uploaded the unmerged incoming metadata to OSF while Firestore got the merged version; both now receive the merged one.
- The create branch ignored putFileOSF's result entirely (silent metadata loss); it is now checked like the other branches.
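To make the first bug concrete, here is a small sketch of the two checks side by side. The `PutResult` shape is an assumption: the PR only states that success returns `errorCode: null` and (in the commit message) that the fix checks `!response.success`; the 409 value in the failing example is illustrative.

```typescript
// Assumed result shape for putFileOSF; only `success` and `errorCode` are
// mentioned in the PR, and the concrete values below are illustrative.
interface PutResult {
  success: boolean;
  errorCode: number | null;
}

// Old check: treats anything other than errorCode 210 as a failure.
const oldCheckFails = (r: PutResult): boolean => r.errorCode !== 210;

// New check: relies on the success flag instead.
const newCheckFails = (r: PutResult): boolean => !r.success;

const okResult: PutResult = { success: true, errorCode: null };
const failedResult: PutResult = { success: false, errorCode: 409 };

// null !== 210 is true, so the old check reports a failure on a successful upload.
const oldVerdictOnSuccess = oldCheckFails(okResult);
const newVerdictOnSuccess = newCheckFails(okResult);
const newVerdictOnFailure = newCheckFails(failedResult);
```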

blockMetadata also now receives the OSF token that api-data has already resolved via resolveToken, instead of re-deriving it with ~25 lines of duplicated logic that could trigger a second token refresh per submission.

Part 2: sidecar CSVs for nested array/object columns

The re-vendored library (from #151) expands nested trial fields — survey response objects, mouse_tracking_data arrays — into dotted sub-variables (response.Q0, mouse_tracking_data.x) and exposes the per-row data behind them. Previously the metadata described these variables while the data itself remained locked inside JSON blobs.

DataPipe now writes what the standalone metadata CLI writes per data file: one sidecar CSV per extracted array column (rows keyed by join keys + element_index) and per plain-object column (one row per trial). Since the CLI's sidecars are per source file, per-submission sidecars are the exact incremental equivalent — no cross-session accumulation or dedup state is needed.
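As a rough illustration of the array-column shape (join keys plus element_index), the sketch below expands one nested column into sidecar rows. It is not the library's extraction code: `arraySidecarCsv` and `csvCell` are hypothetical helpers, `trial_index` is assumed to be the only join key, and array elements are assumed to be flat objects with the same keys.

```typescript
// Hypothetical trial shape: trial_index stands in for the join keys.
type Trial = { trial_index: number; [column: string]: unknown };

// Minimal CSV cell escaping: quote cells containing commas, quotes, or newlines.
function csvCell(value: unknown): string {
  const s = value === null || value === undefined ? "" : String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// One row per array element, keyed by trial_index + element_index.
function arraySidecarCsv(trials: Trial[], column: string): string {
  const rows: Record<string, unknown>[] = [];
  for (const trial of trials) {
    const value = trial[column];
    if (!Array.isArray(value)) continue; // trials without the array contribute no rows
    value.forEach((element, element_index) => {
      rows.push({ trial_index: trial.trial_index, element_index, ...(element as object) });
    });
  }
  if (rows.length === 0) return ""; // flat data: nothing to write
  const header = Object.keys(rows[0]);
  const lines: unknown[][] = [header, ...rows.map((row) => header.map((key) => row[key]))];
  return lines.map((line) => line.map(csvCell).join(",")).join("\n");
}

const trials: Trial[] = [
  { trial_index: 0, mouse_tracking_data: [{ x: 1, y: 2 }, { x: 3, y: 4 }] },
  { trial_index: 1, rt: 500 },
];
const sidecarCsv = arraySidecarCsv(trials, "mouse_tracking_data");
// sidecarCsv:
// trial_index,element_index,x,y
// 0,0,1,2
// 0,1,3,4
```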

- Naming reuses the library's own Psych-DS helpers (deriveFallbackBase + deriveArrayFilename + disambiguateArrayFilename), so DataPipe's sidecar names match the CLI's for the same data (e.g. abc123.json with mouse tracking → subject-abc123_measure-mouseTrackingData_data.csv), placed in the same one-level subfolder as the data file. There is no custom naming code to maintain across re-vendors.
- Ordering: sidecars upload only after the participant's data file lands in OSF, so a 409 on the main file can't leave orphan sidecars.
- Best-effort: the data file is already safe and sidecars are derivable from it, so a sidecar failure is queued in the existing uploadQueue (whose retry worker already handles arbitrary filenames and treats 409 as done) and logged; it never fails the submission. If the main file gets queued because OSF is down, the sidecars are queued alongside it.
- Flat data (or CSV submissions) produce no sidecars; nothing changes for those experiments.
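The ordering and best-effort rules above can be sketched as a single control flow. This is a simplified, synchronous model under stated assumptions, not the api-data implementation: `submitWithSidecars`, the `Upload`/`UploadResult` shapes, the array used as a queue, and the use of any non-409 failure to stand for "OSF is down" are all hypothetical.

```typescript
// Hypothetical shapes; the real upload path is asynchronous and returns more detail.
type Upload = { filename: string; body: string };
type UploadResult = { success: boolean; errorCode: number | null };

function submitWithSidecars(
  dataFile: Upload,
  sidecars: Upload[],
  upload: (file: Upload) => UploadResult,
  uploadQueue: Upload[], // stand-in for the existing uploadQueue
): { mainUploaded: boolean; sidecarsUploaded: string[] } {
  const main = upload(dataFile);
  if (!main.success) {
    // 409 on the main file: no sidecars are uploaded, so none can be orphaned.
    // Any other failure is treated here as an outage: queue main file and sidecars together.
    if (main.errorCode !== 409) uploadQueue.push(dataFile, ...sidecars);
    return { mainUploaded: false, sidecarsUploaded: [] };
  }

  const sidecarsUploaded: string[] = [];
  for (const sidecar of sidecars) {
    if (upload(sidecar).success) {
      sidecarsUploaded.push(sidecar.filename);
    } else {
      // Best-effort: log and queue for retry, never fail the submission.
      console.warn(`sidecar upload failed, queued for retry: ${sidecar.filename}`);
      uploadQueue.push(sidecar);
    }
  }
  return { mainUploaded: true, sidecarsUploaded };
}

// Usage: the main file lands, the sidecar upload fails and is queued.
const queue: Upload[] = [];
const dataFile: Upload = { filename: "abc123.json", body: "[]" };
const sidecar: Upload = { filename: "subject-abc123_measure-mouseTrackingData_data.csv", body: "" };
const outcome = submitWithSidecars(
  dataFile,
  [sidecar],
  (file) => ({ success: file === dataFile, errorCode: file === dataFile ? null : 500 }),
  queue,
);
```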

Testing

- New metadata-sidecars.test.js unit suite (7 tests: naming convention, subfolder placement, CSV shape for arrays/objects, disambiguation, no-extension filenames, empty case).
- metadata-production.test.js extended for the new return shape + extraction contents (dotted per-row data verified against the real library).
- All 4 metadata unit suites pass locally (23 tests); tsc build clean. Emulator suites (which pin each metadata-state branch's response message, preserved by the refactor) run in this PR's CI.

🤖 Generated with Claude Code

@type

…tooling

DataPipe's vendored @jspsych/metadata was a June-2024 fork frozen at v0.0.1 that silently discarded nested object/array trial data. The fix lives on the upstream main branch but is NOT in the published npm 0.0.3 (which has the same data-loss bug and would silently drop all data given DataPipe's pre-parsed input). Rather than depend on the stale npm release or live-track a moving branch, vendor a PINNED upstream commit and rebuild from it, with tooling to make future re-syncs a one-command, reviewable step.

- functions/scripts/sync-metadata.mjs (+ npm run sync:metadata): clone upstream at a ref, build packages/metadata, copy the built dist + sanitized package.json + LICENSE into functions/metadata/, and record provenance in VENDORED_FROM.json. Strips the package's scripts (upstream's prepare:"npm run build" would break `npm install` of the file: dep, since we ship dist-only) while keeping the csv-parse runtime dep.
- .github/workflows/metadata-drift-check.yml: weekly non-blocking job that opens/updates a tracking issue when upstream main moves past the pinned commit.
- functions/metadata/: now dist-only, pinned to upstream main 224d336. dist is committed (deploys need no metadata build); .gitignore updated to un-ignore it.
- functions/package.json: dep stays file:metadata; add explicit typescript devDep (the build had relied on it transitively via the removed fork).
- functions/src/metadata-production.ts: generate()'s 3rd arg is now a string ext ('json'|'csv'), not the old boolean csv flag.
- metadata-production.test.js: fixture updated to real output (type -> @type, numeric -> number); data-derived levels/min-max double as a silent-drop guard; fixed a pre-existing aliasing bug in the options test.

Verified: metadata-production, metadata-update, metadata-process suites pass; functions build (tsc) and npm install are clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

functions/metadata is now a pre-built vendored dist with no lockfile or build script, so the "npm ci && npm run build" step in that directory fails with EUSAGE. The dist is committed and installed as a file: dependency by the existing functions npm ci step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Both suites used logs/testlog and each deletes it at test start; jest runs them in parallel workers, so the base64 suite's delete could wipe the data suite's saveData counter between write and read (doc exists via the base64 increment, saveData undefined). Rename the base64 suite's doc to base64-testlog, matching its other doc IDs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The entire metadata block (OSF list/download/upload calls included) ran inside db.runTransaction. Firestore retries the transaction callback on contention, which would re-run those OSF network calls: a latent duplicate-write risk. The transaction now only reads, merges, and writes the Firestore metadata doc; OSF I/O happens before (existence check) and after (mirror upload) the transaction. When the merge needs the OSF copy as its base (Firestore empty, OSF populated), the transaction aborts via a sentinel, the copy is downloaded outside it, and the transaction re-runs.

Also fixes three pre-existing bugs in the process:

- The Firestore-only branch checked putFileOSF's result with `errorCode !== 210`, which threw even on success (success returns errorCode null); now checks `!response.success`.
- The OSF-only branch uploaded the unmerged incoming metadata to OSF while Firestore got the merged version; both now get the merged one.
- The create branch's putFileOSF result was silently ignored; it is now checked like the other branches.

blockMetadata now receives the OSF token that api-data has already resolved via resolveToken, instead of re-deriving it with duplicated logic that could trigger a second token refresh per submission.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The re-vendored @jspsych/metadata expands nested object/array trial fields (survey responses, mouse-tracking samples, ...) into dotted sub-variables and exposes the per-row data behind them. Until now DataPipe only used the variable descriptions, so the described data was not actually retrievable in tabular form.

This mirrors what the standalone metadata CLI writes per data file: one sidecar CSV per extracted array column (rows keyed by the join keys + element_index) and per object column (one row per trial), named with the library's own Psych-DS helpers (deriveFallbackBase/deriveArrayFilename), placed in the same subfolder as the data file. Because the CLI's sidecars are per source file, per-submission sidecars are the exact incremental equivalent: no cross-session accumulation or dedup state is needed.

Flow: produceMetadata() now returns the extraction results alongside the metadata; blockMetadata() builds the sidecar payloads and returns them; api-data uploads them only after the participant's data file itself lands in OSF (no orphan sidecars on 409). Sidecar uploads are best-effort: a failure is queued in the existing uploadQueue (the retry worker already handles arbitrary filenames and treats 409 as done) and logged, never failing the submission. When the main file is queued because OSF is down, the sidecars are queued alongside it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jodeleeuw and others added 10 commits March 28, 2026 08:36

Merge pull request #140 from jspsych/test

a9d2bad

fix: improve error logging and handling for OSF uploads

Merge pull request #141 from jspsych/test

9b032f9

fix: fall back to valid PAT when OAuth token refresh fails

Merge pull request #143 from jspsych/test

b17a4df

Upload queue retry with automatic recovery and dashboard UI

Merge pull request #147 from jspsych/test

775b299

Merge test into main

Merge pull request #149 from jspsych/test

ff9bf07

docs: add FAQ entry about the 32 MB request size limit

htsukamoto5 wants to merge 10 commits into test from feat/metadata-sidecars
