Skip to content

feat(metadata): sidecar CSVs for nested data + Firestore-only metadata transaction#152

Open
htsukamoto5 wants to merge 10 commits into
testfrom
feat/metadata-sidecars
Open

feat(metadata): sidecar CSVs for nested data + Firestore-only metadata transaction#152
htsukamoto5 wants to merge 10 commits into
testfrom
feat/metadata-sidecars

Conversation

@htsukamoto5

Copy link
Copy Markdown
Member

⚠️ Stacked on #151

This branch includes the three commits of #151 (fix/metadata-version-drift) — review only the last two commits here (refactor(metadata) + feat(metadata)), or view the diff against the PR A branch. Base is test so CI runs; merge #151 first, then this.

Part 1: Firestore-only metadata transaction (refactor)

blockMetadata's entire body — including every OSF network call — ran inside db.runTransaction. Firestore retries the transaction callback on contention (e.g. two participants submitting concurrently), which would re-run OSF uploads: a latent duplicate-write risk. The transaction is now strictly read → merge → write on the Firestore metadata doc; OSF I/O happens before it (existence check) and after it (mirror upload). When the merge needs the OSF copy as its base (Firestore empty, OSF populated — first submission after a wipe), the transaction aborts via a sentinel, the copy is downloaded outside it, and the transaction re-runs.

Fixes three pre-existing bugs found in the process:

  • The Firestore-only branch checked putFileOSF's result with errorCode !== 210, which throws even on success (success returns errorCode: null), so that branch could never complete.
  • The OSF-only branch uploaded the unmerged incoming metadata to OSF while Firestore got the merged version; both now receive the merged one.
  • The create branch ignored putFileOSF's result entirely (silent metadata loss); it is now checked like the other branches.

blockMetadata also now receives the token api-data already resolved via resolveToken, instead of re-deriving it with ~25 lines of duplicated logic that could trigger a second token refresh per submission.

Part 2: sidecar CSVs for nested array/object columns

The re-vendored library (from #151) expands nested trial fields — survey response objects, mouse_tracking_data arrays — into dotted sub-variables (response.Q0, mouse_tracking_data.x) and exposes the per-row data behind them. Previously the metadata described these variables while the data itself remained locked inside JSON blobs.

DataPipe now writes what the standalone metadata CLI writes per data file: one sidecar CSV per extracted array column (rows keyed by join keys + element_index) and per plain-object column (one row per trial). Since the CLI's sidecars are per source file, per-submission sidecars are the exact incremental equivalent — no cross-session accumulation or dedup state is needed.

  • Naming reuses the library's own Psych-DS helpers (deriveFallbackBase + deriveArrayFilename + disambiguateArrayFilename), so DataPipe's sidecar names match the CLI's for the same data (e.g. abc123.json with mouse tracking → subject-abc123_measure-mouseTrackingData_data.csv), placed in the same one-level subfolder as the data file. Zero custom naming code to maintain across re-vendors.
  • Ordering: sidecars upload only after the participant's data file lands in OSF — a 409 on the main file can't leave orphan sidecars.
  • Best-effort: the data file is already safe and sidecars are derivable from it, so a sidecar failure is queued in the existing uploadQueue (whose retry worker already handles arbitrary filenames and treats 409 as done) and logged — it never fails the submission. If the main file gets queued because OSF is down, the sidecars are queued alongside it.
  • Flat data (or CSV submissions) produce no sidecars; nothing changes for those experiments.

Testing

  • New metadata-sidecars.test.js unit suite (7 tests: naming convention, subfolder placement, CSV shape for arrays/objects, disambiguation, no-extension filenames, empty case).
  • metadata-production.test.js extended for the new return shape + extraction contents (dotted per-row data verified against the real library).
  • All 4 metadata unit suites pass locally (23 tests); tsc build clean. Emulator suites (which pin each metadata-state branch's response message — preserved by the refactor) run in this PR's CI.

🤖 Generated with Claude Code

jodeleeuw and others added 10 commits March 28, 2026 08:36
fix: improve error logging and handling for OSF uploads
fix: fall back to valid PAT when OAuth token refresh fails
Upload queue retry with automatic recovery and dashboard UI
docs: add FAQ entry about the 32 MB request size limit
…tooling

DataPipe's vendored @jspsych/metadata was a June-2024 fork frozen at v0.0.1
that silently discarded nested object/array trial data. The fix lives on the
upstream main branch but is NOT in the published npm 0.0.3 (which has the same
data-loss bug and would silently drop all data given DataPipe's pre-parsed
input). Rather than depend on the stale npm release or live-track a moving
branch, vendor a PINNED upstream commit and rebuild from it, with tooling to
make future re-syncs a one-command, reviewable step.

- functions/scripts/sync-metadata.mjs (+ npm run sync:metadata): clone upstream
  at a ref, build packages/metadata, copy the built dist + sanitized
  package.json + LICENSE into functions/metadata/, and record provenance in
  VENDORED_FROM.json. Strips the package's scripts (upstream's
  prepare:"npm run build" would break `npm install` of the file: dep, since we
  ship dist-only) while keeping the csv-parse runtime dep.
- .github/workflows/metadata-drift-check.yml: weekly non-blocking job that
  opens/updates a tracking issue when upstream main moves past the pinned commit.
- functions/metadata/: now dist-only, pinned to upstream main 224d336. dist is
  committed (deploys need no metadata build); .gitignore updated to un-ignore it.
- functions/package.json: dep stays file:metadata; add explicit typescript
  devDep (the build had relied on it transitively via the removed fork).
- functions/src/metadata-production.ts: generate()'s 3rd arg is now a string
  ext ('json'|'csv'), not the old boolean csv flag.
- metadata-production.test.js: fixture updated to real output (type -> @type,
  numeric -> number); data-derived levels/min-max double as a silent-drop guard;
  fixed a pre-existing aliasing bug in the options test.

Verified: metadata-production, metadata-update, metadata-process suites pass;
functions build (tsc) and npm install are clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
functions/metadata is now a pre-built vendored dist with no lockfile or
build script, so the "npm ci && npm run build" step in that directory
fails with EUSAGE. The dist is committed and installed as a file:
dependency by the existing functions npm ci step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both suites used logs/testlog and each deletes it at test start; jest
runs them in parallel workers, so the base64 suite's delete could wipe
the data suite's saveData counter between write and read (doc exists
via the base64 increment, saveData undefined). Rename the base64
suite's doc to base64-testlog, matching its other doc IDs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The entire metadata block (OSF list/download/upload calls included) ran
inside db.runTransaction. Firestore retries the transaction callback on
contention, which would re-run those OSF network calls — a latent
duplicate-write risk. The transaction now only reads, merges, and writes
the Firestore metadata doc; OSF I/O happens before (existence check) and
after (mirror upload) the transaction. When the merge needs the OSF copy
as its base (Firestore empty, OSF populated), the transaction aborts via
a sentinel, the copy is downloaded outside it, and the transaction
re-runs.

Also fixes two pre-existing bugs in the process:
- The Firestore-only branch checked putFileOSF's result with
  `errorCode !== 210`, which threw even on success (success returns
  errorCode null); now checks `!response.success`.
- The OSF-only branch uploaded the unmerged incoming metadata to OSF
  while Firestore got the merged version; both now get the merged one.
- The create branch's putFileOSF result was silently ignored; it is now
  checked like the other branches.

blockMetadata now receives the OSF token api-data already resolved via
resolveToken, instead of re-deriving it with duplicated logic that could
trigger a second token refresh per submission.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The re-vendored @jspsych/metadata expands nested object/array trial
fields (survey responses, mouse-tracking samples, ...) into dotted
sub-variables and exposes the per-row data behind them. Until now
DataPipe only used the variable descriptions, so the described data was
not actually retrievable in tabular form. This mirrors what the
standalone metadata CLI writes per data file: one sidecar CSV per
extracted array column (rows keyed by the join keys + element_index)
and per object column (one row per trial), named with the library's own
Psych-DS helpers (deriveFallbackBase/deriveArrayFilename), placed in
the same subfolder as the data file.

Because the CLI's sidecars are per source file, per-submission sidecars
are the exact incremental equivalent — no cross-session accumulation or
dedup state is needed.

Flow: produceMetadata() now returns the extraction results alongside
the metadata; blockMetadata() builds the sidecar payloads and returns
them; api-data uploads them only after the participant's data file
itself lands in OSF (no orphan sidecars on 409), best-effort — a
sidecar failure is queued in the existing uploadQueue (the retry worker
already handles arbitrary filenames and treats 409 as done) and logged,
never failing the submission. When the main file is queued because OSF
is down, the sidecars are queued alongside it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants