feat(cli,frontend): preserve original under data/raw/ whenever output is transformed #133

Open
Mandyx22 wants to merge 1 commit into fix/csv-relax-quotes from feat/preserve-raw-on-transform

@Mandyx22

Contributor

Summary

Broadens raw-original preservation beyond JSON. Previously both the CLI (processFile) and the web wizard (runGenerate) kept the untouched input under data/raw/ only for JSON/JSONL (CSV was assumed to have no separate raw form). This changes the trigger to "the input was transformed" — i.e. preserve the original whenever the Psych-DS output is not a verbatim, same-named copy of the input:

transformed = isJsonDataExt(ext) || !csvVerbatimEligible || mainName !== inputName

Newly covered for CSV inputs:

  • Renamed: a clean CSV whose content is unchanged but whose filename was changed to a Psych-DS-compliant name (e.g. mydata.csv → subject-x_data.csv). The original is kept under its old name. (Near-universal for real OSF data, so this is the high-impact case.)
  • Re-serialised — a CSV that only parsed thanks to quote relaxation (#132) and is therefore rewritten, so the bytes change. The malformed original is preserved.

A clean CSV written byte-for-byte under its own already-compliant name is the only no-raw case, so nothing is duplicated. The existing flat-dir disambiguation and root .psychds-ignore (so the validator skips data/raw/) apply to these preserved CSVs too.
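
As a rough illustration of the trigger described above, the predicate could be sketched as follows. This is not the code from the PR: the body of isJsonDataExt here is a hypothetical stand-in (the real helper is not shown in this description), the TransformInputs shape is invented for the example, and it assumes extensions carry a leading dot.

```typescript
// Hypothetical stand-in for the project's isJsonDataExt helper.
// Assumption: extensions include the leading dot and JSON/JSONL are the JSON data formats.
function isJsonDataExt(ext: string): boolean {
  const e = ext.toLowerCase();
  return e === ".json" || e === ".jsonl";
}

// Illustrative input shape; these field names mirror the terms in the formula above.
interface TransformInputs {
  ext: string;                  // extension of the input file
  csvVerbatimEligible: boolean; // true when the CSV can be written byte-for-byte
  mainName: string;             // filename of the main Psych-DS output
  inputName: string;            // filename of the original input
}

// The output counts as transformed unless it is a verbatim, same-named copy of the input.
function isTransformed(i: TransformInputs): boolean {
  return isJsonDataExt(i.ext) || !i.csvVerbatimEligible || i.mainName !== i.inputName;
}

// The three CSV cases from the summary.
const renamed = isTransformed({
  ext: ".csv", csvVerbatimEligible: true,
  mainName: "subject-x_data.csv", inputName: "mydata.csv",
});
const reserialised = isTransformed({
  ext: ".csv", csvVerbatimEligible: false,
  mainName: "subject-x_data.csv", inputName: "subject-x_data.csv",
});
const verbatim = isTransformed({
  ext: ".csv", csvVerbatimEligible: true,
  mainName: "subject-x_data.csv", inputName: "subject-x_data.csv",
});
```

Under these assumptions, renamed and reserialised are true (the original would be preserved) and verbatim is false (the only no-raw case).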

Note

Stacked on #132 (fix/csv-relax-quotes) — base this PR against that branch. This intentionally changes the output layout of currently-working datasets (renamed CSVs now grow a data/raw/ dir + .psychds-ignore), which is why it's a separate PR from the tight #132 bug fix.

Implementation

  • CLI (packages/cli/src/data.ts): capture mainOutputName from whichever write path ran (planned → planned.mainName; non-planned → the built main file), move the raw-write block to after the output is written, and replace the isJsonDataExt gate with the transformed check.
  • Frontend (packages/frontend/src/pages/DataUpload.tsx): hoist csvVerbatimEligible out of the CSV branch and compare the built main filename against file.name.

BOM-only changes are deliberately not treated as transformed — the preserved content is already BOM-stripped, so preserving it would be pointless.
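
The reordering described for the CLI (write the main output first, then decide whether to keep the original) might look roughly like the sketch below. It is only an approximation under stated assumptions: planWrites is a hypothetical name, the paths assume the main output lands directly in data/, and the flat-dir disambiguation and the contents of .psychds-ignore are left out because this description does not spell them out.

```typescript
// Hypothetical helper: returns the paths that would be written, in order.
// Assumption: the main output goes to data/<mainOutputName> and the original
// is kept under its old name in data/raw/.
function planWrites(
  inputName: string,
  mainOutputName: string,
  transformed: boolean,
): string[] {
  // The main output is planned first, so its final name is known
  // before the raw-preservation decision is made.
  const paths: string[] = [`data/${mainOutputName}`];

  if (transformed) {
    // Keep the untouched original under its original filename.
    paths.push(`data/raw/${inputName}`);
    // Root-level ignore file so the validator skips data/raw/.
    paths.push(".psychds-ignore");
  }
  return paths;
}

// A renamed CSV grows a data/raw/ copy plus the ignore file.
const renamedPlan = planWrites("mydata.csv", "subject-x_data.csv", true);

// An already-compliant verbatim CSV produces only the main output.
const verbatimPlan = planWrites("subject-x_data.csv", "subject-x_data.csv", false);
```

Deciding after the output name is known is what makes the mainName !== inputName comparison possible in the first place.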

Testing

  • 3 CLI tests (data.test.ts): renamed CSV preserved; already-compliant verbatim CSV gets no data/raw/; re-serialised CSV preserved even with an unchanged name.
  • 3 frontend component tests (DataUpload.test.tsx): drive the real component and inspect convertedStore.paths().
  • Full suite: 664 passing.
  • Manually verified end-to-end on the real OSF phxq4 dataset (the unquoted-stimulus-HTML files that previously failed to parse): all files read, re-serialised output in data/, malformed originals preserved in data/raw/, .psychds-ignore written. Also confirmed clean DataPipe CSVs: renamed → preserved; already-compliant → not preserved.

🤖 Generated with Claude Code

… is transformed

Broaden raw-original preservation beyond JSON. Previously both the CLI
(processFile) and the web wizard (runGenerate) kept the untouched input
under data/raw/ only for JSON/JSONL. Now they preserve it whenever the
Psych-DS output is not a verbatim, same-named copy of the input:

  transformed = isJsonDataExt(ext) || !csvVerbatimEligible || mainName !== inputName

This newly covers CSV inputs that were renamed to a compliant name
(e.g. mydata.csv -> subject-x_data.csv) and CSV inputs that had to be
re-serialised (malformed quotes repaired in #132). A clean CSV written
byte-for-byte under its own compliant name is the only no-raw case, so
nothing is duplicated. The existing flat-dir disambiguation and root
.psychds-ignore (so the validator skips data/raw/) apply to these too.

Tested end-to-end on the real OSF phxq4 dataset (unquoted stimulus HTML
with literal quotes): files read, re-serialised output in data/, malformed
originals preserved in data/raw/. Adds 3 CLI + 3 frontend tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@changeset-bot

changeset-bot Bot commented Jun 29, 2026

🦋 Changeset detected

Latest commit: 74cf5a0

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@jspsych/metadata-cli Patch
