fix(results): handle replay row schema aliases#1400
Conversation
Review FindingsP1 - Adjacent trace tests fail after the invalid-score error changed
Verification
Note: |
Deploying agentv with
|
| Latest commit: |
239ab03
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1c0dafa8.agentv.pages.dev |
| Branch Preview URL: | https://av-results-row-schema-cleanu.agentv.pages.dev |
|
Addressed the blocking trace-test comment in Decision: preserved the trace-specific invalid-score contract for standalone trace/result JSONL loaded through Validation:
|
|
CI Test job follow-up after c234fd0 / failed job https://github.com/EntityProcess/agentv/actions/runs/27683502887/job/81876531953 Root cause from the log:
Fix in
Validation:
|
Re-review resultNo blockers remain for the prior results-schema feedback. Verified the requested points:
Local verification:
GitHub checks on the latest commit are green: Build, Test, Typecheck, Lint, Validate Evals, Validate Marketplace, Check Links, and Cloudflare Pages all pass. Residual risk: this was a targeted re-review of the prior blockers rather than a full functional UAT pass. Given the local targeted coverage plus green CI, I do not see remaining blocking risk for PR #1400. |
Summary
Historical replay-shaped result rows now hydrate consistently across
resultsandinspect: failed rows keep theirtestId, timing/token fields are preserved, andinspect filter --has-toolseestrace.toolCallsaliases.The accepted results command input remains a run workspace or
index.jsonlmanifest. Canonical rows are still snake_case; this PR normalizes only the known historical camelCase aliases at the file boundary, then rejects eval-case-only JSONL with migration guidance instead of treating it as anunknownfailed result.Validation
bun test apps/cli/test/commands/resultsbun test apps/cli/test/commands/eval/artifact-writer.test.tsbun test packages/core/test/evaluation/trace-summary.test.tsbun run typecheckbun run lintbun run typecheckandbun run lintsuccessfully.Red/Green UAT
Red, before the fix on the same replay-shaped row:
results summaryreturnedfailed_test_ids: ["unknown"],total_duration_ms: 0, andtotal_tokens: 0.inspect filter --has-tool rgreturned[].Green, after the fix:
bun apps/cli/src/cli.ts results summary apps/cli/test/fixtures/results/camel-replayreturnsfailed_test_ids: ["wtg-replay-fail"],total_duration_ms: 1234, andtotal_tokens: 15.bun apps/cli/src/cli.ts inspect filter apps/cli/test/fixtures/results/camel-replay/index.jsonl --has-tool rg --format jsonreturns thewtg-replay-failrow withtool_names: ["rg"].Unsupported result row ... Eval-case JSONL is input data, not a results artifact. Run agentv eval <eval-file> --output <run-dir>....Post-Deploy Monitoring & Validation
No additional production monitoring required; this changes local CLI parsing of result artifacts only. After release, validate by running
agentv results summaryandagentv inspect filter --has-toolagainst a historical replay manifest if one is available. Watch CI and user reports forUnsupported result row,failed_test_idscontainingunknown, or tool filters failing to match known trace summaries.