
feat(trace): settle canonical trace projection#1401

Merged
christso merged 4 commits into main from av-trace-canonical-projection on Jun 17, 2026

@christso christso commented Jun 17, 2026

Collaborator

Summary

AgentV execution trace sidecars now publish under the canonical agentv.execution_trace.v1 schema with artifact_id and per-test outputs/execution-trace.json artifacts. The direct replay trace source also uses execution-trace language (execution_traces in target config and replay_execution_trace in provider raw metadata), while established result index.jsonl rows remain unchanged.

Derived trace consumers are documented and tested as projections over the canonical artifact: Provider Message[], outputs/transcript.jsonl, TraceSummary, normalized/compact tool trajectory views, OTLP JSON export bodies, and replay provider responses. Per-test transcript JSONL now uses agentv.transcript.message events stored on the execution trace root span, so compatibility rows can include user/system input turns without changing replay's assistant-output projection.

Design Notes

  • Chose agentv.execution_trace.v1 instead of agentv.trace.v1 because agentv.trace.v1 is already used for the normalized trajectory read model.
  • Removed public trace_envelope naming from new wire/config surfaces rather than adding aliases; this trace surface is still early and direct trace compatibility was not required.
  • Kept result JSONL compatibility stable: generated run rows still point at artifact_dir and do not add execution_trace_path or trace_envelope_path.
  • traceEnvelopeToMessages() remains assistant/output-only for replay provider responses. traceEnvelopeToTranscriptMessages() is the transcript-specific projection for outputs/transcript.jsonl.
  • Per-test transcript_path is emitted only when the execution-trace transcript projection actually writes a file.
  • Stabilized the existing repo-manager idle-timeout test after it repeatedly failed under full-suite load while passing in isolation.

Validation

  • bun test packages/core/test/evaluation/trace-envelope.test.ts — 9 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 44 pass
  • bun test packages/core/test/evaluation/replay-fixtures.test.ts — 9 pass
  • bun test packages/core/test/evaluation/trace-summary.test.ts — 15 pass
  • bun test packages/core/test/evaluation/trace-trajectory.test.ts — 9 pass
  • bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass
  • bun run lint — pass
  • bun run typecheck — pass
  • bun run build — pass, with existing Dashboard bundle-size warning
  • Earlier full-suite validation on this branch: bun run test — pass: core 1894, eval 70, phoenix adapter 22, CLI 584, dashboard 89

Red/Green UAT

Pre-change name/shape captured from the existing main-era trace fixture used the old schema name:

{
  "schema_version": "agentv.trace_envelope.v1",
  "span_ops": ["invoke_agent", "chat"],
  "root_span": "8fb9fe8cfb55b1a0"
}

Current branch replay UAT:

bun apps/cli/src/cli.ts eval examples/showcase/trace-evaluation/evals/coding-agent-replay.eval.yaml --target replay_coding_agent --output /tmp/agentv-pr1401-review

Result: PASS (2/2 scored >= 80%, mean: 100%). Generated per-test sidecars are named outputs/execution-trace.json and validate with schema_version: agentv.execution_trace.v1, artifact_id: execution-trace-..., and artifact keys execution_trace_path, answer_path, response_path, transcript_path. The run index.jsonl has neither execution_trace_path nor trace_envelope_path, preserving established result row compatibility.
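
The index.jsonl compatibility claim above can be spot-checked with a small script. This is a sketch rather than code from the PR: the helper name findUnexpectedTraceKeys, the test_id field, and the sample rows are hypothetical, and only the two forbidden key names and artifact_dir come from the description.

```typescript
// Keys that the PR says must NOT appear on run-level index.jsonl rows.
const FORBIDDEN_INDEX_KEYS = ["execution_trace_path", "trace_envelope_path"];

// Hypothetical helper: returns the forbidden keys found on any row of a JSONL string.
function findUnexpectedTraceKeys(indexJsonl: string): string[] {
  const found = new Set<string>();
  for (const line of indexJsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines such as the trailing newline
    const row = JSON.parse(line) as Record<string, unknown>;
    for (const key of FORBIDDEN_INDEX_KEYS) {
      if (key in row) found.add(key);
    }
  }
  return Array.from(found);
}

// Sample rows are invented for illustration; real rows carry more fields.
const compatibleIndex =
  '{"test_id":"inspect-and-fix-config","artifact_dir":"inspect-and-fix-config"}\n';
const driftedIndex =
  '{"test_id":"x","artifact_dir":"x","trace_envelope_path":"outputs/trace-envelope.json"}\n';

const compatibleResult = findUnexpectedTraceKeys(compatibleIndex);
const driftedResult = findUnexpectedTraceKeys(driftedIndex);
```

An empty result means the rows still match the established shape; any returned key names the drift.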

Per-test outputs/transcript.jsonl now contains replay transcript rows with user and assistant roles. UAT role check:

{
  "inspect-and-fix-config": ["user", "assistant", "assistant"],
  "recover-from-tool-error": ["user", "assistant", "assistant"]
}

Each generated execution trace sidecar contains three agentv.transcript.message events and artifacts.transcript_path: "outputs/transcript.jsonl", proving those transcript rows are regenerated from the canonical execution trace artifact rather than from result.trace as a second source of truth.

Post-Deploy Monitoring & Validation

  • Watch CI for trace/replay/artifact-writer failures and schema validation failures.
  • Search logs and issues for agentv.trace_envelope.v1, trace_envelopes, trace-envelope.json, Invalid execution trace replay record, and Replay provider requires exactly one replay source.
  • Healthy signal: new eval artifact workspaces contain per-test outputs/execution-trace.json sidecars while existing result commands continue reading index.jsonl rows.
  • Failure signal: replay targets reject newly documented execution_traces, downstream tooling expects trace_envelope keys, or artifact writers stop producing transcript/answer sidecars.
  • Mitigation: revert the trace naming commit before release, or add an explicit migration/alias only if a real direct-trace consumer is identified.
  • Validation window/owner: next CI run and first internal trace/replay dogfood run after merge; owner is the AgentV trace/export track.
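
The log and issue search in the second bullet could be scripted roughly as follows. This is a minimal sketch: the marker strings are copied from the bullet above, while findMonitoredMarkers and the sample log lines are hypothetical and not part of AgentV.

```typescript
// Strings the monitoring plan says to search for: three legacy names
// followed by two replay error messages.
const MONITORED_MARKERS = [
  "agentv.trace_envelope.v1",
  "trace_envelopes",
  "trace-envelope.json",
  "Invalid execution trace replay record",
  "Replay provider requires exactly one replay source",
];

// Hypothetical helper: returns every monitored marker that occurs in the text.
function findMonitoredMarkers(text: string): string[] {
  return MONITORED_MARKERS.filter((marker) => text.includes(marker));
}

// Invented log excerpts for illustration only.
const healthyLog = "wrote outputs/execution-trace.json (agentv.execution_trace.v1)";
const suspectLog = "error: unknown target key trace_envelopes in replay config";

const healthyHits = findMonitoredMarkers(healthyLog);
const suspectHits = findMonitoredMarkers(suspectLog);
```

A non-empty result is a prompt to investigate, not proof of a regression, since the legacy names may still appear in historical fixtures or documentation.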

Compound Engineering
GPT-5

@christso

Collaborator Author

Code review result: changes requested, but GitHub would not allow this account to submit a formal request-changes review on its own PR, so I am posting the review as a PR comment.

Findings:

  1. [P1] Span projections can reverse valid OTLP timestamp order
    packages/core/src/evaluation/trace-envelope.ts:1225 and packages/core/src/evaluation/trace-envelope.ts:1257 sort startTimeUnixNano with localeCompare(), while packages/core/src/evaluation/trace-envelope.ts:1354 builds the normalized TraceArtifact in raw array order. OTLP timestamps are decimal strings, not zero-padded strings, so valid values like 900000000 and 1000000000 sort as 1000000000 before 900000000. I confirmed this with a minimal execution trace: traceEnvelopeToMessages() returned late,early, and traceEnvelopeToToolTrajectoryView() returned Late,Early. That can corrupt Provider Message[], replay candidate selection, transcript projections built from messages, and exact/in-order tool trajectory grading for imported/replayed traces. Please use a shared numeric/BigInt span-time comparator for every ordered projection.

  2. [P2] outputs/transcript.jsonl is still authored from result.trace, not projected from the execution trace artifact
    apps/cli/src/commands/eval/artifact-writer.ts:1259 writes transcript.jsonl from the result-local trace, then apps/cli/src/commands/eval/artifact-writer.ts:1262 independently builds the execution-trace sidecar from the same result. The second writer path repeats at apps/cli/src/commands/eval/artifact-writer.ts:1339 and apps/cli/src/commands/eval/artifact-writer.ts:1342. This leaves two sibling authored trace views even though the PR/bead contract says transcript JSONL is a projection over agentv.execution_trace.v1; the current tests only assert that both artifacts exist and have compatible-looking fields, not that transcript rows are derived from the canonical artifact. Either generate transcript rows from the execution trace projection, or explicitly document this as a compatibility artifact with a focused drift test.

  3. [P2] The OTLP JSON projection emits status values existing AgentV OTLP readers do not understand
    packages/core/src/evaluation/trace-envelope.ts:1441 copies envelope spans into traceEnvelopeToOtlpJson(), and packages/core/src/evaluation/trace-envelope.ts:1451 passes status: span.status through as { code: 'OK' | 'ERROR' | 'UNSET' }. AgentV's existing OTLP JSON importer reads status.code as a numeric OTLP code and counts errors with span.status?.code === 2, so an execution trace projected through this helper would not surface error spans correctly. If this helper is the OTLP export body projection, it should normalize status/kind to the same OTLP shape that OtlpJsonFileExporter and inspect already consume, with a regression test for an ERROR span.
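
The ordering problem in finding 1 can be reproduced in isolation. The sketch below is illustrative and not the code in trace-envelope.ts: SpanLike and both comparator names are hypothetical, and the two timestamps are the ones quoted in the finding.

```typescript
// Hypothetical span shape: only the fields needed to show the ordering issue.
interface SpanLike {
  name: string;
  startTimeUnixNano: string; // OTLP JSON carries nanosecond timestamps as decimal strings
}

// Lexicographic comparison, the behaviour flagged in finding 1.
function compareLexicographic(a: SpanLike, b: SpanLike): number {
  return a.startTimeUnixNano.localeCompare(b.startTimeUnixNano);
}

// Numeric comparison via BigInt, which also avoids precision loss above 2^53.
function compareStartTimeNumeric(a: SpanLike, b: SpanLike): number {
  const left = BigInt(a.startTimeUnixNano);
  const right = BigInt(b.startTimeUnixNano);
  if (left < right) return -1;
  if (left > right) return 1;
  return 0;
}

const spans: SpanLike[] = [
  { name: "late", startTimeUnixNano: "1000000000" },
  { name: "early", startTimeUnixNano: "900000000" },
];

// Copy before sorting so the input array keeps its original order.
const lexicographicOrder = [...spans].sort(compareLexicographic).map((s) => s.name);
const numericOrder = [...spans].sort(compareStartTimeNumeric).map((s) => s.name);
```

Here lexicographicOrder comes out as late, early because "1" sorts before "9" character by character, while numericOrder comes out as early, late. Converting to Number instead of BigInt would also fix this two-span case, but realistic epoch nanosecond values exceed 2^53 and would lose precision.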

Verification:

  • bun test packages/core/test/evaluation/trace-envelope.test.ts — 7 pass
  • bun test packages/core/test/evaluation/replay-fixtures.test.ts — 9 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 43 pass
  • bun run typecheck — pass
  • Additional setup needed in this worktree: bun install, then bun run build so the CLI test could resolve built @agentv/core; build passed with the existing Dashboard chunk-size warning.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 17, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 4251817
Status: ✅  Deploy successful!
Preview URL: https://5c724e08.agentv.pages.dev
Branch Preview URL: https://av-trace-canonical-projectio.agentv.pages.dev

@christso

Collaborator Author

Addressed the blocking review comment in commit 3bc92c69 (fix(trace): align execution trace projections).

Fixes made:

  • Timestamp ordering: added a shared numeric nanosecond comparator and orderedSpans() projection helper. Message[], compact tool trajectory, normalized TraceArtifact, and OTLP JSON now order by numeric start_time_unix_nano, with parent spans kept before child spans on tied starts.
  • outputs/transcript.jsonl: per-test transcript artifacts are now written after outputs/execution-trace.json is built and are generated from traceEnvelopeToMessages(envelope) over the canonical execution trace artifact. The aggregate run-level transcript.jsonl remains the established result JSONL compatibility artifact.
  • OTLP JSON projection: traceEnvelopeToOtlpJson() now emits numeric OTLP span kind and status.code values (UNSET=0, OK=1, ERROR=2) instead of AgentV envelope strings. Added a CLI reader regression that recognizes an ERROR span as trace.error_count = 1.
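
The status normalization in the third fix can be sketched as below. The numeric codes (UNSET=0, OK=1, ERROR=2) and the reader check status?.code === 2 come from this thread; the type names, toOtlpStatus, and countErrorSpans are hypothetical stand-ins, not the actual helpers in trace-envelope.ts or the CLI reader.

```typescript
// Envelope-side status uses string codes, per review finding 3.
type EnvelopeStatusCode = "UNSET" | "OK" | "ERROR";
interface EnvelopeStatus {
  code: EnvelopeStatusCode;
  message?: string;
}

// OTLP-side status uses numeric codes.
interface OtlpStatus {
  code: number;
  message?: string;
}

const OTLP_STATUS_CODE: Record<EnvelopeStatusCode, number> = {
  UNSET: 0,
  OK: 1,
  ERROR: 2,
};

// Hypothetical projection: a missing status is treated as UNSET (an assumption).
function toOtlpStatus(status: EnvelopeStatus | undefined): OtlpStatus {
  const code = OTLP_STATUS_CODE[status?.code ?? "UNSET"];
  return status?.message === undefined ? { code } : { code, message: status.message };
}

// Mirrors the reader-side check quoted in the review: errors are spans with code 2.
function countErrorSpans(spans: { status?: OtlpStatus }[]): number {
  return spans.filter((span) => span.status?.code === 2).length;
}

const projectedSpans = [
  { status: toOtlpStatus({ code: "ERROR", message: "Provider timed out" }) },
  { status: toOtlpStatus({ code: "OK" }) },
  { status: toOtlpStatus(undefined) },
];
const errorCount = countErrorSpans(projectedSpans);
```

With the string form passed through unchanged, the comparison against 2 would never match, which is why the reader missed ERROR spans before the fix.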

Canonical schema remains agentv.execution_trace.v1. Rationale: agentv.trace.v1 is already the normalized trajectory/read-model schema, while execution_trace names the canonical execution span artifact without exposing the implementation term trace_envelope.

Files changed:

  • packages/core/src/evaluation/trace-envelope.ts
  • packages/core/test/evaluation/trace-envelope.test.ts
  • apps/cli/src/commands/eval/artifact-writer.ts
  • apps/cli/test/commands/eval/artifact-writer.test.ts
  • apps/cli/test/commands/trace/trace.test.ts

Red evidence from the pre-fix branch tip (origin/av-trace-canonical-projection):

  • traceEnvelopeToMessages() / traceEnvelopeToToolTrajectoryView() used startTimeUnixNano.localeCompare(...).
  • traceEnvelopeToTraceArtifact() iterated for (const span of envelope.trace.spans) in raw array order.
  • traceEnvelopeToOtlpJson() emitted kind: span.kind and status: span.status.
  • per-test outputs/transcript.jsonl was written before the sidecar via traceToTranscriptJsonLines(result.trace, ...).

Green projection sample from this branch:

{
  "schema_version": "agentv.execution_trace.v1",
  "ordered_messages": ["early", "late"],
  "otlp_root_status": { "code": 2, "message": "Provider timed out" },
  "otlp_root_kind": 0,
  "otlp_span_order": [
    "invoke_agent codex",
    "chat codex",
    "execute_tool EarlyTool",
    "chat codex",
    "execute_tool LateTool"
  ]
}

Validation:

  • bun test packages/core/test/evaluation/trace-envelope.test.ts => 8 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts => 43 pass
  • bun test packages/core/test/evaluation/replay-fixtures.test.ts => 9 pass
  • bun test packages/core/test/evaluation/trace-summary.test.ts => 15 pass
  • bun test packages/core/test/evaluation/trace-trajectory.test.ts => 9 pass
  • bun test apps/cli/test/commands/trace/trace.test.ts => 49 pass
  • bun run lint => pass
  • bun run typecheck => pass on serial rerun
  • bun run build => pass (existing dashboard chunk-size warning only)

Blockers: none known.

@christso

Collaborator Author

Coordinator review blocker before merge:

The branch fixes the earlier timestamp ordering and OTLP status/kind issues, but the transcript projection currently drops user/system rows.

Evidence from local UAT on the branch:

bun apps/cli/src/cli.ts eval examples/showcase/trace-evaluation/evals/coding-agent-replay.eval.yaml --target replay_coding_agent --output /tmp/agentv-pr1401-review

The generated per-test outputs/transcript.jsonl files contain only assistant rows. The unit test in apps/cli/test/commands/eval/artifact-writer.test.ts also changed the old user + assistant expectation to a single assistant row. Both contradict the spec language that outputs/transcript.jsonl remains a derived compatibility/read view, and the fixture matrix, which says the no-tool answer case should include user + final assistant rows.

Please fix before merge:

  • The canonical agentv.execution_trace.v1 artifact must preserve enough input/user/system transcript information to regenerate existing outputs/transcript.jsonl rows, without relying on result.trace as a second source of truth.
  • Keep provider replay response projection separate if replay should return assistant/provider output only. A dedicated transcript projection helper is fine if traceEnvelopeToMessages() intentionally remains provider-response output.
  • Restore/add tests proving per-test outputs/transcript.jsonl has user + assistant rows for the existing artifact-writer case and is generated from the execution trace artifact.
  • Ensure index/artifact transcript_path is only present when the file is actually written.
  • Update the PR body UAT note if needed; it currently says user/assistant rows remain, but the current branch output does not.

No merge until this is addressed and CI is green again.

@christso

Collaborator Author

Addressed the coordinator blocker in commit 4251817f (fix(trace): preserve transcript rows in execution trace).

What changed:

  • Added canonical root-span transcript events named agentv.transcript.message; each stores source order plus a snake_case message object for transcript compatibility rows.
  • Added traceEnvelopeToTranscriptMessages() for outputs/transcript.jsonl so transcript rows are projected from agentv.execution_trace.v1 without changing traceEnvelopeToMessages() / replay provider responses, which remain assistant-output-only.
  • Updated per-test artifact writing to use the transcript projection and to emit transcript_path only when the per-test transcript file is actually written.
  • Restored the artifact-writer transcript expectation to user + assistant rows, with expected rows derived from the execution trace artifact.
  • Updated the execution trace spec and PR body UAT note.
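
The transcript projection described above might look roughly like this. Only the event name agentv.transcript.message, the idea of a stored source order plus a message object, and the outputs/transcript.jsonl path come from this PR; the field names order and message, the interfaces, and both function names are assumptions made for illustration.

```typescript
const TRANSCRIPT_EVENT_NAME = "agentv.transcript.message";

interface TranscriptMessage {
  role: string;
  content: string;
}

// Hypothetical root-span event shape; real events may nest these under attributes.
interface RootSpanEvent {
  name: string;
  order: number;
  message?: TranscriptMessage;
}

// Keep only transcript events, restore source order, and return the messages.
function projectTranscriptRows(events: RootSpanEvent[]): TranscriptMessage[] {
  return events
    .filter(
      (e): e is RootSpanEvent & { message: TranscriptMessage } =>
        e.name === TRANSCRIPT_EVENT_NAME && e.message !== undefined,
    )
    .sort((a, b) => a.order - b.order)
    .map((e) => e.message);
}

// Emit transcript_path only when there is something to write.
function buildTranscriptArtifact(events: RootSpanEvent[]): {
  jsonl?: string;
  artifacts: { transcript_path?: string };
} {
  const rows = projectTranscriptRows(events);
  if (rows.length === 0) return { artifacts: {} };
  const jsonl = rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
  return { jsonl, artifacts: { transcript_path: "outputs/transcript.jsonl" } };
}

// Invented events, deliberately out of order, with one unrelated event mixed in.
const sampleEvents: RootSpanEvent[] = [
  { name: TRANSCRIPT_EVENT_NAME, order: 2, message: { role: "assistant", content: "Done." } },
  { name: "some.other.event", order: 0 },
  { name: TRANSCRIPT_EVENT_NAME, order: 0, message: { role: "user", content: "Fix the config." } },
  { name: TRANSCRIPT_EVENT_NAME, order: 1, message: { role: "assistant", content: "Inspecting." } },
];

const sampleRoles = projectTranscriptRows(sampleEvents).map((m) => m.role);
const sampleArtifact = buildTranscriptArtifact(sampleEvents);
const emptyArtifact = buildTranscriptArtifact([]);
```

Keeping this projection separate from the replay message projection lets transcript rows carry user and system turns while replay responses stay assistant-output-only.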

Validation:

  • bun test packages/core/test/evaluation/trace-envelope.test.ts => 9 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts => 44 pass
  • bun test packages/core/test/evaluation/replay-fixtures.test.ts => 9 pass
  • bun run lint => pass
  • bun run typecheck => pass
  • UAT: bun apps/cli/src/cli.ts eval examples/showcase/trace-evaluation/evals/coding-agent-replay.eval.yaml --target replay_coding_agent --output /tmp/agentv-pr1401-review => PASS 2/2

UAT transcript evidence:

  • inspect-and-fix-config: ["user", "assistant", "assistant"]
  • recover-from-tool-error: ["user", "assistant", "assistant"]
  • Each per-test outputs/execution-trace.json has schema_version: agentv.execution_trace.v1, 3 agentv.transcript.message events, and artifacts.transcript_path: "outputs/transcript.jsonl".

Blockers: none known.

@christso christso marked this pull request as ready for review June 17, 2026 12:19
@christso christso merged commit 121e22b into main Jun 17, 2026
8 checks passed
@christso christso deleted the av-trace-canonical-projection branch June 17, 2026 12:20