Skip to content

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236

Open
zjshen14 wants to merge 1 commit into
mainfrom
feat/issue-234-plan-mode-hitl-contract-evals
Open

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236
zjshen14 wants to merge 1 commit into
mainfrom
feat/issue-234-plan-mode-hitl-contract-evals

Conversation

@zjshen14

@zjshen14 zjshen14 commented Jun 7, 2026

Copy link
Copy Markdown
Owner

Summary

  • Extends RunTapeOptions in the replay harness with confirmFn? and forcesConfirmation? so contract evals can drive HITL decisions without an interactive terminal.
  • Adds a synthesised JSONL tape (plan-mode-write-block) whose single /plan-prefixed turn contains a write call, proving the executor's readOnly guard fires before the call reaches the registry.
  • Adds src/eval/replay/contract.test.ts with 9 tests covering two D2 contract surfaces: plan-mode write-blocking and the three HITL paths (user_denied, non_interactive, allow).

Related issue

Closes #234

Test plan

  • npm run typecheck && npm run lint && npm run format:check && npm test — all pass (713 tests, 9 new)
  • Plan-mode: tool_denied(plan_mode) fires, tool_exec_start count is 0, tape exhausted cleanly
  • HITL deny: tool_denied(user_denied) fires when confirmFn returns "deny"
  • HITL non-interactive: tool_denied(non_interactive) fires when no confirmFn is provided
  • HITL allow: no denial, tool_exec_start fires once when confirmFn returns "allow"

https://claude.ai/code/session_01HRHNetsbTUajSxK33fby1u


Generated by Claude Code

Extends the replay harness with two surfaces from the D2 contract-eval
backlog (#234):

**Plan-mode contract eval** — a synthesised JSONL tape whose single turn
is prefixed with `/plan`, causing agent.run() to run with readOnly=true.
The tape's write tool call is blocked by the executor's readOnly guard
before reaching TapeRegistry, emitting tool_denied(plan_mode). Asserts:
- tool_denied fires exactly once with reason="plan_mode"
- tool_exec_start / tool_exec_end counts are both 0
- tape exhausted cleanly, 0 unconsumed results

**HITL contract eval** — three programmatic tapes exercising the three
confirmation paths:
- confirmFn present + returns "deny" → tool_denied(user_denied)
- confirmFn absent (non-interactive context) → tool_denied(non_interactive)
- confirmFn present + returns "allow" → tool executes normally, no denial

RunTapeOptions gains confirmFn? and forcesConfirmation? to wire these
paths; runner.ts calls agent.setConfirmFn() and
agent.setForcesConfirmationFn() after construction.

Closes #234
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zjshen14

zjshen14 commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

What's good: Solid approach — the synthesized JSONL tape for plan-mode, the dedicated contract.test.ts, and the three HITL paths (deny / non-interactive / allow) all read cleanly. The forcesConfirmation escape hatch for TapeRegistry tools is a pragmatic design. CI is green.


Overlap with #237

This PR and #237 were opened one day apart and both address the same #234 sub-items. Critically, both PRs make identical additions to runner.ts: adding confirmFn and forcesConfirmation to RunTapeOptions and wiring them into runTape. The difference is where the test code lands:

PR Test location Approach
#236 (this PR) New src/eval/replay/contract.test.ts + JSONL tape + expected JSON Dedicated contract eval module with file-based tapes
#237 (newer) Extends src/eval/replay/synthesized.test.ts Inline programmatic tapes in existing file

Merging both would produce a runner.ts conflict and duplicate coverage. Please close one before merging the other.

Design trade-offs:

Recommendation: Pick one and close the other — no code-quality blocker in either.


Generated by Claude Code

@zjshen14

Copy link
Copy Markdown
Owner Author

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

What's good: Well-structured PR. Using a dedicated contract.test.ts rather than extending synthesized.test.ts keeps the intent clear — contract surface documentation is a distinct concern from synthesized behaviour tests. The JSONL tape + .expected.json fixture pattern is more durable than programmatic tape construction. The runner.ts extension is surgical: two optional fields and 3 lines of wiring, correctly importing ConfirmFn from core/executor.ts which respects the layer boundary. README update rounds it out.

Concern: Direct conflict with #237

Both this PR and #237 make identical changes to src/eval/replay/runner.ts — adding the same confirmFn? and forcesConfirmation? fields to RunTapeOptions, the same import type { ConfirmFn }, and the same 3-line wiring block in runTape(). Both originate from the same base commit (9c1073b5). Merging either will conflict on runner.ts for the other.

The test files differ:

This PR is the more complete implementation — dedicated file, fixture-driven, documented. #237's 6 inline tests cover overlapping paths (plan-mode block, HITL deny/non-interactive/allow) without the tape fixture.

Recommendation: Merge this PR. If the inline tests from #237 are also wanted in synthesized.test.ts, they can be added as a follow-up once this lands. Close or rebase #237 to eliminate the runner.ts conflict.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

D2 follow-up: sandbox / plan-mode / HITL contract evals

2 participants