feat(eval): add plan-mode and HITL contract evals (D2 follow-up) by zjshen14 · Pull Request #236 · zjshen14/opencli

zjshen14 · 2026-06-07T15:13:47Z

Summary

Extends RunTapeOptions in the replay harness with confirmFn? and forcesConfirmation? so contract evals can drive HITL decisions without an interactive terminal.
Adds a synthesised JSONL tape (plan-mode-write-block) whose single /plan-prefixed turn contains a write call, proving the executor's readOnly guard fires before the call reaches the registry.
Adds src/eval/replay/contract.test.ts with 9 tests covering two D2 contract surfaces: plan-mode write-blocking and the three HITL paths (user_denied, non_interactive, allow).

Related issue

Closes #234

Test plan

npm run typecheck && npm run lint && npm run format:check && npm test — all pass (713 tests, 9 new)
Plan-mode: tool_denied(plan_mode) fires, tool_exec_start count is 0, tape exhausted cleanly
HITL deny: tool_denied(user_denied) fires when confirmFn returns "deny"
HITL non-interactive: tool_denied(non_interactive) fires when no confirmFn is provided
HITL allow: no denial, tool_exec_start fires once when confirmFn returns "allow"

https://claude.ai/code/session_01HRHNetsbTUajSxK33fby1u

Generated by Claude Code

Extends the replay harness with two surfaces from the D2 contract-eval backlog (#234): **Plan-mode contract eval** — a synthesised JSONL tape whose single turn is prefixed with `/plan`, causing agent.run() to run with readOnly=true. The tape's write tool call is blocked by the executor's readOnly guard before reaching TapeRegistry, emitting tool_denied(plan_mode). Asserts: - tool_denied fires exactly once with reason="plan_mode" - tool_exec_start / tool_exec_end counts are both 0 - tape exhausted cleanly, 0 unconsumed results **HITL contract eval** — three programmatic tapes exercising the three confirmation paths: - confirmFn present + returns "deny" → tool_denied(user_denied) - confirmFn absent (non-interactive context) → tool_denied(non_interactive) - confirmFn present + returns "allow" → tool executes normally, no denial RunTapeOptions gains confirmFn? and forcesConfirmation? to wire these paths; runner.ts calls agent.setConfirmFn() and agent.setForcesConfirmationFn() after construction. Closes #234 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

zjshen14 · 2026-06-08T16:15:16Z

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

What's good: Solid approach — the synthesized JSONL tape for plan-mode, the dedicated contract.test.ts, and the three HITL paths (deny / non-interactive / allow) all read cleanly. The forcesConfirmation escape hatch for TapeRegistry tools is a pragmatic design. CI is green.

Overlap with #237

This PR and #237 were opened one day apart and both address the same #234 sub-items. Critically, both PRs make identical additions to runner.ts: adding confirmFn and forcesConfirmation to RunTapeOptions and wiring them into runTape. The difference is where the test code lands:

PR	Test location	Approach
#236 (this PR)	New `src/eval/replay/contract.test.ts` + JSONL tape + expected JSON	Dedicated contract eval module with file-based tapes
#237 (newer)	Extends `src/eval/replay/synthesized.test.ts`	Inline programmatic tapes in existing file

Merging both would produce a runner.ts conflict and duplicate coverage. Please close one before merging the other.

Design trade-offs:

feat(eval): add plan-mode and HITL contract evals (D2 follow-up) #236: Better separation of concerns — contract evals are clearly distinct from the general synthesized-replay tests; JSONL tapes are inspectable, durable artifacts.
feat(eval): add plan-mode and HITL contract evals to replay harness #237: Less file proliferation; no new JSONL/JSON artifacts needed; synthesized tests stay co-located.

Recommendation: Pick one and close the other — no code-quality blocker in either.

Generated by Claude Code

zjshen14 · 2026-06-26T16:13:55Z

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

What's good: Well-structured PR. Using a dedicated contract.test.ts rather than extending synthesized.test.ts keeps the intent clear — contract surface documentation is a distinct concern from synthesized behaviour tests. The JSONL tape + .expected.json fixture pattern is more durable than programmatic tape construction. The runner.ts extension is surgical: two optional fields and 3 lines of wiring, correctly importing ConfirmFn from core/executor.ts which respects the layer boundary. README update rounds it out.

Concern: Direct conflict with #237

Both this PR and #237 make identical changes to src/eval/replay/runner.ts — adding the same confirmFn? and forcesConfirmation? fields to RunTapeOptions, the same import type { ConfirmFn }, and the same 3-line wiring block in runTape(). Both originate from the same base commit (9c1073b5). Merging either will conflict on runner.ts for the other.

The test files differ:

This PR (feat(eval): add plan-mode and HITL contract evals (D2 follow-up) #236): creates src/eval/replay/contract.test.ts (9 tests) + JSONL tape + expected JSON fixture + README update
feat(eval): add plan-mode and HITL contract evals to replay harness #237: adds 6 tests inline to src/eval/replay/synthesized.test.ts

This PR is the more complete implementation — dedicated file, fixture-driven, documented. #237's 6 inline tests cover overlapping paths (plan-mode block, HITL deny/non-interactive/allow) without the tape fixture.

Recommendation: Merge this PR. If the inline tests from #237 are also wanted in synthesized.test.ts, they can be added as a follow-up once this lands. Close or rebase #237 to eliminate the runner.ts conflict.

Generated by Claude Code

zjshen14 mentioned this pull request Jun 7, 2026

D2 follow-up: sandbox / plan-mode / HITL contract evals #234

Closed

4 tasks

This was referenced Jun 8, 2026

feat(eval): add plan-mode and HITL contract evals to replay harness #237

Open

fix: activate_skill leaves orphaned functionCall with no functionResponse — breaks Gemini/Anthropic on next turn #241

Closed

This was referenced Jun 16, 2026

bug: config.temperature is loaded but never applied to the LLM client #251

Open

Design: ReAct loop quality improvements + dual-mode execution (ReAct vs Planner) #19

Open

feat: Phase A6 — inline diffs + compact tool rows in renderer #89

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236

feat(eval): add plan-mode and HITL contract evals (D2 follow-up)#236
zjshen14 wants to merge 1 commit into
mainfrom
feat/issue-234-plan-mode-hitl-contract-evals

zjshen14 commented Jun 7, 2026

Uh oh!

zjshen14 commented Jun 8, 2026

Uh oh!

zjshen14 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zjshen14 commented Jun 7, 2026

Summary

Related issue

Test plan

Uh oh!

zjshen14 commented Jun 8, 2026

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

Overlap with #237

Uh oh!

zjshen14 commented Jun 26, 2026

Review: feat(eval): add plan-mode and HITL contract evals (D2 follow-up)

Concern: Direct conflict with #237

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants