feat(eval): add plan-mode and HITL contract evals to replay harness #237
zjshen14 wants to merge 1 commit into
The replay runner lacked a way to wire confirmFn / forcesConfirmation for HITL contract evals, and no synthesized tapes exercised the plan-mode write-block or the HITL deny / non-interactive / allow paths.

- Extends RunTapeOptions with confirmFn and forcesConfirmation; wires them to Agent.setConfirmFn / Agent.setForcesConfirmationFn after construction so evals can drive the HITL gate offline.
- Adds plan-mode contract suite: a synthesized tape with a write call in plan mode asserts tool_denied(plan_mode) fires, execute() is never called, and the tape exhausts with zero unconsumed results.
- Adds HITL contract suite (deny / non-interactive / allow): uses forcesConfirmation to trigger the gate without requiresConfirmation on TapeRegistry tools; asserts the correct tool_denied reason and execution log state for each confirmFn outcome.

Part of #234

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
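The wiring described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration, not the repository's code: the Agent class here is a toy stand-in, the deny reason names (user_denied, non_interactive), the event shapes, and the runTape signature are assumptions, and confirmFn is kept synchronous for brevity. Only the names RunTapeOptions, confirmFn, forcesConfirmation, setConfirmFn, setForcesConfirmationFn, and tool_denied come from the PR text.

```typescript
// Hypothetical stand-ins: the real Agent, RunTapeOptions, and event shapes
// live in the repository and may differ from this sketch.
type ConfirmFn = (toolName: string) => boolean; // true means the user allowed the call
type ForcesConfirmationFn = (toolName: string) => boolean;

type DenyReason = "user_denied" | "non_interactive"; // assumed reason names
type AgentEvent =
  | { type: "tool_denied"; tool: string; reason: DenyReason }
  | { type: "tool_executed"; tool: string };

interface RunTapeOptions {
  confirmFn?: ConfirmFn;
  forcesConfirmation?: ForcesConfirmationFn;
}

class Agent {
  private confirmFn?: ConfirmFn;
  private forcesConfirmationFn?: ForcesConfirmationFn;
  readonly events: AgentEvent[] = [];
  readonly executionLog: string[] = [];

  setConfirmFn(fn: ConfirmFn): void {
    this.confirmFn = fn;
  }

  setForcesConfirmationFn(fn: ForcesConfirmationFn): void {
    this.forcesConfirmationFn = fn;
  }

  callTool(tool: string): void {
    // The gate fires only when forcesConfirmation says so for this tool.
    const gated = this.forcesConfirmationFn?.(tool) ?? false;
    if (gated) {
      if (!this.confirmFn) {
        // Assumed behaviour: no confirmFn means a non-interactive session.
        this.events.push({ type: "tool_denied", tool, reason: "non_interactive" });
        return;
      }
      if (!this.confirmFn(tool)) {
        this.events.push({ type: "tool_denied", tool, reason: "user_denied" });
        return;
      }
    }
    this.executionLog.push(tool);
    this.events.push({ type: "tool_executed", tool });
  }
}

function runTape(toolCalls: string[], options: RunTapeOptions = {}): Agent {
  const agent = new Agent();
  // Wire the optional hooks after construction.
  if (options.confirmFn) agent.setConfirmFn(options.confirmFn);
  if (options.forcesConfirmation) agent.setForcesConfirmationFn(options.forcesConfirmation);
  for (const tool of toolCalls) agent.callTool(tool);
  return agent;
}

// The three HITL paths, each driven fully offline.
const denyRun = runTape(["write"], { forcesConfirmation: () => true, confirmFn: () => false });
const nonInteractiveRun = runTape(["write"], { forcesConfirmation: () => true });
const allowRun = runTape(["write"], { forcesConfirmation: () => true, confirmFn: () => true });
```

Because forcesConfirmation is supplied per run, the gate can be triggered without marking any tool as requiring confirmation, which is the property the HITL suite relies on.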
Review: feat(eval): add plan-mode and HITL contract evals to replay harness

What's good: Clean extension of

Overlap with #236

This PR and #236 cover the same

Merging both would conflict on

Design trade-offs:

Recommendation: Pick one of #236 or #237 and close the other. No correctness issues in either.

Generated by Claude Code
Review: feat(eval): add plan-mode and HITL contract evals to replay harness

What's good: The

Concern: Direct conflict with #236

Both this PR and #236 make identical changes to

The test files differ:

Recommendation: Close this PR in favour of #236, which is the more complete implementation. If the 6 inline tests in

Generated by Claude Code
Summary
- Extends RunTapeOptions in src/eval/replay/runner.ts with confirmFn and forcesConfirmation fields, wired to Agent.setConfirmFn / Agent.setForcesConfirmationFn after construction, enabling HITL contract evals to drive the confirmation gate fully offline.
- Adds a plan-mode contract suite to synthesized.test.ts: a synthesized tape with a write call in plan mode asserts tool_denied(plan_mode) fires, execute() is never reached, and the tape exhausts with zero unconsumed results.
- Adds a HITL contract suite (deny / non-interactive / allow): uses forcesConfirmation to trigger the gate without modifying TapeRegistry tools; asserts the correct tool_denied reason and execution-log state for each confirmFn outcome.

Closes part of #234 (plan-mode and HITL sub-items).
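The plan-mode contract in the summary above can be illustrated with a small sketch. Everything here is hypothetical: the Tape shape, the replay function, and the WRITE_TOOLS set are invented for illustration and do not reflect the harness's real types. The sketch only shows why a blocked write leaves zero unconsumed results: the synthesized tape records no result for the write, so replay succeeds only if execute() is never reached for it.

```typescript
// Hypothetical tape shape: the tool calls the model makes, plus the recorded
// results in the order execute() would consume them.
interface Tape {
  toolCalls: string[];
  results: string[];
}

interface ReplayOutcome {
  denied: { tool: string; reason: string }[];
  executed: string[];
  unconsumedResults: number;
}

const WRITE_TOOLS = new Set(["write"]); // assumed set of mutating tools

function replay(tape: Tape, planMode: boolean): ReplayOutcome {
  const results = [...tape.results];
  const denied: { tool: string; reason: string }[] = [];
  const executed: string[] = [];

  for (const tool of tape.toolCalls) {
    if (planMode && WRITE_TOOLS.has(tool)) {
      // Blocked before execution, so no recorded result is consumed.
      denied.push({ tool, reason: "plan_mode" });
      continue;
    }
    const result = results.shift();
    if (result === undefined) {
      throw new Error(`tape has no recorded result for ${tool}`);
    }
    executed.push(tool);
  }

  return { denied, executed, unconsumedResults: results.length };
}

// Synthesized tape: a read with a recorded result, then a write with none.
const planModeTape: Tape = { toolCalls: ["read", "write"], results: ["file contents"] };
const planModeOutcome = replay(planModeTape, true);
```

Replaying the same tape with planMode set to false throws in this sketch, because the write would try to consume a result the tape never recorded. That failure mode is what makes the "zero unconsumed results" assertion meaningful alongside the tool_denied check.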
Test plan
- npm run typecheck && npm run lint && npm run format:check && npm test: all pass (710 tests, 0 failures)
- src/eval/replay/synthesized.test.ts grows from 8 to 14 tests: 6 new tests covering plan-mode block and HITL deny / non-interactive / allow paths

https://claude.ai/code/session_01NnteuN9Vv47jNyFvsa1mBp
Generated by Claude Code