Phase 2c: a named high concern routes to review, not insufficient_evidence by pengfei-threemoonslab · Pull Request #222 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-06-16T20:17:38Z

What

When a scan turns up an active (not baseline-accepted) high/critical review finding, the release decision is now review_required even if low-confidence extraction would otherwise have produced insufficient_evidence.

Precedence is now: blocked → review_required (active high/critical) → insufficient_evidence → review_required (other) → passed.
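
The precedence order above can be sketched as a single decision function. This is a minimal illustration, not the contents of release_decision.py: the `Finding` dataclass, the `release_decision` function name, the `has_blocker` flag, and the rule that any remaining active finding triggers the "other" review_required branch are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str                    # "low", "medium", "high", or "critical"
    baseline_accepted: bool = False  # True when matched by an accepted baseline


def release_decision(has_blocker: bool,
                     findings: list[Finding],
                     evidence_below_threshold: bool) -> str:
    """Return a verdict using the Phase 2c precedence order (hypothetical sketch)."""
    # Baseline-accepted findings are acknowledged debt, not active concerns.
    active = [f for f in findings if not f.baseline_accepted]
    has_active_high_review = any(
        f.severity in ("high", "critical") for f in active
    )

    if has_blocker:
        return "blocked"                # still outranks everything
    if has_active_high_review:
        return "review_required"        # the new elevation branch
    if evidence_below_threshold:
        return "insufficient_evidence"  # thin evidence, no named high concern
    if active:
        return "review_required"        # other active findings (assumed rule)
    return "passed"
```

For example, `release_decision(False, [Finding("high")], True)` yields `review_required`, while the same call with `Finding("high", baseline_accepted=True)` or `Finding("medium")` still yields `insufficient_evidence`.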

Why

The 2026-06-01 Stripe pilot (stripe/ai #232) and the W24/W25 mining both showed the dominant real-world failure was a silent slide into insufficient_evidence: the gate actually had something concrete — a full toolkit mounted with no scope bound — but reported only the vaguer "we couldn't see enough to gate."

Phase 2a (#221) named that concern with SHIP-SCOPE-TOOLKIT-UNBOUNDED (high). Phase 2c makes the verdict act on it: a named high concern routes a human to a specific, actionable finding instead of disappearing into the evidence gate.

Safety

- Both verdicts are equally non-auto-mergeable (can_merge_without_human=False); this loses no gate strength, it just makes the more actionable one win.
- blocked still outranks everything.
- insufficient_evidence (IE) still fires when the only signal is thin evidence with no named high concern.
- A baseline-accepted (matched) high finding does not elevate: acknowledged debt is not an active concern (covered by a test).
- The low-confidence detail is preserved on evidence_coverage.evidence_gaps, so the extraction-coverage signal is not lost. _decision_reason already emits the honest combined wording: "N findings need review and evidence coverage is incomplete."

Scope discipline

Verdict-value change only — no report_schema_version bump (generate_schemas.py --check clean), no new public surface. The extraction_coverage ratio surface and IE-threshold calibration are deliberately left as separate, data-driven follow-ups rather than bundled into the core verdict path.

Tests

- test_active_high_review_finding_outranks_insufficient_evidence: high-active elevates.
- test_accepted_high_finding_does_not_outrank_insufficient_evidence: accepted high does not.
- Existing test_insufficient_evidence_outranks_review_required (medium-active) still asserts insufficient_evidence; this is the contrast case.
- test_scan_of_stripe_head_fixture_* e2e now asserts the pilot's head fixture yields release_decision.decision == "review_required" (ties 2a + 2c together).
- Full suite green; zero golden/contract/CLI ripple.

🤖 Generated with Claude Code

…dence

An *active* (non-baseline-accepted) high/critical review finding now elevates the release decision to `review_required` even when the low-confidence extraction gate would otherwise have produced `insufficient_evidence`.

Why: the 2026-06-01 Stripe pilot and the W24/W25 mining showed the dominant real-world failure was a silent slide into `insufficient_evidence`: the gate HAD something concrete (an unbounded toolkit) but reported only "we couldn't see enough." Phase 2a named that concern (SHIP-SCOPE-TOOLKIT-UNBOUNDED, high); this routes it to a human instead of burying it under the evidence gate.

Safety: both verdicts are equally non-auto-mergeable (can_merge_without_human=False), so this loses no gate strength. `blocked` still outranks everything; IE still fires when the only signal is thin evidence with no named high concern; baseline-accepted (matched) high debt does NOT elevate (acknowledged debt, not an active concern). The low-confidence detail is preserved in evidence_coverage.evidence_gaps, and _decision_reason already emits the honest combined wording ("N findings need review and evidence coverage is incomplete").

No schema change (verdict-value change only); generate_schemas --check clean.

- release_decision.py: track has_active_high_review; insert the elevation branch above the IE branch.
- test_release_decision.py: high-active elevates; medium-active and high-accepted do not (the contrast cases).
- test_toolkit_bounds_check.py: the stripe-head e2e now asserts release_decision.decision == "review_required" (ties 2a+2c together).
- CHANGELOG: Unreleased note.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ract docs

Addresses two review findings on PR #222.

[P2] Degraded-evidence cases could route to coding_agent. The new precedence elevates a below-threshold-evidence case to `review_required` when an active high/critical finding exists, but `fix_task` keyed its degraded-evidence authority escalation on `merge_verdict == "insufficient_evidence"`. A mechanically-fixable high finding + 2/2 low-confidence tools therefore landed on `review_required` and routed to coding_agent / safe_to_attempt=True: an auto-fix path on evidence too weak to gate.

Fix: extract the IE-threshold predicate into a single public helper `evidence_below_ie_threshold(evidence, tool_count=...)` in release_decision.py (now the one source of truth, used by build_release_decision too), and OR it into fix_task's `authority_escalation`. A degraded-evidence case is now always human-routed regardless of which non-mergeable verdict it carries. The human also gets the concrete "make the surface enumerable" remedies in that case, not only on the bare IE verdict.

Regression tests: degraded-evidence + mechanically-fixable high → human (was the bug); high finding + full evidence → still coding_agent (escalation fires on degraded evidence, not severity).

[P2] Contract docs described the old precedence. Updated STABILITY.md, docs/agent-contract-current.md, and docs/report-reading-for-agents.md to document the active-high/critical exception, the full precedence order, that the evidence gap is preserved in `evidence_coverage` (read the counts, not the label, to detect degraded evidence), and that verify keeps both cases human-routed. Regenerated llms-full.txt.

Also scrubbed an inaccurate "v0.27" marker (code comments + docs): these contract docs use `v0.X` for report_schema_version (currently 0.26), and this change does not bump the schema; replaced with the "Phase 2c" milestone name.

Full suite green; ruff + schema-drift clean; no schema bump.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
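
The escalation fix described in this commit can be illustrated with a small sketch. Everything here is hypothetical: `below_threshold` is a stand-in for the real `evidence_below_ie_threshold` helper (whose body and signature details are not shown in this PR), the 0.5 ratio is an assumed threshold chosen only so that 2/2 low-confidence tools counts as degraded, and `route_fix_task` is an invented name for the routing step.

```python
def below_threshold(low_confidence_tools: int, tool_count: int,
                    max_low_ratio: float = 0.5) -> bool:
    """Hypothetical stand-in for the shared IE-threshold predicate."""
    if tool_count <= 0:
        return False  # assumption: with no tools there is no ratio to evaluate
    return low_confidence_tools / tool_count >= max_low_ratio


def route_fix_task(merge_verdict: str, mechanically_fixable: bool,
                   degraded_evidence: bool) -> dict:
    """Route a fix task; degraded evidence always escalates to a human."""
    # Before the fix, only the verdict label was checked. The predicate is
    # now OR-ed in, so an elevated review_required verdict cannot hide
    # below-threshold evidence.
    authority_escalation = (
        merge_verdict == "insufficient_evidence" or degraded_evidence
    )
    if authority_escalation or not mechanically_fixable:
        return {"route": "human", "safe_to_attempt": False}
    return {"route": "coding_agent", "safe_to_attempt": True}
```

Under these assumptions, a mechanically-fixable high finding with 2/2 low-confidence tools routes to a human, while the same finding with full evidence still routes to coding_agent, mirroring the two regression tests named in the commit.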

…ded evidence

Addresses the third review finding on PR #222, the report-only counterpart to the verify fix_task fix.

[P2] `agent_summary.first_recommended_action` still recommended auto-apply on degraded evidence. The Phase 2c elevation means a below-IE-threshold scan now surfaces as `review_required` when an active high/critical finding exists. agent_summary's action picker prioritized `auto_appliable > 0` before evidence remediation for ALL review_required cases, on the prior assumption that review_required implied trustworthy-enough evidence (true before Phase 2c). So an active high auto_apply finding + 2/2 low-confidence tools emitted a runnable `apply-patches ... --apply` command, telling a report-only consumer to "fix" a scan the gate already said it can't trust.

Fix: thread the total tool count into build_agent_summary (report_builder has `tools`) so it can evaluate the shared `evidence_below_ie_threshold` predicate, and add a review_required + below-threshold branch to the action picker that emits the evidence/human info action before auto-apply, exactly like the insufficient_evidence path. Surgical: only the genuinely below-threshold case flips; sub-threshold review_required (1-3 source warnings, or low-confidence tools below the ratio) still recommends apply-patches with an evidence note.

Regression tests: active high auto_apply + 2/2 low tools (tool_count=2) → info/no-command (was the bug); same finding with 2/10 low tools (sub-threshold) → still the apply-patches command (escalation keys on the ratio, not severity).

Verified the only two machine-actionable, verdict-conditional apply-patches recommenders are fix_task (fixed last round) and agent_summary (this round); the other apply-patches mentions are static CLI help / rendered instructions.

Full suite green; ruff + schema-drift clean; no schema bump; no golden ripple.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
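
The reordered action picker from this commit might look roughly like the sketch below. It is an illustration under stated assumptions, not the agent_summary code: `pick_first_action`, the returned dict shape, and the 0.5 ratio inside the stand-in `below_threshold` predicate are all hypothetical, and the command string keeps the elision used in the commit message.

```python
def below_threshold(low_confidence_tools: int, tool_count: int,
                    max_low_ratio: float = 0.5) -> bool:
    """Hypothetical stand-in for the shared IE-threshold predicate."""
    if tool_count <= 0:
        return False  # assumption: with no tools there is no ratio to evaluate
    return low_confidence_tools / tool_count >= max_low_ratio


def pick_first_action(decision: str, auto_appliable: int,
                      low_confidence_tools: int, tool_count: int) -> dict:
    """Choose the first recommended action for a report-only consumer."""
    degraded = below_threshold(low_confidence_tools, tool_count)

    # Evidence remediation comes first whenever the evidence is below the
    # threshold, whether the verdict is insufficient_evidence or an
    # elevated review_required.
    if decision == "insufficient_evidence" or (
        decision == "review_required" and degraded
    ):
        return {"kind": "info", "command": None}

    # Sub-threshold review_required keeps the auto-apply recommendation.
    if decision == "review_required" and auto_appliable > 0:
        return {"kind": "command", "command": "apply-patches ... --apply"}

    return {"kind": "info", "command": None}
```

With these assumed values, `pick_first_action("review_required", 1, 2, 2)` returns an info action with no command, while `pick_first_action("review_required", 1, 2, 10)` still returns the apply-patches command, because the escalation keys on the ratio rather than on severity.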

pengfei-threemoonslab and others added 3 commits June 16, 2026 13:17

pengfei-threemoonslab merged commit 7bc250f into main Jun 16, 2026
2 checks passed

pengfei-threemoonslab deleted the claude/phase2c-ie-actionable branch June 16, 2026 21:14

pengfei-threemoonslab merged 3 commits into main from claude/phase2c-ie-actionable
