Phase 2c: a named high concern routes to review, not insufficient_evidence #222
Merged
…dence
An *active* (non-baseline-accepted) high/critical review finding now elevates
the release decision to `review_required` even when the low-confidence
extraction gate would otherwise have produced `insufficient_evidence`.
Why: the 2026-06-01 Stripe pilot and the W24/W25 mining showed the dominant
real-world failure was a silent slide into `insufficient_evidence` — the gate
HAD something concrete (an unbounded toolkit) but reported only "we couldn't
see enough." Phase 2a named that concern (SHIP-SCOPE-TOOLKIT-UNBOUNDED, high);
this routes it to a human instead of burying it under the evidence gate.
Safety: both verdicts are equally non-auto-mergeable
(can_merge_without_human=False), so this loses no gate strength. `blocked`
still outranks everything; IE still fires when the only signal is thin
evidence with no named high concern; baseline-accepted (matched) high debt
does NOT elevate (acknowledged debt, not an active concern). The
low-confidence detail is preserved in evidence_coverage.evidence_gaps, and
_decision_reason already emits the honest combined wording ("N findings need
review and evidence coverage is incomplete").
No schema change (verdict-value change only); generate_schemas --check clean.
- release_decision.py: track has_active_high_review; insert the elevation
branch above the IE branch.
- test_release_decision.py: high-active elevates; medium-active and
high-accepted do not (the contrast cases).
- test_toolkit_bounds_check.py: the stripe-head e2e now asserts
release_decision.decision == "review_required" (ties 2a+2c together).
- CHANGELOG: Unreleased note.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
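
The precedence this commit describes can be sketched as a small decision function. This is an illustration only, not the code in release_decision.py: the `Finding` record, its `blocking` flag (a stand-in for whatever condition yields `blocked`), the function name, and the rule that any remaining active finding yields `review_required` are all assumptions made for the example. Only the verdict strings and the `has_active_high_review` name come from the commit message.

```python
from dataclasses import dataclass

HIGH_SEVERITIES = {"high", "critical"}


@dataclass
class Finding:
    """Hypothetical finding record; the real model is not shown in this PR."""
    severity: str                    # "low" | "medium" | "high" | "critical"
    baseline_accepted: bool = False  # matched, acknowledged debt
    blocking: bool = False           # assumed stand-in for whatever yields `blocked`


def sketch_release_decision(findings, evidence_below_threshold):
    """Return a verdict string following the Phase 2c precedence order."""
    # 1. `blocked` still outranks everything.
    if any(f.blocking for f in findings):
        return "blocked"

    # Baseline-accepted findings are acknowledged debt and never elevate.
    active = [f for f in findings if not f.baseline_accepted]
    has_active_high_review = any(f.severity in HIGH_SEVERITIES for f in active)

    # 2. New branch: a named, active high/critical concern routes to review,
    #    even when the evidence gate would otherwise have fired.
    if has_active_high_review:
        return "review_required"

    # 3. Thin evidence with no named high concern is still insufficient_evidence.
    if evidence_below_threshold:
        return "insufficient_evidence"

    # 4. Any other active finding needs review (assumption for this sketch).
    if active:
        return "review_required"

    return "passed"
```

With thin evidence, an active high finding returns `review_required`, while a medium-active or high-accepted finding still returns `insufficient_evidence`, which mirrors the contrast cases listed for test_release_decision.py.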
…ract docs

Addresses two review findings on PR #222.

[P2] Degraded-evidence cases could route to coding_agent.

The new precedence elevates a below-threshold-evidence case to
`review_required` when an active high/critical finding exists, but `fix_task`
keyed its degraded-evidence authority escalation on
`merge_verdict == "insufficient_evidence"`. A mechanically-fixable high
finding + 2/2 low-confidence tools therefore landed on `review_required` and
routed to coding_agent / safe_to_attempt=True: an auto-fix path on evidence
too weak to gate.

Fix: extract the IE-threshold predicate into a single public helper
`evidence_below_ie_threshold(evidence, tool_count=...)` in release_decision.py
(now the one source of truth, used by build_release_decision too), and OR it
into fix_task's `authority_escalation`. A degraded-evidence case is now always
human-routed regardless of which non-mergeable verdict it carries. The human
also gets the concrete "make the surface enumerable" remedies in that case,
not only on the bare IE verdict.

Regression tests: degraded-evidence + mechanically-fixable high → human (was
the bug); high finding + full evidence → still coding_agent (escalation fires
on degraded evidence, not severity).

[P2] Contract docs described the old precedence.

Updated STABILITY.md, docs/agent-contract-current.md, and
docs/report-reading-for-agents.md to document the active-high/critical
exception, the full precedence order, that the evidence gap is preserved in
`evidence_coverage` (read the counts, not the label, to detect degraded
evidence), and that verify keeps both cases human-routed. Regenerated
llms-full.txt.

Also scrubbed an inaccurate "v0.27" marker (code comments + docs): these
contract docs use `v0.X` for report_schema_version (currently 0.26), and this
change does not bump the schema, so it was replaced with the "Phase 2c"
milestone name.

Full suite green; ruff + schema-drift clean; no schema bump.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
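
A minimal sketch of the routing fix described in this commit, under stated assumptions. The helper name `evidence_below_ie_threshold` and its `tool_count` argument come from the commit message; everything else is hypothetical: the ratio value, the `low_confidence_tools` key, the zero-tool behaviour, the `sketch_fix_task_route` function, and the rule that non-fixable findings go to a human.

```python
# Assumed ratio, for illustration only; the real IE threshold is not stated here.
ASSUMED_LOW_CONFIDENCE_RATIO = 0.5


def evidence_below_ie_threshold(evidence, tool_count):
    """Single predicate shared by every caller that must detect degraded evidence."""
    if tool_count <= 0:
        return False  # assumption: with no tools there is no ratio to evaluate
    low = evidence.get("low_confidence_tools", 0)  # hypothetical key name
    return low / tool_count >= ASSUMED_LOW_CONFIDENCE_RATIO


def sketch_fix_task_route(merge_verdict, evidence, tool_count, mechanically_fixable):
    """Decide who may act on a fix task: a human or the coding agent."""
    # Before the fix, only the verdict label was checked. OR-ing in the shared
    # predicate catches degraded evidence that now surfaces as review_required.
    authority_escalation = (
        merge_verdict == "insufficient_evidence"
        or evidence_below_ie_threshold(evidence, tool_count=tool_count)
    )
    if authority_escalation or not mechanically_fixable:
        return {"route": "human", "safe_to_attempt": False}
    return {"route": "coding_agent", "safe_to_attempt": True}
```

In this sketch a `review_required` verdict with 2/2 low-confidence tools routes to a human, while the same verdict with full evidence still routes to the coding agent, so the escalation keys on evidence quality rather than severity.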
…ded evidence

Addresses the third review finding on PR #222, the report-only counterpart to
the verify fix_task fix.

[P2] `agent_summary.first_recommended_action` still recommended auto-apply on
degraded evidence.

The Phase 2c elevation means a below-IE-threshold scan now surfaces as
`review_required` when an active high/critical finding exists. agent_summary's
action picker prioritized `auto_appliable > 0` before evidence remediation for
ALL review_required cases, on the prior assumption that review_required
implied trustworthy-enough evidence (true before Phase 2c). So an active high
auto_apply finding + 2/2 low-confidence tools emitted a runnable
`apply-patches ... --apply` command, telling a report-only consumer to "fix" a
scan the gate already said it can't trust.

Fix: thread the total tool count into build_agent_summary (report_builder has
`tools`) so it can evaluate the shared `evidence_below_ie_threshold`
predicate, and add a review_required + below-threshold branch to the action
picker that emits the evidence/human info action before auto-apply, exactly
like the insufficient_evidence path. Surgical: only the genuinely
below-threshold case flips; sub-threshold review_required (1-3 source
warnings, or low-confidence tools below the ratio) still recommends
apply-patches with an evidence note.

Regression tests: active high auto_apply + 2/2 low tools (tool_count=2) →
info/no-command (was the bug); same finding with 2/10 low tools
(sub-threshold) → still the apply-patches command (escalation keys on the
ratio, not severity).

Verified the only two machine-actionable, verdict-conditional apply-patches
recommenders are fix_task (fixed last round) and agent_summary (this round);
the other apply-patches mentions are static CLI help / rendered instructions.

Full suite green; ruff + schema-drift clean; no schema bump; no golden ripple.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
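
The action-picker ordering described in this commit might look roughly like the sketch below. The function name, the `kind`/`command` fields, and passing the threshold result in as a precomputed boolean are assumptions for illustration. The command string is kept exactly as elided in the commit message, and the evidence note attached on the sub-threshold path is omitted for brevity.

```python
# The source elides the command arguments; the placeholder is kept verbatim.
APPLY_COMMAND = "apply-patches ... --apply"


def sketch_first_recommended_action(decision, auto_appliable, below_threshold):
    """Pick the first action a report-only consumer should see.

    `below_threshold` stands for the result of the shared
    evidence_below_ie_threshold predicate, computed by the caller.
    """
    info_action = {"kind": "info", "command": None}

    # Evidence remediation comes first whenever the evidence cannot be trusted.
    if decision == "insufficient_evidence":
        return info_action

    # New branch: review_required reached via the Phase 2c elevation is still
    # degraded evidence, so no runnable auto-apply command is emitted.
    if decision == "review_required" and below_threshold:
        return info_action

    # Sub-threshold review_required keeps recommending the patch command.
    if decision == "review_required" and auto_appliable > 0:
        return {"kind": "command", "command": APPLY_COMMAND}

    return info_action
```

Placing the below-threshold check ahead of the `auto_appliable` check is the whole fix in this sketch: the same finding yields a command at 2/10 low-confidence tools and an info-only action at 2/2.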
What

When a scan turns up an active (not baseline-accepted) high/critical review finding, the release decision is now `review_required`, even if low-confidence extraction would otherwise have produced `insufficient_evidence`.

Precedence is now: `blocked` → `review_required` (active high/critical) → `insufficient_evidence` → `review_required` (other) → `passed`.

Why

The 2026-06-01 Stripe pilot (stripe/ai#232) and the W24/W25 mining both showed the dominant real-world failure was a silent slide into `insufficient_evidence`: the gate actually had something concrete (a full toolkit mounted with no scope bound) but reported only the vaguer "we couldn't see enough to gate."

Phase 2a (#221) named that concern with `SHIP-SCOPE-TOOLKIT-UNBOUNDED` (high). Phase 2c makes the verdict act on it: a named high concern routes a human to a specific, actionable finding instead of disappearing into the evidence gate.

Safety

- Both verdicts are equally non-auto-mergeable (`can_merge_without_human=False`), so this loses no gate strength; it just makes the more actionable one win.
- `blocked` still outranks everything.
- The low-confidence detail is preserved in `evidence_coverage.evidence_gaps`, so the extraction-coverage signal is not lost.
- `_decision_reason` already emits the honest combined wording: "N findings need review and evidence coverage is incomplete."

Scope discipline

Verdict-value change only: no `report_schema_version` bump (`generate_schemas.py --check` clean), no new public surface. The `extraction_coverage` ratio surface and IE-threshold calibration are deliberately left as separate, data-driven follow-ups rather than bundled into the core verdict path.

Tests

- `test_active_high_review_finding_outranks_insufficient_evidence`: high-active elevates.
- `test_accepted_high_finding_does_not_outrank_insufficient_evidence`: accepted high does not.
- `test_insufficient_evidence_outranks_review_required` (medium-active) still asserts IE, the contrast case.
- `test_scan_of_stripe_head_fixture_*` e2e now asserts the pilot's head fixture yields `release_decision.decision == "review_required"` (ties 2a + 2c together).

🤖 Generated with Claude Code