14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,19 @@
 # Changelog

+## Unreleased
+
+- **A named high concern now routes to review, not `insufficient_evidence`.**
+When a scan turns up an *active* (not baseline-accepted) high/critical review
+finding, the release decision is now `review_required` even if low-confidence
+extraction would otherwise have produced `insufficient_evidence`. Both
+verdicts are equally non-auto-mergeable, but `review_required` points the
+human at a specific, actionable finding (e.g. the new
+`SHIP-SCOPE-TOOLKIT-UNBOUNDED`) instead of the vaguer "we couldn't see
+enough." `blocked` still outranks everything; IE still fires when the only
+signal is thin extraction. The 2026-06-01 Stripe pilot's silent/IE case now
+surfaces as a routed review. `evidence_gaps` are preserved on the report
+either way, so the extraction-coverage signal is not lost.
+
 ## 0.13.0 - 2026-06-12

 - **Accepted-debt exception workflow (baseline schema 0.6).** `baseline save`
36 changes: 29 additions & 7 deletions STABILITY.md
@@ -331,14 +331,36 @@ the scan also has low-confidence tools or source warnings.

 When there are no blockers, `insufficient_evidence` means the static inputs are
 not strong enough for Shipgate to gate release confidently. It does **not**
-prove the agent is unsafe. The decision fires when low-confidence tools are at
-least `max(1, ceil(tool_count × 0.5))`, or when source-loader warnings exceed
-`3`. One to three source warnings without blockers route to `review_required`
-so a human still sees the degraded source coverage.
-
-The intended recovery is to provide clearer local evidence — for example an MCP
+prove the agent is unsafe. The evidence is considered below threshold when
+low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or when
+source-loader warnings exceed `3`. One to three source warnings without
+blockers route to `review_required` so a human still sees the degraded source
+coverage.
+
+**Active high/critical findings take precedence over the IE label.** When the
+evidence is below threshold *and* there is an active (non-baseline-accepted)
+high- or critical-severity review finding, the decision is `review_required`
+rather than `insufficient_evidence`. Both verdicts are
+equally non-mergeable (`can_merge_without_human` is false either way), but
+`review_required` points the reviewer at a specific, named finding instead of
+the vaguer "we couldn't see enough." The evidence gap is **not** lost: the
+underlying counts remain in `release_decision.evidence_coverage`
+(`low_confidence_tool_count`, `source_warning_count`, `evidence_gaps[]`), so a
+consumer must read those fields rather than the verdict label alone to know
+whether evidence was degraded. `insufficient_evidence` still fires when the only
+signal is weak evidence with no named high/critical concern; `blocked` still
+takes precedence over both. The precedence is therefore:
+`blocked` → `review_required` (active high/critical) → `insufficient_evidence`
+→ `review_required` (other) → `passed`.
+
+The intended recovery for a degraded-evidence case — whichever of the two
+verdicts it lands on — is to provide clearer local evidence — for example an MCP
 export, OpenAPI spec, explicit local tool inventory, broader OpenAI Agents SDK
-source path, or validation trace — and rerun the scan.
+source path, or validation trace — and rerun the scan. When the decision is
+`review_required` because of an active high/critical finding, also resolve that
+finding. `agents-shipgate verify` keeps both cases human-routed
+(`fix_task.actor = "human"`): a degraded-evidence case never opens an automated
+coding-agent fix path, regardless of which verdict it carries.

 ### Check IDs

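Reviewer note: the precedence order documented in the STABILITY.md hunk can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in `release_decision.py`. The names `is_degraded`, `decide`, and `has_review_items` are hypothetical, and the last `review_required` (other) branch is inferred from the prose (review items exist, or one to three source warnings).

```python
import math

# Thresholds as documented in STABILITY.md; the constant names are illustrative.
LOW_CONFIDENCE_TOOL_RATIO = 0.5
MAX_TOLERATED_SOURCE_WARNINGS = 3


def is_degraded(low_confidence_tools: int, source_warnings: int, tool_count: int) -> bool:
    """Evidence is below threshold per the documented rule."""
    threshold = max(1, math.ceil(tool_count * LOW_CONFIDENCE_TOOL_RATIO))
    return (
        low_confidence_tools >= threshold
        or source_warnings > MAX_TOLERATED_SOURCE_WARNINGS
    )


def decide(
    *,
    has_blockers: bool,
    has_active_high_review: bool,
    low_confidence_tools: int,
    source_warnings: int,
    tool_count: int,
    has_review_items: bool,
) -> str:
    """Apply the documented precedence, highest first."""
    if has_blockers:
        return "blocked"
    if has_active_high_review:
        # A named high/critical concern outranks the insufficient_evidence label.
        return "review_required"
    if is_degraded(low_confidence_tools, source_warnings, tool_count):
        return "insufficient_evidence"
    # Assumption: "other" review covers review items and 1-3 source warnings.
    if has_review_items or source_warnings > 0:
        return "review_required"
    return "passed"
```

With four tools, two low-confidence tools meet the threshold of `max(1, ceil(4 × 0.5)) = 2`, so the same scan yields `insufficient_evidence` without an active high finding and `review_required` with one.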
2 changes: 1 addition & 1 deletion docs/agent-contract-current.md
@@ -60,7 +60,7 @@ hashes. It is not a second gate; the release gate remains

 In `agents-shipgate-reports/report.json`:

-- `release_decision.decision` — `"blocked"` / `"review_required"` / `"insufficient_evidence"` / `"passed"`. Baseline-aware. **This is the gating signal.** Blockers take precedence. If there are no blockers, `insufficient_evidence` (added v0.14) fires when evidence coverage is degraded past threshold: low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. One to three source warnings without blockers route to `review_required`. `insufficient_evidence` means the scan cannot confidently gate release from the available static evidence; it does not prove the agent is unsafe. Switch on the enum with a `review_required` fallback for unknown future values.
+- `release_decision.decision` — `"blocked"` / `"review_required"` / `"insufficient_evidence"` / `"passed"`. Baseline-aware. **This is the gating signal.** Precedence is `blocked` → `review_required` (active high/critical) → `insufficient_evidence` → `review_required` (other) → `passed`. Blockers take precedence over everything. The evidence is degraded past threshold when low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. With no blockers, degraded evidence is `insufficient_evidence` (added v0.14) **unless** there is an active (non-baseline-accepted) high/critical review finding, in which case it is `review_required` so the reviewer is pointed at a named concern rather than the vaguer label. The evidence gap is preserved either way in `release_decision.evidence_coverage` — read those counts, not the verdict label, to detect degraded evidence. One to three source warnings without blockers also route to `review_required`. `insufficient_evidence` means the scan cannot confidently gate release from the available static evidence; it does not prove the agent is unsafe. Switch on the enum with a `review_required` fallback for unknown future values.
 - `release_decision.blockers[]` — items that block release on this run.
 - `release_decision.review_items[]` — items the human reviewer should look at; includes baseline-matched accepted debt.
 - `release_decision.{blockers,review_items}[].capability_refs` (v0.24+) — stable capability IDs copied from the originating finding when a policy or policy-pack rule matched a `CapabilityFactV1`. Empty for findings that are not capability-policy matches. This is audit metadata only; `release_decision.decision` remains the gate.
8 changes: 6 additions & 2 deletions docs/report-reading-for-agents.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,15 +28,19 @@ The CLI's stable contract names this signal explicitly: run `agents-shipgate con

Branch on the four values (treat unknown future values as `review_required` per the [STABILITY.md additivity contract](../STABILITY.md#what-may-change-additively-in-any-minor-release)):

Precedence (highest first): `blocked` → `review_required` (active high/critical) → `insufficient_evidence` → `review_required` (other) → `passed`.

| `decision` | Meaning | Agent behavior |
|---|---|---|
| `"blocked"` | Active, unaccepted blockers exist. CI will fail in strict mode. | Surface blockers; do not auto-merge; do not assert evidence categories — see [`agent-autofix-boundary.md`](agent-autofix-boundary.md). |
| `"insufficient_evidence"` (v0.14+) | With no active blockers, evidence coverage is degraded past threshold: low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. One to three source warnings route to `review_required`. The scan can't gate release reliably, but this does not prove the agent is unsafe. | Surface the `release_decision.reason` verbatim; recommend gathering deeper sources (MCP exports, OpenAPI specs, explicit inventories, broader SDK source paths, eval traces) and re-running. Do not auto-merge. |
| `"review_required"` | Review items exist (often baseline-matched accepted debt, capability/intent misalignments, or sub-threshold evidence gaps). | Surface review items as a human handoff; safe mechanical patches may still apply via `apply-patches --confidence high`. |
| `"insufficient_evidence"` (v0.14+) | With no active blockers **and no active high/critical review finding**, evidence coverage is degraded past threshold: low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. The scan can't gate release reliably, but this does not prove the agent is unsafe. | Surface the `release_decision.reason` verbatim; recommend gathering deeper sources (MCP exports, OpenAPI specs, explicit inventories, broader SDK source paths, eval traces) and re-running. Do not auto-merge. |
| `"review_required"` | Review items exist (often baseline-matched accepted debt, capability/intent misalignments, or sub-threshold evidence gaps). This **also** covers a degraded-evidence case that carries an active (non-baseline-accepted) high/critical finding — the verdict names the concern instead of the vaguer `insufficient_evidence`, but the evidence gap is still present in `evidence_coverage`. One to three source warnings without blockers also land here. | Surface review items as a human handoff. Safe mechanical patches may still apply via `apply-patches --confidence high` — **unless** evidence is degraded (check `evidence_coverage.low_confidence_tool_count` / `source_warning_count`), in which case treat it like `insufficient_evidence` and gather deeper sources first. `verify`'s `fix_task` already routes degraded-evidence cases to a human. |
| `"passed"` | No active blockers, no review items, evidence coverage clean. | Mechanical patches (if any) may apply; otherwise nothing to do. |

The decision is **baseline-aware**: a baseline-matched critical surfaces in `release_decision.review_items` (accepted debt), not in `release_decision.blockers`. Compare with the legacy `summary.status` field, which is *baseline-blind* — see Anti-patterns below.

> **Don't switch on the verdict label to detect degraded evidence.** A degraded-evidence case can surface as either `insufficient_evidence` or `review_required` (the latter when an active high/critical finding names the concern). Read `release_decision.evidence_coverage.{low_confidence_tool_count, source_warning_count, evidence_gaps[]}` to know whether evidence was thin, regardless of the label.

### Step 2 · `release_decision.{reason, blockers, review_items, fail_policy.would_fail_ci}`

Once you have the decision, read the supporting fields:
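Reviewer note: on the consumer side, the guidance in this hunk (fall back to `review_required` for unknown values, and read `evidence_coverage` instead of the verdict label) might look like the sketch below. It assumes only the field names quoted in the docs. The helper names `read_gate` and `may_apply_mechanical_patches` are hypothetical, and treating any nonzero count or gap as "thin" is a deliberately conservative stand-in for the real threshold, which also depends on the tool count.

```python
KNOWN_DECISIONS = {"blocked", "review_required", "insufficient_evidence", "passed"}


def read_gate(report: dict) -> tuple[str, bool]:
    """Return (decision, evidence_thin) from a parsed report.json dict."""
    release = report.get("release_decision") or {}
    decision = release.get("decision")
    if decision not in KNOWN_DECISIONS:
        # Unknown or missing values are treated as review_required.
        decision = "review_required"

    coverage = release.get("evidence_coverage") or {}
    # Conservative heuristic (an assumption): any recorded gap counts as thin.
    evidence_thin = bool(
        coverage.get("low_confidence_tool_count", 0)
        or coverage.get("source_warning_count", 0)
        or coverage.get("evidence_gaps")
    )
    return decision, evidence_thin


def may_apply_mechanical_patches(decision: str, evidence_thin: bool) -> bool:
    """One reading of the table: patch on passed, or on review without thin evidence."""
    if decision == "passed":
        return True
    return decision == "review_required" and not evidence_thin
```

A `review_required` report that still carries evidence gaps is therefore handled like `insufficient_evidence`: no patches are applied until deeper sources are gathered and the scan is rerun.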
2 changes: 1 addition & 1 deletion llms-full.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1009,7 +1009,7 @@ hashes. It is not a second gate; the release gate remains

In `agents-shipgate-reports/report.json`:

- `release_decision.decision` — `"blocked"` / `"review_required"` / `"insufficient_evidence"` / `"passed"`. Baseline-aware. **This is the gating signal.** Blockers take precedence. If there are no blockers, `insufficient_evidence` (added v0.14) fires when evidence coverage is degraded past threshold: low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. One to three source warnings without blockers route to `review_required`. `insufficient_evidence` means the scan cannot confidently gate release from the available static evidence; it does not prove the agent is unsafe. Switch on the enum with a `review_required` fallback for unknown future values.
- `release_decision.decision` — `"blocked"` / `"review_required"` / `"insufficient_evidence"` / `"passed"`. Baseline-aware. **This is the gating signal.** Precedence is `blocked` → `review_required` (active high/critical) → `insufficient_evidence` → `review_required` (other) → `passed`. Blockers take precedence over everything. The evidence is degraded past threshold when low-confidence tools are at least `max(1, ceil(tool_count × 0.5))`, or source-loader warnings exceed `3`. With no blockers, degraded evidence is `insufficient_evidence` (added v0.14) **unless** there is an active (non-baseline-accepted) high/critical review finding, in which case it is `review_required` so the reviewer is pointed at a named concern rather than the vaguer label. The evidence gap is preserved either way in `release_decision.evidence_coverage` — read those counts, not the verdict label, to detect degraded evidence. One to three source warnings without blockers also route to `review_required`. `insufficient_evidence` means the scan cannot confidently gate release from the available static evidence; it does not prove the agent is unsafe. Switch on the enum with a `review_required` fallback for unknown future values.
- `release_decision.blockers[]` — items that block release on this run.
- `release_decision.review_items[]` — items the human reviewer should look at; includes baseline-matched accepted debt.
- `release_decision.{blockers,review_items}[].capability_refs` (v0.24+) — stable capability IDs copied from the originating finding when a policy or policy-pack rule matched a `CapabilityFactV1`. Empty for findings that are not capability-policy matches. This is audit metadata only; `release_decision.decision` remains the gate.
50 changes: 45 additions & 5 deletions src/agents_shipgate/ci/release_decision.py
@@ -34,6 +34,26 @@ def _low_confidence_tool_threshold(tool_count: int) -> int:
     return max(1, math.ceil(tool_count * _LOW_CONFIDENCE_TOOL_RATIO))


+def evidence_below_ie_threshold(
+    evidence: EvidenceCoverageDecision, *, tool_count: int
+) -> bool:
+    """True when extraction evidence is too weak to gate release on its own.
+
+    This is the exact predicate `build_release_decision` uses to raise the
+    `insufficient_evidence` verdict, exposed so downstream projections can
+    reason about it directly. An active high/critical review finding *elevates*
+    such a case to `review_required` (a more actionable verdict), so the verdict
+    label alone no longer tells a consumer whether evidence was degraded — the
+    verify fix_task authority routing needs this predicate to keep
+    degraded-evidence cases human-routed regardless of which of the two
+    non-mergeable verdicts they landed on.
+    """
+    return (
+        evidence.low_confidence_tool_count >= _low_confidence_tool_threshold(tool_count)
+        or evidence.source_warning_count > _MAX_TOLERATED_SOURCE_WARNINGS
+    )
+
+
 def build_release_decision(
     *,
     report: ReadinessReport,
@@ -55,6 +75,11 @@ def build_release_decision(
     review_items: list[ReleaseDecisionItem] = []
     contribution_rules: list[ContributionRule] = []
     blocker_severities: set[Severity] = {"critical", *fail_on_resolved}
+    # Phase 2c: an active (non-accepted) high/critical finding routed to
+    # review is a *named* concern — the gate has decided "a human must look",
+    # which is more specific than "insufficient_evidence". Tracked here so the
+    # decision can prefer review_required over IE when one exists (see below).
+    has_active_high_review = False

     # v0.17: iterate the FULL findings list (not just `active`) so the
     # audit row set is exhaustive over report.findings. The branching
@@ -121,6 +146,11 @@ def build_release_decision(
             or finding.requires_human_review is True
         ):
             review_items.append(_to_item(finding))
+            if (
+                finding.severity in {"critical", "high"}
+                and finding.baseline_status != "matched"
+            ):
+                has_active_high_review = True
             contribution_rules.append(
                 _rule(
                     finding,
@@ -178,15 +208,25 @@ def build_release_decision(
         exit_code=exit_code,
     )

-    low_confidence_threshold = _low_confidence_tool_threshold(len(tools))
+    evidence_is_degraded = evidence_below_ie_threshold(evidence, tool_count=len(tools))

     decision: ReleaseDecisionStatus
     if blockers:
         decision = "blocked"
-    elif (
-        evidence.low_confidence_tool_count >= low_confidence_threshold
-        or evidence.source_warning_count > _MAX_TOLERATED_SOURCE_WARNINGS
-    ):
+    elif has_active_high_review:
+        # Phase 2c: a named, active high/critical concern (e.g. an unbounded
+        # toolkit mounted on the agent) is not "insufficient evidence" — the
+        # gate HAS something concrete for a human to review. Prefer the
+        # actionable review_required over the vaguer insufficient_evidence.
+        # Both are equally non-auto-mergeable (can_merge_without_human=False),
+        # so this loses no safety; the low-confidence detail is still carried
+        # in evidence_coverage.evidence_gaps. IE stays the verdict when the
+        # only signal is weak evidence with no named high concern. NOTE: when
+        # evidence is *also* degraded here, the verify fix_task still routes to
+        # a human via evidence_below_ie_threshold (the same predicate), so the
+        # elevation never opens an auto-fix path on weak evidence.
+        decision = "review_required"
+    elif evidence_is_degraded:
         decision = "insufficient_evidence"
     elif (
         review_items
28 changes: 26 additions & 2 deletions src/agents_shipgate/cli/verify/fix_task.py
@@ -14,6 +14,7 @@

 import shlex

+from agents_shipgate.ci.release_decision import evidence_below_ie_threshold
 from agents_shipgate.core.agent_controls import FORBIDDEN_SHORTCUTS
 from agents_shipgate.schemas.report import Finding, ReadinessReport
 from agents_shipgate.schemas.verifier import (
@@ -109,6 +110,19 @@ def build_fix_task(

     gating = _gating_findings(report)

+    # Degraded static evidence (below the IE threshold) is an authority gap
+    # regardless of which verdict it produced. An active high/critical finding
+    # elevates a degraded-evidence case from `insufficient_evidence` to
+    # `review_required` (a more actionable verdict),
+    # so keying the escalation on `merge_verdict == "insufficient_evidence"`
+    # alone would let a mechanically-fixable high finding open a coding-agent
+    # auto-fix path on weak evidence. Compute the threshold directly from the
+    # same predicate the release decision uses so the two never drift.
+    evidence_degraded = evidence_below_ie_threshold(
+        report.release_decision.evidence_coverage,
+        tool_count=len(report.tool_inventory),
+    )
+
     # The coding-agent route is the only non-human outcome and it MUST fail
     # closed: every gating finding has to be explicitly mechanical
     # (``autofix_safe is True`` AND ``requires_human_review is False``). A
@@ -123,6 +137,7 @@ def build_fix_task(
         capability_review.policy_weakened
         or capability_review.trust_root_touched
         or merge_verdict in {"insufficient_evidence", "unknown"}
+        or evidence_degraded
     )
     if mechanical and not authority_escalation:
         return VerifierFixTask(
@@ -143,7 +158,11 @@ def build_fix_task(
         actor="human",
         safe_to_attempt=False,
         instructions=_human_instructions(
-            report, capability_review, gating, merge_verdict=merge_verdict
+            report,
+            capability_review,
+            gating,
+            merge_verdict=merge_verdict,
+            evidence_degraded=evidence_degraded,
         ),
         allowed_repairs=_human_repairs(
             report,
@@ -181,11 +200,16 @@ def _human_instructions(
     gating: list[Finding],
     *,
     merge_verdict: MergeVerdict = "human_review_required",
+    evidence_degraded: bool = False,
 ) -> list[str]:
     decision = report.release_decision
     assert decision is not None
     out: list[str] = [decision.reason]
-    if merge_verdict == "insufficient_evidence":
+    # Surface the concrete "make the hidden authority enumerable" remedies
+    # whenever evidence is degraded — not only on the bare IE verdict. A
+    # high-finding case elevated to review_required carries the same
+    # evidence gap and the human needs the same remedy.
+    if merge_verdict == "insufficient_evidence" or evidence_degraded:
         out.extend(_insufficient_evidence_remedies(report))
     if capability_review.policy_weakened:
         out.append(
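Reviewer note: the fail-closed routing that the `fix_task.py` hunks extend can be summarised in a few lines. The sketch below is a simplified stand-in, not the project's implementation. `GatingFinding`, `route_actor`, and the `"coding_agent"` actor label are hypothetical names, and treating an empty findings list as non-mechanical is an assumption chosen to keep the sketch fail-closed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GatingFinding:
    """Hypothetical stand-in carrying only the two flags the routing reads."""

    autofix_safe: Optional[bool]
    requires_human_review: Optional[bool]


def route_actor(
    findings: Sequence[GatingFinding],
    *,
    policy_weakened: bool,
    trust_root_touched: bool,
    merge_verdict: str,
    evidence_degraded: bool,
) -> str:
    # Every gating finding must be explicitly mechanical; None fails closed.
    mechanical = bool(findings) and all(
        f.autofix_safe is True and f.requires_human_review is False
        for f in findings
    )
    # Degraded evidence escalates regardless of which verdict label it produced.
    authority_escalation = (
        policy_weakened
        or trust_root_touched
        or merge_verdict in {"insufficient_evidence", "unknown"}
        or evidence_degraded
    )
    return "coding_agent" if mechanical and not authority_escalation else "human"
```

The `evidence_degraded` term is the piece this PR adds: without it, a mechanically fixable high finding that was elevated to `review_required` could reach the automated path even though the underlying evidence was thin.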