Constructed-adversarial accuracy benchmark (blocked-recall = 1.0) by pengfei-threemoonslab · Pull Request #220 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-06-15T05:11:27Z

Summary

The first published, non-circular verdict-accuracy result — the "verdict provably right" north-star artifact. It's the half real-history mining can't supply: W25 showed real merged PRs almost never contain a must_block capability change (9 decided / 241 PRs), so the benchmark's positives come from the repo's bundled fixtures.

The key property is non-circularity: each fixture was built to be a specific case, so its label is the fixture's documented design intent (external ground truth) — not a post-hoc opinion about what the engine returned. Scoring the engine's verdict against those definitional labels is a legitimate measurement, not grading-its-own-homework.

Result (7-case curated set)

label \ verdict	allow	review	insufficient_evidence	block
`safe_to_merge`	2	0	0	0
`needs_human`	0	1	1	0
`must_block`	0	0	0	3

Metric	Value
`blocked_recall`	1.0 (3/3 known-unsafe → blocked)
`benign_escalation_rate`	0.0 (0/2 known-safe escalated; both → allow)
`needs_human_caught`	1.0 (2/2 → review / insufficient_evidence, never auto-passed)

A clean confusion matrix: the gate blocks what's known-unsafe, lets through what's known-safe, and routes the ambiguous to a human.
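The three headline metrics can be recomputed from the confusion matrix above. This is a minimal sketch, not the repo's scoring code: the `score` helper, the `ROWS` literal, and the exact metric definitions are assumptions. In particular it assumes any non-`allow` verdict on a `safe_to_merge` fixture counts as an escalation, and that only `review` / `insufficient_evidence` count as "caught" for `needs_human`.

```python
from collections import Counter

# Assumption: these two verdicts mean "routed to a human".
HUMAN_VERDICTS = {"review", "insufficient_evidence"}

# (label, verdict) pairs transcribed from the 7-case confusion matrix.
ROWS = (
    [("safe_to_merge", "allow")] * 2
    + [("needs_human", "review"), ("needs_human", "insufficient_evidence")]
    + [("must_block", "block")] * 3
)


def score(rows):
    """Compute the headline metrics from (label, verdict) pairs."""
    matrix = Counter(rows)

    def total(label):
        # Number of fixtures carrying this label.
        return sum(n for (lab, _), n in matrix.items() if lab == label)

    def hits(label, verdicts):
        # Number of fixtures with this label whose verdict is in `verdicts`.
        return sum(
            n for (lab, v), n in matrix.items() if lab == label and v in verdicts
        )

    return {
        "blocked_recall": hits("must_block", {"block"}) / total("must_block"),
        "benign_escalation_rate": (
            hits("safe_to_merge", HUMAN_VERDICTS | {"block"}) / total("safe_to_merge")
        ),
        "needs_human_caught": hits("needs_human", HUMAN_VERDICTS) / total("needs_human"),
    }


print(score(ROWS))
```

The sketch divides by the per-label count, so it would raise `ZeroDivisionError` if a label had no fixtures; with only 2 or 3 cases per label, each rate moves in large steps as cases are added.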

How it fits

This is one of three strata the W25 finding called for. The committed mined runs supply the other two — the negative control (the 226 trigger-skips) and the real-history extraction-coverage (insufficient_evidence) rate. Together: blocked-recall from constructed positives + noise/coverage from mined reality.

What's here

benchmark/miner/constructed.py: runs the curated bundled fixtures via `fixture run` and records each verdict as a score-able row, reusing the same confusion-matrix code as the mined corpus.
CLI: `python -m benchmark.miner constructed` (re)generates the corpus; the committed results are results/constructed.{jsonl,csv,labels.csv}.
Tests: a fast committed-data pin of the headline, plus a live-engine regression (test_live_engine_still_produces_the_constructed_verdicts) that re-runs the fixtures so a change regressing a blocked verdict fails in CI, not silently in the data file.
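The comparison behind a live-engine regression of this kind can be sketched as a diff between committed and freshly produced verdicts. This is illustrative only: `verdict_regressions`, the dict shape, and the fixture names in the usage lines are hypothetical and do not reflect the actual test or the committed file schema.

```python
def verdict_regressions(committed, live):
    """Return (fixture, committed_verdict, live_verdict) for every mismatch.

    Both arguments map a fixture name to a verdict string. A fixture that is
    missing from `live` is reported with a live verdict of None.
    """
    problems = []
    for fixture, expected in sorted(committed.items()):
        actual = live.get(fixture)
        if actual != expected:
            problems.append((fixture, expected, actual))
    return problems


# Hypothetical fixture names, for illustration only.
committed = {"fixture_a": "block", "fixture_b": "allow"}
live = {"fixture_a": "review", "fixture_b": "allow"}
print(verdict_regressions(committed, live))
```

A test built on this would assert that the returned list is empty, so a `block` that degrades to `review` fails loudly instead of only showing up as a changed data file.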

Type

Check or risk-model change
Input adapter change
CLI or GitHub Action behavior
Report, schema, or SARIF output
Documentation only (benchmark tooling + tests over bundled fixtures; no src/ change)

Verification

CI is authoritative for `python -m ruff check .`, `python -m compileall -q src tests`, and `python -m pytest`.

Full miner suite (test_miner + test_miner_labels + test_miner_corpus + test_miner_constructed, 34 tests) green; ruff clean; git diff --check clean on the new artifacts.
The live-engine regression re-runs all 7 fixtures and confirms the committed corpus matches.

Release-readiness notes

No user-code import added to default scan paths
No network access added to default scan paths (runs bundled fixtures locally)
New or changed check IDs documented (n/a)
Report/schema changes are additive or documented (n/a — benchmark-local)

…1.0)

The first published, non-circular verdict-accuracy result: the "verdict provably right" north-star artifact, and the half real-history mining can't supply (W25: 9 decided / 241 PRs, almost no must_block positives).

Each bundled fixture was built to be a specific case, so its label is the fixture's documented design intent (external ground truth), not a post-hoc opinion about the engine's output. Scoring the engine's verdict against those definitional labels is therefore non-circular.

benchmark/miner/constructed.py runs the curated fixtures via `fixture run` and records each verdict as a score-able row, reusing the same confusion-matrix code as the mined corpus.

Result on the 7-case curated set (results/constructed.{jsonl,labels.csv}):
- blocked_recall = 1.0 (3/3 must_block → blocked)
- benign_escalation_rate = 0.0 (0/2 safe escalated; both → allow)
- needs_human_caught = 1.0 (2/2 → review / insufficient_evidence)

CLI: `python -m benchmark.miner constructed` (re)generates it.

Tests: a fast committed-data pin of the headline, plus a live-engine regression that re-runs the fixtures so a future change that breaks a blocked verdict fails in CI, not silently in the data file.

README documents the matrix and that the mined runs supply the complementary negative-control and extraction-coverage halves.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… (PR #220 review)

Review P2: the miner's child `python -m agents_shipgate` runs only pin AGENTS_SHIPGATE_AGENT_MODE; they relied on PYTHONPATH being set (conftest does this under pytest, nothing does outside it). So the documented `python -m benchmark.miner constructed` (and mine/evaluate) could resolve to an older installed agents-shipgate. Reproduced: with src/ off the path the child hit the editable 0.8.0/0.11.0 and 4 fixtures errored "no report.json".

Fix at the shared layer: cli_env() (was evaluate._cli_env) now prepends the checkout's src/ to the child PYTHONPATH, so the child imports THIS source tree regardless of what's installed. constructed.py drops its weaker local copy and uses the shared cli_env, so mine / evaluate / constructed are all hermetic.

Tests: a deterministic unit test that cli_env puts <repo>/src first, and a CLI-level integration test that regenerates the constructed corpus with src/ removed from the parent env and still exits 0 with all 7 rows. The repro now succeeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
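The PYTHONPATH fix described in this commit can be sketched as follows. This is an assumption-laden illustration, not the repo's `cli_env()`: the signature, the `repo_root` and `base_env` parameters, and the de-duplication step are hypothetical, and the AGENTS_SHIPGATE_AGENT_MODE pin is left out because its value is not given here.

```python
import os
from pathlib import Path


def cli_env(repo_root, base_env=None):
    """Build a child-process environment whose PYTHONPATH starts with <repo>/src."""
    env = dict(os.environ if base_env is None else base_env)
    src = str(Path(repo_root) / "src")
    existing = env.get("PYTHONPATH", "")
    # Drop empty entries and any earlier copy of src so it appears exactly once.
    rest = [p for p in existing.split(os.pathsep) if p and p != src]
    env["PYTHONPATH"] = os.pathsep.join([src, *rest])
    return env


# A child started with this env resolves imports from the checkout's src/
# before any installed copy of the package, e.g.:
#   subprocess.run([sys.executable, "-m", "agents_shipgate"], env=cli_env(repo_root))
```

Putting src/ first matters because Python takes the first match on the import path, so an installed wheel or an editable install of another checkout cannot shadow the source tree being benchmarked.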

#220 review)

Review P2: the prior cli_env() fix made child `python -m agents_shipgate` subprocesses import this checkout, but evaluate_pr() imports `agents_shipgate.triggers` in the PARENT process to compute run/skip. Run by hand with src/ off the parent path (an older installed wheel, or an editable .pth for a different checkout), `mine`/`evaluate` decided triggers from a stale catalog before the hermetic subprocesses ran.

Fix: _ensure_repo_src_on_path() prepends this checkout's src/ to sys.path immediately before that lazy import (idempotent; no-op under pytest). Verified end-to-end: with PYTHONPATH=<repo-root only>, the CLI parent now imports agents_shipgate 0.13.0 from this worktree's src and its trigger decision matches the in-process result exactly.

Tests: a deterministic unit test that the helper front-loads src when absent, and a CLI-level `evaluate` integration test run with src/ excluded from the parent env (exit 0, trigger decision from this checkout).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
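The parent-process half of the fix can be sketched as an idempotent `sys.path` prepend. This is a hedged illustration, not the repo's `_ensure_repo_src_on_path()`: the name `ensure_src_on_path`, its `repo_root` parameter, and the boolean return value are hypothetical, and treating "src already anywhere on sys.path" as the no-op condition is an assumption.

```python
import sys
from pathlib import Path


def ensure_src_on_path(repo_root):
    """Put <repo_root>/src at the front of sys.path if it is not already listed.

    Returns True if sys.path was modified and False if it was a no-op.
    """
    src = str(Path(repo_root).resolve() / "src")
    if src in sys.path:
        # Already importable from this checkout (e.g. set up by a test harness).
        return False
    sys.path.insert(0, src)
    return True
```

Calling the helper immediately before the lazy import, rather than at module import time, keeps the path change scoped to the code path that actually needs it.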

pengfei-threemoonslab and others added 3 commits June 14, 2026 22:11

pengfei-threemoonslab merged commit 8ab1706 into main Jun 16, 2026
2 checks passed

pengfei-threemoonslab merged 3 commits into main from claude/wsp-accuracy-constructed
