Constructed-adversarial accuracy benchmark (blocked-recall = 1.0) #220
Merged
…1.0)
The first published, non-circular verdict-accuracy result — the "verdict
provably right" north-star artifact, and the half real-history mining can't
supply (W25: 9 decided / 241 PRs, almost no must_block positives).
Each bundled fixture was built to be a specific case, so its label is the
fixture's documented design intent (external ground truth), not a post-hoc
opinion about the engine's output — scoring the engine's verdict against those
definitional labels is non-circular. benchmark/miner/constructed.py runs the
curated fixtures via `fixture run` and records each verdict as a score-able
row, reusing the same confusion-matrix code as the mined corpus.
Result on the 7-case curated set (results/constructed.{jsonl,labels.csv}):
- blocked_recall = 1.0 (3/3 must_block → blocked)
- benign_escalation_rate = 0.0 (0/2 safe → allow)
- needs_human_caught = 1.0 (2/2 → review / insufficient_evidence)
CLI: `python -m benchmark.miner constructed` (re)generates it. Tests: a fast
committed-data pin of the headline + a live-engine regression that re-runs the
fixtures so a future change that breaks a blocked verdict fails in CI, not
silently in the data file. README documents the matrix and that the mined runs
supply the complementary negative-control + extraction-coverage halves.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (PR #220 review)

Review P2: the miner's child `python -m agents_shipgate` runs only pin
AGENTS_SHIPGATE_AGENT_MODE; they relied on PYTHONPATH being set (conftest does
this under pytest, nothing does outside it). So the documented
`python -m benchmark.miner constructed` (and mine/evaluate) could resolve to an
older installed agents-shipgate. Reproduced: with src/ off the path the child
hit the editable 0.8.0/0.11.0 and 4 fixtures errored "no report.json".

Fix at the shared layer: cli_env() (was evaluate._cli_env) now prepends the
checkout's src/ to the child PYTHONPATH, so the child imports THIS source tree
regardless of what's installed. constructed.py drops its weaker local copy and
uses the shared cli_env, so mine / evaluate / constructed are all hermetic.

Tests: a deterministic unit test that cli_env puts <repo>/src first, and a
CLI-level integration test that regenerates the constructed corpus with src/
removed from the parent env and still exits 0 with all 7 rows. The repro now
succeeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
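The PYTHONPATH prepend described in this commit can be sketched roughly as below. This is a minimal illustration, not the repository's code: the name `cli_env_sketch`, its parameters, and the de-duplication step are assumptions, and the AGENTS_SHIPGATE_AGENT_MODE pin that the real helper also applies is left out because its value is not stated here.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def cli_env_sketch(repo_root: Path,
                   base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a child-process env whose PYTHONPATH starts with <repo_root>/src.

    Hypothetical stand-in for the shared cli_env() helper; signature assumed.
    """
    # Copy so the caller's (or the process's) environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    src = str(repo_root / "src")
    # Keep any existing entries, but drop empties and a duplicate of src
    # so that src appears exactly once, at the front.
    rest = [p for p in env.get("PYTHONPATH", "").split(os.pathsep)
            if p and p != src]
    env["PYTHONPATH"] = os.pathsep.join([src, *rest])
    return env
```

Passing the returned mapping as `env=` to `subprocess.run` makes the child interpreter search the checkout's `src/` before any installed copy, which is the hermeticity property the commit is after.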
#220 review)

Review P2: the prior cli_env() fix made child `python -m agents_shipgate`
subprocesses import this checkout, but evaluate_pr() imports
`agents_shipgate.triggers` in the PARENT process to compute run/skip. Run by
hand with src/ off the parent path (an older installed wheel, or an editable
.pth for a different checkout), `mine`/`evaluate` decided triggers from a stale
catalog before the hermetic subprocesses ran.

Fix: _ensure_repo_src_on_path() prepends this checkout's src/ to sys.path
immediately before that lazy import (idempotent; no-op under pytest). Verified
end-to-end: with PYTHONPATH=<repo-root only>, the CLI parent now imports
agents_shipgate 0.13.0 from this worktree's src and its trigger decision
matches the in-process result exactly.

Tests: a deterministic unit test that the helper front-loads src when absent,
and a CLI-level `evaluate` integration test run with src/ excluded from the
parent env (exit 0, trigger decision from this checkout).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
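The parent-side fix in this commit amounts to an idempotent `sys.path` prepend. The sketch below shows one way such a helper might look; the name `ensure_src_on_path`, the explicit `repo_root` parameter, and the boolean return value are assumptions made for illustration, not the signature of `_ensure_repo_src_on_path()`.

```python
import sys
from pathlib import Path


def ensure_src_on_path(repo_root: Path) -> bool:
    """Front-load <repo_root>/src on sys.path if it is absent.

    Returns True when the entry was inserted, False when it was already
    present (for example because a test conftest added it earlier).
    Hypothetical stand-in; the real helper's signature is not shown here.
    """
    src = str(repo_root / "src")
    if src in sys.path:
        return False  # already importable from this checkout: no-op
    sys.path.insert(0, src)  # index 0 wins over site-packages and .pth entries
    return True
```

Calling it immediately before the lazy import means any later `import` in the parent process resolves against this checkout first, and calling it twice changes nothing.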
Summary
The first published, non-circular verdict-accuracy result: the "verdict provably right" north-star artifact. It's the half real-history mining can't supply. W25 showed real merged PRs almost never contain a `must_block` capability change (9 decided / 241 PRs), so the benchmark's positives come from the repo's bundled fixtures.

The key property is non-circularity: each fixture was built to be a specific case, so its label is the fixture's documented design intent (external ground truth), not a post-hoc opinion about what the engine returned. Scoring the engine's verdict against those definitional labels is a legitimate measurement, not grading its own homework.
Result (7-case curated set)
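Labels: `safe_to_merge`, `needs_human`, `must_block`.
- `blocked_recall` = 1.0 (3/3 `must_block` → blocked)
- `benign_escalation_rate` = 0.0 (0/2 safe → allow)
- `needs_human_caught` = 1.0 (2/2 → review / insufficient_evidence)

A clean confusion matrix: the gate blocks what's known-unsafe, lets through what's known-safe, and routes the ambiguous to a human.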
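As a rough illustration of how the three headline rates relate to labelled rows, here is a minimal sketch. It is not the miner's confusion-matrix code: the `(label, verdict)` row shape, the exact verdict strings, and the reading of `benign_escalation_rate` as "safe cases that did not get `allow`" are all assumptions, and the sample rows only mirror the reported 3/2/2 counts rather than reproduce the committed data.

```python
from collections import Counter

# Verdicts treated as "routed to a human" (assumed spellings).
HUMAN_VERDICTS = {"review", "insufficient_evidence"}


def score(rows):
    """Compute the three headline rates from (label, verdict) pairs."""
    totals, hits = Counter(), Counter()
    for label, verdict in rows:
        totals[label] += 1
        if label == "must_block" and verdict == "blocked":
            hits[label] += 1          # known-unsafe case was blocked
        elif label == "safe_to_merge" and verdict != "allow":
            hits[label] += 1          # benign case was escalated
        elif label == "needs_human" and verdict in HUMAN_VERDICTS:
            hits[label] += 1          # ambiguous case reached a human

    def rate(label):
        # None when a stratum has no cases, so it is never read as 0.0.
        return hits[label] / totals[label] if totals[label] else None

    return {
        "blocked_recall": rate("must_block"),
        "benign_escalation_rate": rate("safe_to_merge"),
        "needs_human_caught": rate("needs_human"),
    }


# Illustrative rows shaped like the 7-case set (3 / 2 / 2); not the real file.
sample = (
    [("must_block", "blocked")] * 3
    + [("safe_to_merge", "allow")] * 2
    + [("needs_human", "review"), ("needs_human", "insufficient_evidence")]
)
print(score(sample))
```

Returning `None` for an empty stratum matters for the mined corpus, where `must_block` positives are nearly absent and a 0.0 recall would be misleading.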
How it fits
This is one of three strata the W25 finding called for. The committed mined runs supply the other two: the negative control (the 226 trigger-skips) and the real-history extraction-coverage (`insufficient_evidence`) rate. Together: blocked-recall from constructed positives + noise/coverage from mined reality.
What's here
- `benchmark/miner/constructed.py`: runs the curated bundled fixtures via `fixture run`, records each verdict as a score-able row, reusing the same confusion-matrix code as the mined corpus.
- `python -m benchmark.miner constructed` to (re)generate; committed `results/constructed.{jsonl,csv,labels.csv}`.
- Tests: a fast committed-data pin of the headline, plus a live-engine regression (`test_live_engine_still_produces_the_constructed_verdicts`) that re-runs the fixtures so a change regressing a blocked verdict fails in CI, not silently in the data file.
Type
`src/` change)
Verification
- CI is authoritative for `python -m ruff check .`, `python -m compileall -q src tests`, and `python -m pytest`.
- (`test_miner` + `test_miner_labels` + `test_miner_corpus` + `test_miner_constructed`, 34 tests) green; ruff clean; `git diff --check` clean on the new artifacts.
Release-readiness notes