Constructed-adversarial accuracy benchmark (blocked-recall = 1.0)#220

Merged
pengfei-threemoonslab merged 3 commits into main from claude/wsp-accuracy-constructed on Jun 16, 2026
Conversation

@pengfei-threemoonslab

Contributor

Summary

The first published, non-circular verdict-accuracy result — the "verdict provably right" north-star artifact. It's the half real-history mining can't supply: W25 showed real merged PRs almost never contain a must_block capability change (9 decided / 241 PRs), so the benchmark's positives come from the repo's bundled fixtures.

The key property is non-circularity: each fixture was built to be a specific case, so its label is the fixture's documented design intent (external ground truth) — not a post-hoc opinion about what the engine returned. Scoring the engine's verdict against those definitional labels is a legitimate measurement, not grading-its-own-homework.

Result (7-case curated set)

| label \ verdict | allow | review | insufficient_evidence | block |
|---|---|---|---|---|
| safe_to_merge | 2 | 0 | 0 | 0 |
| needs_human | 0 | 1 | 1 | 0 |
| must_block | 0 | 0 | 0 | 3 |

| Metric | Value |
|---|---|
| blocked_recall | 1.0 (3/3 known-unsafe → blocked) |
| benign_escalation_rate | 0.0 (0/2 known-safe escalated; both → allow) |
| needs_human_caught | 1.0 (2/2 → review / insufficient_evidence, never auto-passed) |

A clean confusion matrix: the gate blocks what's known-unsafe, lets through what's known-safe, and routes the ambiguous to a human.
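
As a worked check, the three headline metrics can be recomputed from the matrix above. This is only a sketch: the function names are hypothetical, and the metric definitions are assumptions read off the table (an escalation is any non-allow verdict on a safe_to_merge case; a needs_human case counts as caught only when it lands in review or insufficient_evidence). It is not the code in benchmark/miner.

```python
# Confusion matrix copied from the Result table: rows are labels, columns are verdicts.
MATRIX = {
    "safe_to_merge": {"allow": 2, "review": 0, "insufficient_evidence": 0, "block": 0},
    "needs_human": {"allow": 0, "review": 1, "insufficient_evidence": 1, "block": 0},
    "must_block": {"allow": 0, "review": 0, "insufficient_evidence": 0, "block": 3},
}


def rate(row, verdicts):
    """Fraction of one label's cases that received any of the given verdicts."""
    total = sum(row.values())
    if total == 0:
        return None  # no cases with this label, so the rate is undefined
    return sum(row[v] for v in verdicts) / total


def headline(matrix):
    """Assumed metric definitions, inferred from the table (hypothetical helper)."""
    return {
        # known-unsafe cases that were blocked
        "blocked_recall": rate(matrix["must_block"], ["block"]),
        # known-safe cases that received anything other than allow
        "benign_escalation_rate": rate(
            matrix["safe_to_merge"], ["review", "insufficient_evidence", "block"]
        ),
        # ambiguous cases routed to a human rather than auto-passed
        "needs_human_caught": rate(
            matrix["needs_human"], ["review", "insufficient_evidence"]
        ),
    }


print(headline(MATRIX))
```

With only 7 cases, each rate moves in large steps: a single missed must_block case would drop blocked_recall from 1.0 to 2/3.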

How it fits

This is one of three strata the W25 finding called for. The committed mined runs supply the other two — the negative control (the 226 trigger-skips) and the real-history extraction-coverage (insufficient_evidence) rate. Together: blocked-recall from constructed positives + noise/coverage from mined reality.

What's here

  • benchmark/miner/constructed.py — runs the curated bundled fixtures via fixture run, records each verdict as a score-able row, reusing the same confusion-matrix code as the mined corpus.
  • CLI: python -m benchmark.miner constructed (re)generates the corpus; the results are committed as results/constructed.{jsonl,csv,labels.csv}.
  • Tests: a fast committed-data pin of the headline, plus a live-engine regression (test_live_engine_still_produces_the_constructed_verdicts) that re-runs the fixtures so a change regressing a blocked verdict fails in CI, not silently in the data file.
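
The live-engine regression described in the last bullet can be pictured roughly as follows. This is a sketch under assumptions: the row fields (fixture, label, verdict), the fixture names, and the run_fixture callable are all hypothetical stand-ins, and the real test re-runs the bundled fixtures through the engine rather than a stub.

```python
import json

# Hypothetical committed rows; the real file is results/constructed.jsonl
# and its actual schema may differ.
COMMITTED_JSONL = """\
{"fixture": "case_a", "label": "must_block", "verdict": "block"}
{"fixture": "case_b", "label": "safe_to_merge", "verdict": "allow"}
"""


def load_rows(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_drift(committed_rows, run_fixture):
    """Re-run every fixture and report rows whose live verdict differs.

    run_fixture is an assumed callable mapping a fixture name to a verdict.
    """
    drift = []
    for row in committed_rows:
        live = run_fixture(row["fixture"])
        if live != row["verdict"]:
            drift.append((row["fixture"], row["verdict"], live))
    return drift


# A stub engine that agrees with the committed data produces no drift.
rows = load_rows(COMMITTED_JSONL)
stub_engine = {"case_a": "block", "case_b": "allow"}.get
assert find_drift(rows, stub_engine) == []
```

A test built this way fails as soon as the engine stops blocking a must_block fixture, so the regression surfaces in CI rather than only after someone regenerates the data file.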

Type

  • Check or risk-model change
  • Input adapter change
  • CLI or GitHub Action behavior
  • Report, schema, or SARIF output
  • Documentation only (benchmark tooling + tests over bundled fixtures; no src/ change)

Verification

CI is authoritative for python -m ruff check ., python -m compileall -q src tests, and python -m pytest.

  • Full miner suite (test_miner + test_miner_labels + test_miner_corpus + test_miner_constructed, 34 tests) green; ruff clean; git diff --check clean on the new artifacts.
  • The live-engine regression re-runs all 7 fixtures and confirms the committed corpus matches.

Release-readiness notes

  • No user-code import added to default scan paths
  • No network access added to default scan paths (runs bundled fixtures locally)
  • New or changed check IDs documented (n/a)
  • Report/schema changes are additive or documented (n/a — benchmark-local)

pengfei-threemoonslab and others added 3 commits June 14, 2026 22:11
…1.0)

The first published, non-circular verdict-accuracy result — the "verdict
provably right" north-star artifact, and the half real-history mining can't
supply (W25: 9 decided / 241 PRs, almost no must_block positives).

Each bundled fixture was built to be a specific case, so its label is the
fixture's documented design intent (external ground truth), not a post-hoc
opinion about the engine's output — scoring the engine's verdict against those
definitional labels is non-circular. benchmark/miner/constructed.py runs the
curated fixtures via `fixture run` and records each verdict as a score-able
row, reusing the same confusion-matrix code as the mined corpus.

Result on the 7-case curated set (results/constructed.{jsonl,labels.csv}):
- blocked_recall        = 1.0  (3/3 must_block → blocked)
- benign_escalation_rate = 0.0 (0/2 safe escalated; both → allow)
- needs_human_caught    = 1.0  (2/2 → review / insufficient_evidence)

CLI: `python -m benchmark.miner constructed` (re)generates it. Tests: a fast
committed-data pin of the headline + a live-engine regression that re-runs the
fixtures so a future change that breaks a blocked verdict fails in CI, not
silently in the data file. README documents the matrix and that the mined runs
supply the complementary negative-control + extraction-coverage halves.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (PR #220 review)

Review P2: the miner's child `python -m agents_shipgate` runs only pin
AGENTS_SHIPGATE_AGENT_MODE; they relied on PYTHONPATH being set (conftest does
this under pytest, nothing does outside it). So the documented
`python -m benchmark.miner constructed` (and mine/evaluate) could resolve to an
older installed agents-shipgate. Reproduced: with src/ off the path the child
hit the editable 0.8.0/0.11.0 and 4 fixtures errored "no report.json".

Fix at the shared layer: cli_env() (was evaluate._cli_env) now prepends the
checkout's src/ to the child PYTHONPATH, so the child imports THIS source tree
regardless of what's installed. constructed.py drops its weaker local copy and
uses the shared cli_env, so mine / evaluate / constructed are all hermetic.

Tests: a deterministic unit test that cli_env puts <repo>/src first, and a
CLI-level integration test that regenerates the constructed corpus with src/
removed from the parent env and still exits 0 with all 7 rows. The repro now
succeeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
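
The cli_env() change in the commit above amounts to putting the checkout's src/ first on the child's PYTHONPATH. A minimal sketch, assuming a signature that takes the repo root and an optional base environment (both assumptions; the real helper's signature is not shown in this PR page, and the AGENTS_SHIPGATE_AGENT_MODE pin it also applies is left out because its value is not given here):

```python
import os
from pathlib import Path


def cli_env(repo_root, base_env=None):
    """Return a child-process environment whose PYTHONPATH starts with <repo_root>/src.

    Sketch only: the parameters are hypothetical and the agent-mode pin is omitted.
    """
    env = dict(os.environ if base_env is None else base_env)
    src = str(Path(repo_root) / "src")
    existing = env.get("PYTHONPATH", "")
    # Drop empty entries and any earlier copy of src so it appears exactly once, first.
    rest = [p for p in existing.split(os.pathsep) if p and p != src]
    env["PYTHONPATH"] = os.pathsep.join([src, *rest])
    return env


# Usage: pass the result as env= when spawning the child interpreter, e.g.
# subprocess.run([sys.executable, "-m", "agents_shipgate", ...], env=cli_env(repo_root))
```

Because entries on PYTHONPATH are searched ahead of site-packages, the child should resolve agents_shipgate from this checkout even when an older wheel or a different editable install is present.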
#220 review)

Review P2: the prior cli_env() fix made child `python -m agents_shipgate`
subprocesses import this checkout, but evaluate_pr() imports
`agents_shipgate.triggers` in the PARENT process to compute run/skip. Run by
hand with src/ off the parent path (an older installed wheel, or an editable
.pth for a different checkout), `mine`/`evaluate` decided triggers from a stale
catalog before the hermetic subprocesses ran.

Fix: _ensure_repo_src_on_path() prepends this checkout's src/ to sys.path
immediately before that lazy import (idempotent; no-op under pytest). Verified
end-to-end: with PYTHONPATH=<repo-root only>, the CLI parent now imports
agents_shipgate 0.13.0 from this worktree's src and its trigger decision
matches the in-process result exactly.

Tests: a deterministic unit test that the helper front-loads src when absent,
and a CLI-level `evaluate` integration test run with src/ excluded from the
parent env (exit 0, trigger decision from this checkout).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
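
The parent-process fix in the commit above can be sketched in the same spirit. The function below is a hypothetical stand-in for _ensure_repo_src_on_path(): it takes the repo root as a parameter (an assumption) and treats "src already on sys.path" as the no-op case, which matches the described behaviour under pytest.

```python
import sys
from pathlib import Path


def ensure_repo_src_on_path(repo_root):
    """Prepend <repo_root>/src to sys.path if it is absent.

    Returns True when the path was added and False when it was already
    present, so repeated calls are idempotent. Hypothetical sketch.
    """
    src = str(Path(repo_root).resolve() / "src")
    if src in sys.path:
        return False
    sys.path.insert(0, src)
    return True


# Call this immediately before the lazy import it protects, for example
# before importing agents_shipgate.triggers in the parent process.
```

Placing the call right before the lazy import matters: anything imported earlier would already be cached in sys.modules from whichever installation was found first.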
@pengfei-threemoonslab pengfei-threemoonslab merged commit 8ab1706 into main Jun 16, 2026
2 checks passed