5 changes: 3 additions & 2 deletions benchmark/miner/CALIBRATION.md
@@ -40,8 +40,9 @@ unusable for calibration even descriptively:
- `tools_scanned` was captured from the wrong place (`summary`, which carries
no tool count) and came back `null` on every row — the ratio denominator was
missing entirely. **Fixed** in `evaluate._tool_count` (now reads
-`tool_surface.total_tools`); future mines record it. The committed corpus
-predates the fix and still shows `null` — re-mine to populate.
+`tool_surface.total_tools`) and **validated on real data** by the 2026-W26
+run, whose decided rows all record it (W24/W25 predate the fix and still show
+`null`). The denominator is now captured; the *labels* still are not.
- The row schema records `evidence_gaps` (low-confidence tools **+** source
warnings, combined) but not the split, so the two threshold terms can't be
separated. Splitting them is a `MinedRow` schema change, which forces a full
49 changes: 41 additions & 8 deletions benchmark/miner/README.md
@@ -81,19 +81,21 @@ python -m benchmark.miner evaluate \
metrics are in [`LABELING.md`](LABELING.md). Generate the turnkey worksheet
with `python -m benchmark.miner labels`; a ready blank copy is committed
alongside each run's results as `<run>.labels.template.csv` (one per run in
-the table below — currently `2026-W24-…` and `2026-W25-…`). Label the run
+the table below — currently `2026-W24-…`, `2026-W25-…`, and `2026-W26-…`). Label the run
you mean to score, then `python -m benchmark.miner score --results <jsonl>
--labels <csv>` prints the confusion matrix + headline accuracy metrics.

| Run | Date | Repos | Rows | Notes |
|---|---|---|---|---|
| [`2026-W24-mined.csv`](results/2026-W24-mined.csv) | 2026-06-12 | stripe/ai, openai/openai-agents-python, crewAIInc/crewAI-examples | 121 (latest 40 merged PRs each + stripe/ai#232) | Schema v0.2 (re-run with baseline-gated `verify_*` receipts; supersedes the v0.1 artifact in place). Findings below. |
| [`2026-W25-mined.csv`](results/2026-W25-mined.csv) | 2026-06-12 | google/adk-samples, langchain-ai/langgraph, modelcontextprotocol/servers | 120 (latest 40 merged PRs each) | Widen run over 3 new framework families. Schema v0.2. Findings below. |
+| [`2026-W26-mined.csv`](results/2026-W26-mined.csv) | 2026-06-16 | stripe/agent-toolkit, block/goose, pydantic/pydantic-ai | 120 (latest 40 merged PRs each) | Deepen run over agent **apps/toolkits**. First run with `tools_scanned` captured (#223); decided rows are cold-start `head_decision=review_required` but `verify`-effective `insufficient_evidence`. Schema v0.2. Findings below. |

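The "`verify`-effective" verdict in the W26 row follows the rule quoted in the W26 findings (`labels.effective_verdict` = `verify_verdict or head_decision`). A minimal sketch of that fallback, assuming plain string verdicts; the function signature and the sample rows are illustrative, not the project's actual code:

```python
from typing import Optional


def effective_verdict(verify_verdict: Optional[str], head_decision: str) -> str:
    """Prefer the per-PR verify receipt; fall back to the cold-start decision.

    Python's `or` returns the first truthy operand, so a missing (None) or
    empty verify verdict falls through to head_decision.
    """
    return verify_verdict or head_decision


# A W26-style decided row: cold-start says review_required, verify says IE.
w26_row = {
    "head_decision": "review_required",
    "verify_verdict": "insufficient_evidence",
}
# Hypothetical row with no verify receipt: the head decision is scored as-is.
no_receipt = {"head_decision": "review_required", "verify_verdict": None}

print(effective_verdict(w26_row["verify_verdict"], w26_row["head_decision"]))
print(effective_verdict(no_receipt["verify_verdict"], no_receipt["head_decision"]))
```

The first call prints `insufficient_evidence`, which is why the W26 decided rows add no scored `review_required` cases even though their `head_decision` says otherwise.
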
## Constructed-adversarial accuracy — the blocked-recall proof

-Real merged PRs almost never contain a `must_block` capability change (W25:
-9 decided / 241), so the accuracy benchmark's **positives** come from the
+Real merged PRs almost never contain a `must_block` capability change (15
+decided / 361 across W24–W26, and **zero** of those 15 are `must_block`), so
+the accuracy benchmark's **positives** come from the
repo's bundled fixtures, each built to be a specific case. The labels are each
fixture's **documented design intent** — external ground truth, not a post-hoc
opinion about the engine's output — so scoring the engine's verdict against
@@ -120,15 +122,46 @@ with `python -m benchmark.miner constructed --out … --labels-out …`; score w
The live engine is re-run against these fixtures in CI
(`tests/test_miner_constructed.py`), so a change that regresses a blocked
verdict fails there rather than silently in the data file. The mined runs below
-supply the complementary halves — the **negative control** (the 226
+supply the complementary halves — the **negative control** (the 336
trigger-skips) and the real-history **extraction-coverage** (`insufficient_evidence`) rate.
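
The headline counts can be cross-checked against the run table with a few lines of arithmetic. This is only a sketch: the per-run row counts, the 336 trigger-skips, and the 15 decided rows are copied from this document, and it assumes the two buckets are disjoint.

```python
# Row counts per run, copied from the run table.
rows_per_run = {"2026-W24": 121, "2026-W25": 120, "2026-W26": 120}

total = sum(rows_per_run.values())
trigger_skips = 336  # negative control, as quoted in the text
decided = 15         # decided rows across W24-W26, as quoted in the text

skip_rate = trigger_skips / total
# Rows in neither bucket (for example, the scan_failed rows).
other = total - trigger_skips - decided

print(total)               # 361
print(f"{skip_rate:.1%}")  # 93.1%
print(other)               # 10
```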

+### 2026-W26 findings — deepen run over agent apps/toolkits
+
+- **App/toolkit repos do yield more decided rows than framework cores — but
+thin, IE-effective, and never `must_block`.** `stripe/agent-toolkit` produced
+6 decided rows (15% of its 40 PRs) vs **0** from `block/goose` and
+`pydantic/pydantic-ai` (a framework core, the same library-internals-churn
+pattern as W25). Their cold-start `head_decision` is `review_required`, but
+the per-PR `verify` receipt — **the verdict the accuracy scorer uses**
+(`labels.effective_verdict` = `verify_verdict or head_decision`) — is
+`insufficient_evidence` for all 6 (the toolkit surface still isn't statically
+resolvable on the base→head diff). So the *scored* verdict mix stays
+IE-dominated; W26 adds **no** scored `review_required` cases. And the 6
+collapse to **one** distinct pattern (a repeated automated "sync skills from
+docs.stripe.com" bot PR), so decided *diversity* added is ≈1. Net: even the
+best app/toolkit repo's decided rows are effectively IE once verified.
+- **`tools_scanned` capture validated on real data (#223).** Every decided row
+records the ratio denominator (`tools_scanned=2`); pinned by
+`test_w26_headline_numbers_reproduce_from_committed_data`. This is the first
+committed run where the IE-threshold ratio is computable from the data.
+- **Engine-robustness bug found:** `block/goose`'s OpenAPI spec crashes `scan`
+with `Config error: Duplicate action_surface action_id` (4 `scan_failed`
+rows) — the OpenAPI action_id is built from method+path without the
+operationId, so two operations on `GET /sessions/{session_id}` collide. A
+third-party spec must never hard-crash a scan; chipped as a follow-up (same
+fail-soft class as #212/#214).
+- **Confirms the W25 implication:** mining agent **application/toolkit** repos
+is the only real-history source of more decided rows, but it does **not**
+surface `must_block` positives — those still must come from the
+constructed-adversarial stratum.

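The `action_id` collision described in the W26 findings can be reproduced in miniature. The sketch below is hypothetical: the operation records, the `operationId` values, and both ID formats are invented for illustration, and it shows one possible way to disambiguate rather than the fix the project adopted.

```python
from collections import Counter

# Hypothetical operations; two share the same method and path.
operations = [
    {"method": "GET", "path": "/sessions/{session_id}", "operationId": "get_session"},
    {"method": "GET", "path": "/sessions/{session_id}", "operationId": "get_session_history"},
    {"method": "POST", "path": "/sessions", "operationId": "create_session"},
]


def id_from_method_path(op: dict) -> str:
    """ID built from method+path only, the scheme the findings describe."""
    return f"{op['method']} {op['path']}"


def id_with_operation_id(op: dict) -> str:
    """Assumed alternative: append operationId to break ties."""
    return f"{op['method']} {op['path']}#{op['operationId']}"


def duplicates(ids: list) -> list:
    """Return IDs that occur more than once, instead of raising on the first."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


collisions = duplicates([id_from_method_path(op) for op in operations])
resolved = duplicates([id_with_operation_id(op) for op in operations])

print(collisions)  # ['GET /sessions/{session_id}']
print(resolved)    # []
```

Collecting duplicates into a list rather than raising is in the spirit of the fail-soft behavior the finding asks for: a scan could report the colliding IDs as a warning and continue.
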
### 2026-W25 findings — diminishing returns from framework-core breadth

- **The base rate of capability-changing merged PRs is low, and now quantified.**
-Across both runs — **6 repos / 241 merged PRs — 226 (93%) organically
-trigger-skip and only 9 are decided.** The trigger noise bound is strongly
-validated on real history; but real-history mining is an *inefficient* source
+Across all three runs — **9 repos / 361 merged PRs — 336 (93%) organically
+trigger-skip and only 15 are decided** (none `must_block`). The trigger noise
+bound is strongly validated on real history; but real-history mining is an
+*inefficient* source
of decided cases, especially from framework **cores**: `langgraph` and
`modelcontextprotocol/servers` produced **zero** decided rows (library-
internals churn and TS-MCP sources the static extractor doesn't resolve).
@@ -141,7 +174,7 @@ trigger-skips) and the real-history **extraction-coverage** (`insufficient_evide
real-world gap — extraction *coverage*, failing safe, not a wrong verdict.
- **Implication for the accuracy corpus (P3):** do not chase decided
*positives* by mining more framework cores. The labeled corpus should compose
-three strata — mined-real for the **negative** control (the 226 trigger-skips)
+three strata — mined-real for the **negative** control (the 336 trigger-skips)
and IE/coverage cases; **constructed-adversarial** for the `must_block`
positives (already seeded: `samples/_anti_patterns`,
`tests/fixtures/stripe_pr232`, `tests/test_verifier_scenarios.py`,