Deepen real-history mining: 2026-W26 over agent apps/toolkits by pengfei-threemoonslab · Pull Request #224 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-06-16T22:34:55Z

Continues the verdict-accuracy evidence base (roadmap Phase 3) by mining agent application/toolkit repos — the W25 lesson was that framework cores yield ~0 decided rows.

Run

stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, latest 40 merged PRs each (120 rows), committed under benchmark/miner/results/2026-W26-mined.* with a blank labeling worksheet.

Findings

App/toolkit > cores, but thin and effectively IE (insufficient_evidence). stripe/agent-toolkit gave 6 decided rows (6 of 40, 15%); block/goose + pydantic/pydantic-ai (a framework core) gave 0. The 6 have a cold-start head_decision of review_required, but the per-PR verify receipt is insufficient_evidence for all 6, and the accuracy scorer uses that receipt when it exists (labels.effective_verdict = verify_verdict or head_decision). So W26 adds no scored review_required cases; the effective verdict mix stays IE-dominated. The 6 also collapse to one pattern (a repeated automated "sync skills from docs.stripe.com" bot PR), so decided diversity added ≈ 1. Net: even the best app/toolkit repo's decided rows are effectively IE once verified.
Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided — and zero of the 15 are must_block. This reconfirms the standing conclusion: must_block positives must come from the constructed-adversarial stratum, not real-history mining.
Validates the #223 tools_scanned fix ("IE-threshold: fix tools_scanned capture, examine + hold the constants") on real data. Every W26 decided row records the IE-ratio denominator (tools_scanned); W24/W25 predate the fix and remain null. Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md.
Engine-robustness bug found (chipped, not fixed here). block/goose's OpenAPI spec crashes scan with Config error: Duplicate action_surface action_id (4 scan_failed rows): the OpenAPI action_id is built from method+path without the operationId, so two operations on GET /sessions/{session_id} collide. A third-party spec must fail soft, not hard-crash; this is the same class as #212 ("Add the merged-PR history miner + first real run (WS-P)") and #214 ("init: never write a tool source that scan's input adapters reject").
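The action_id collision described in the last finding can be reproduced in isolation. The sketch below is not the engine's code: the `Operation` type, the helper names, the two operationIds, and the `#` separator are all hypothetical, chosen only to show why method+path alone collides and how a builder could fail soft instead of raising.

```python
from collections import Counter
from typing import Callable, NamedTuple, Optional


class Operation(NamedTuple):
    # Hypothetical stand-in for one parsed OpenAPI operation.
    method: str
    path: str
    operation_id: Optional[str]


def colliding_action_id(op: Operation) -> str:
    # Mirrors the reported behavior: method + path only, operationId ignored.
    return f"{op.method.upper()} {op.path}"


def disambiguated_action_id(op: Operation) -> str:
    # One possible remedy (an assumption, not the project's fix):
    # append the operationId when the spec provides one.
    base = f"{op.method.upper()} {op.path}"
    return f"{base}#{op.operation_id}" if op.operation_id else base


def duplicate_ids(ids: list[str]) -> list[str]:
    # Ids that occur more than once, sorted for stable output.
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def build_surface(ops: list[Operation], make_id: Callable[[Operation], str]):
    # Fail soft: keep the first operation per id and report later
    # collisions to the caller instead of raising a config error.
    surface: dict[str, Operation] = {}
    skipped: list[Operation] = []
    for op in ops:
        action_id = make_id(op)
        if action_id in surface:
            skipped.append(op)
        else:
            surface[action_id] = op
    return surface, skipped


# Two operations on the same method + path; the operationIds are invented.
OPS = [
    Operation("get", "/sessions/{session_id}", "get_session"),
    Operation("get", "/sessions/{session_id}", "get_session_history"),
]
```

With `colliding_action_id` both operations map to the same id, so `build_surface` keeps one and reports the other as skipped; with `disambiguated_action_id` both survive.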

Guard

W26 headline test asserts both the cold-start head_decision (review_required) and the scored effective_verdict (insufficient_evidence), plus tools_scanned populated — so the docs can't drift back to a "non-IE" framing.
Cross-run trigger-skip floor raised 241 → 361.
New corpus files are LF-only.
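A minimal sketch of what the headline guard checks, assuming each decided row is a dict with `head_decision`, `verify_verdict`, and `tools_scanned` keys. The row shape, the function names, and the placeholder rows are illustrative assumptions; they are not the committed W26 data or the actual test body. Only the rule `effective_verdict = verify_verdict or head_decision` comes from the description above.

```python
from typing import Optional


def effective_verdict(verify_verdict: Optional[str], head_decision: str) -> str:
    # The scored verdict: the verify receipt wins when present,
    # otherwise fall back to the cold-start head decision.
    return verify_verdict or head_decision


def check_headline(rows: list[dict]) -> None:
    # Assert both framings at once so neither can drift independently.
    for row in rows:
        assert row["head_decision"] == "review_required"
        scored = effective_verdict(row["verify_verdict"], row["head_decision"])
        assert scored == "insufficient_evidence"
        # The IE-ratio denominator must be recorded for every decided row.
        assert row["tools_scanned"] is not None


# Placeholder rows (values invented for illustration only).
ILLUSTRATIVE_ROWS = [
    {
        "head_decision": "review_required",
        "verify_verdict": "insufficient_evidence",
        "tools_scanned": 1,
    }
    for _ in range(6)
]

check_headline(ILLUSTRATIVE_ROWS)
```

Because the fallback uses `or`, a row with no verify receipt would score as its head_decision, which is why the guard pins both fields rather than only the effective verdict.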

No product surface, no schema change. Full suite + ruff clean.

🤖 Generated with Claude Code

Continues the verdict-accuracy evidence base by mining agent application/toolkit repos (the W25 lesson: framework cores yield ~0 decided).

Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each (120 rows). Results committed under benchmark/miner/results/2026-W26-mined.* + a blank labeling worksheet.

Findings:
- App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first real `review_required` decided rows (cores only ever yielded IE), but they collapse to one pattern (a repeated automated skill-sync bot PR), so decided *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided, and ZERO of the 15 are must_block, reconfirming must_block positives must come from the constructed stratum.
- Validates the #223 tools_scanned capture fix on REAL data: every W26 decided row records the ratio denominator (W24/W25 predate the fix, still null). Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md.
- Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI spec crashes scan with `Config error: Duplicate action_surface action_id` (4 scan_failed). action_id is method+path without operationId, so two ops on GET /sessions/{session_id} collide. A third-party spec must fail soft.

Corpus guard: W26 headline test added; cross-run trigger-skip floor raised 241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…aming)

Addresses the P2 review finding on PR #224. The accuracy scorer uses `labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26 decided rows have `verify_verdict=insufficient_evidence` (only the cold-start `head_decision` is review_required). So W26 adds NO scored review_required cases; the effective verdict mix stays IE-dominated. The earlier "first real review_required / non-IE decided rows" framing was misleading.

- README: table note + W26 findings now state the cold-start head_decision is review_required but the verify-effective (scored) verdict is IE for all 6, so the scored mix is unchanged; net finding = even the best app/toolkit repo's decided rows are effectively IE once verified.
- test_w26_headline: assert BOTH head_decision == review_required AND effective_verdict == insufficient_evidence, so the docs can't drift back to the "non-IE" framing.

No data regenerated; verdicts unchanged. Full suite + ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pengfei-threemoonslab and others added 2 commits June 16, 2026 15:34

pengfei-threemoonslab wants to merge 2 commits into main from claude/w26-deepen-mining
