Deepen real-history mining: 2026-W26 over agent apps/toolkits #224
Open
pengfei-threemoonslab wants to merge 2 commits into
Continues the verdict-accuracy evidence base by mining agent application/toolkit repos (the W25 lesson: framework cores yield ~0 decided).

Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each (120 rows). Results committed under benchmark/miner/results/2026-W26-mined.* + a blank labeling worksheet.

Findings:
- App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first real `review_required` decided rows (cores only ever yielded IE), but they collapse to one pattern (a repeated automated skill-sync bot PR), so decided *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided, and ZERO of the 15 are must_block, reconfirming that must_block positives must come from the constructed stratum.
- Validates the #223 tools_scanned capture fix on REAL data: every W26 decided row records the ratio denominator (W24/W25 predate the fix, still null). Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md.
- Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI spec crashes scan with `Config error: Duplicate action_surface action_id` (4 scan_failed). The action_id is method+path without operationId, so two ops on GET /sessions/{session_id} collide. A third-party spec must fail soft.

Corpus guard: W26 headline test added; cross-run trigger-skip floor raised 241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aming)

Addresses the P2 review finding on PR #224. The accuracy scorer uses `labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26 decided rows have `verify_verdict=insufficient_evidence` (only the cold-start `head_decision` is review_required). So W26 adds NO scored review_required cases; the effective verdict mix stays IE-dominated. The earlier "first real review_required / non-IE decided rows" framing was misleading.
- README: table note + W26 findings now state the cold-start head_decision is review_required but the verify-effective (scored) verdict is IE for all 6, so the scored mix is unchanged; net finding = even the best app/toolkit repo's decided rows are effectively IE once verified.
- test_w26_headline: assert BOTH head_decision == review_required AND effective_verdict == insufficient_evidence, so the docs can't drift back to the "non-IE" framing.

No data regenerated; verdicts unchanged. Full suite + ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
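The scoring rule quoted in this commit message, `verify_verdict or head_decision`, can be sketched as follows. The `Row` dataclass and the sample rows are hypothetical stand-ins for the committed data, not the repo's actual types; only the fallback expression itself comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Hypothetical stand-in for one decided row in the mined results."""
    head_decision: str
    verify_verdict: Optional[str] = None


def effective_verdict(row: Row) -> str:
    """The verify verdict wins when present; otherwise fall back to head_decision."""
    return row.verify_verdict or row.head_decision


# Six rows shaped like the W26 decided rows described above.
w26_like_rows = [
    Row(head_decision="review_required", verify_verdict="insufficient_evidence")
    for _ in range(6)
]
scored = [effective_verdict(row) for row in w26_like_rows]

# A row without a verify receipt keeps its cold-start decision.
unverified = effective_verdict(Row(head_decision="review_required"))
```

Under these assumptions every entry in `scored` is `insufficient_evidence`, which is why the scored mix stays IE-dominated even though each `head_decision` is `review_required`.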
Continues the verdict-accuracy evidence base (roadmap Phase 3) by mining agent application/toolkit repos — the W25 lesson was that framework cores yield ~0 decided rows.
Run
`stripe/agent-toolkit` + `block/goose` + `pydantic/pydantic-ai`, latest 40 merged PRs each (120 rows), committed under `benchmark/miner/results/2026-W26-mined.*` with a blank labeling worksheet.
Findings
- App/toolkit > cores, but thin. `stripe/agent-toolkit` gave 6 decided rows (15%); `block/goose` + `pydantic/pydantic-ai` (a framework core) gave 0. The 6 have a cold-start `head_decision` of `review_required`, but the per-PR `verify` receipt, which is the verdict the accuracy scorer uses (`labels.effective_verdict = verify_verdict or head_decision`), is `insufficient_evidence` for all 6. So W26 adds no scored `review_required` cases; the effective verdict mix stays IE-dominated. The 6 also collapse to one pattern (a repeated automated "sync skills from docs.stripe.com" bot PR), so decided diversity added ≈ 1. Net: even the best app/toolkit repo's decided rows are effectively IE once verified.
- Zero `must_block`. The aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided, and none of the 15 are `must_block`. This reconfirms the standing conclusion: `must_block` positives must come from the constructed-adversarial stratum, not real-history mining.
- Validates the #223 `tools_scanned` fix on real data. Every W26 decided row records the IE-ratio denominator (`tools_scanned`); W24/W25 predate the fix and remain null. Pinned by `test_w26_headline_numbers_reproduce_from_committed_data`; noted in `CALIBRATION.md`.
- Engine-robustness bug found (chipped, not fixed here). `block/goose`'s OpenAPI spec crashes `scan` with `Config error: Duplicate action_surface action_id` (4 `scan_failed` rows): the OpenAPI action_id is built from method+path without the operationId, so two operations on `GET /sessions/{session_id}` collide. A third-party spec must fail soft, not hard-crash. This is the same class as #212 (Add the merged-PR history miner + first real run (WS-P)) and #214 (init: never write a tool source that scan's input adapters reject).
Guard
- The W26 headline test asserts both the cold-start `head_decision` (`review_required`) and the scored `effective_verdict` (`insufficient_evidence`), plus `tools_scanned` populated, so the docs can't drift back to a "non-IE" framing.
- Cross-run trigger-skip floor raised 241 → 361.
No product surface, no schema change. Full suite + ruff clean.
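The headline percentages quoted in this PR can be re-derived from the stated counts. This is only an arithmetic check on figures from the description; the variable names and the `pct` helper are hypothetical and do not mirror `test_w26_headline_numbers_reproduce_from_committed_data`, which reads the committed result files.

```python
# Counts as stated in the PR description.
W26_REPOS = 3
W26_PRS_PER_REPO = 40
STRIPE_DECIDED = 6
AGG_PRS = 361
AGG_TRIGGER_SKIP = 336


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


headline = {
    "w26_rows": W26_REPOS * W26_PRS_PER_REPO,                      # 3 repos x 40 PRs
    "stripe_decided_pct": pct(STRIPE_DECIDED, W26_PRS_PER_REPO),   # 6 of 40
    "agg_trigger_skip_pct": pct(AGG_TRIGGER_SKIP, AGG_PRS),        # 336 of 361
}
```

A guard of this shape only confirms that the quoted figures are internally consistent; pinning them to the committed data is what the real test adds.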
🤖 Generated with Claude Code