
Deepen real-history mining: 2026-W26 over agent apps/toolkits #224

Open
pengfei-threemoonslab wants to merge 2 commits into main from claude/w26-deepen-mining

pengfei-threemoonslab commented Jun 16, 2026


Continues the verdict-accuracy evidence base (roadmap Phase 3) by mining agent application/toolkit repos — the W25 lesson was that framework cores yield ~0 decided rows.

Run

stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, latest 40 merged PRs each (120 rows), committed under benchmark/miner/results/2026-W26-mined.* with a blank labeling worksheet.

Findings

  • App/toolkit > cores, but thin and effectively IE (insufficient_evidence). stripe/agent-toolkit gave 6 decided rows (15% of its 40 PRs); block/goose and pydantic/pydantic-ai (the latter a framework core) gave 0. The 6 have a cold-start head_decision of review_required, but the per-PR verify receipt is insufficient_evidence for all 6, and that receipt is the verdict the accuracy scorer uses (labels.effective_verdict = verify_verdict or head_decision). So W26 adds no scored review_required cases; the effective verdict mix stays IE-dominated. The 6 also collapse to one pattern (a repeated automated "sync skills from docs.stripe.com" bot PR), so decided diversity added ≈ 1. Net: even the best app/toolkit repo's decided rows are effectively IE once verified.
  • Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided — and zero of the 15 are must_block. This reconfirms the standing conclusion: must_block positives must come from the constructed-adversarial stratum, not real-history mining.
  • Validates the #223 tools_scanned fix ("IE-threshold: fix tools_scanned capture, examine + hold the constants") on real data. Every W26 decided row records the IE-ratio denominator (tools_scanned); W24/W25 predate the fix and remain null. Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md.
  • Engine-robustness bug found (chipped, not fixed here). block/goose's OpenAPI spec crashes scan with Config error: Duplicate action_surface action_id (4 scan_failed rows): the OpenAPI action_id is built from method+path without the operationId, so two operations on GET /sessions/{session_id} collide. A third-party spec must fail soft, not hard-crash; this is the same class as #212 ("Add the merged-PR history miner + first real run (WS-P)") and #214 ("init: never write a tool source that scan's input adapters reject").
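
The action_id collision in the last finding can be illustrated with a minimal sketch. This is not the engine's code: the operation list, the operationId values, and the helper names (action_id_from_method_path, action_id_with_operation_id, build_action_surface) are hypothetical, and the "#operationId" suffix is only one possible way to disambiguate. It shows how a method+path id collides and how a builder could record a warning instead of crashing.

```python
# Hypothetical operations; the operationId values are made up for illustration.
OPERATIONS = [
    {"method": "GET", "path": "/sessions/{session_id}", "operationId": "get_session"},
    {"method": "GET", "path": "/sessions/{session_id}", "operationId": "get_session_events"},
    {"method": "POST", "path": "/sessions", "operationId": "create_session"},
]


def action_id_from_method_path(op):
    """The colliding scheme described in the finding: method + path only."""
    return f"{op['method']} {op['path']}"


def action_id_with_operation_id(op):
    """One possible disambiguation (assumed): append operationId when present."""
    base = f"{op['method']} {op['path']}"
    op_id = op.get("operationId")
    return f"{base}#{op_id}" if op_id else base


def build_action_surface(ops, make_id):
    """Fail soft: keep the first operation per id and report the rest as warnings."""
    surface, warnings = {}, []
    for op in ops:
        action_id = make_id(op)
        if action_id in surface:
            warnings.append(f"duplicate action_id skipped: {action_id}")
            continue
        surface[action_id] = op
    return surface, warnings


# Method + path only: the two GET operations collide, one is skipped with a warning.
surface, warnings = build_action_surface(OPERATIONS, action_id_from_method_path)
print(len(surface), warnings)

# With operationId appended: all three operations get distinct ids.
surface, warnings = build_action_surface(OPERATIONS, action_id_with_operation_id)
print(len(surface), warnings)
```

Whether a duplicate should be skipped, merged, or surfaced as a scan_failed row with a reason is a design choice this sketch does not settle; the point is only that the duplicate is reported rather than raised.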

Guard

  • W26 headline test asserts both the cold-start head_decision (review_required) and the scored effective_verdict (insufficient_evidence), plus tools_scanned populated — so the docs can't drift back to a "non-IE" framing.
  • Cross-run trigger-skip floor raised 241 → 361.
  • New corpus files are LF-only.
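
As a rough illustration of what the headline guard pins, the sketch below applies the quoted rule (effective_verdict = verify_verdict or head_decision) to six rows shaped like the W26 decided rows and rechecks the percentages quoted in the findings. The row layout and the tools_scanned placeholder value are assumptions for illustration; the real test reads the committed corpus files.

```python
def effective_verdict(row):
    """The scored verdict: the verify receipt if present, else the cold-start decision."""
    return row.get("verify_verdict") or row["head_decision"]


# Six illustrative rows mirroring the W26 description (field layout assumed).
# tools_scanned=1 is a placeholder, not a value from the corpus.
w26_decided = [
    {
        "head_decision": "review_required",
        "verify_verdict": "insufficient_evidence",
        "tools_scanned": 1,
    }
    for _ in range(6)
]

for row in w26_decided:
    # Both framings are asserted so neither can silently replace the other.
    assert row["head_decision"] == "review_required"
    assert effective_verdict(row) == "insufficient_evidence"
    assert row["tools_scanned"] is not None

# A row with no verify receipt falls back to the cold-start decision.
unverified = {"head_decision": "review_required", "verify_verdict": None}
assert effective_verdict(unverified) == "review_required"

# Headline percentages from the findings above.
assert round(100 * 6 / 40) == 15      # stripe/agent-toolkit decided rate
assert round(100 * 336 / 361) == 93   # aggregate trigger-skip rate
```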

No product surface, no schema change. Full suite + ruff clean.
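
The LF-only guard listed above could be checked along these lines. This is a sketch, not the repository's test; has_crlf and non_lf_files are hypothetical helper names.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """True if the bytes contain any carriage return (CRLF or a bare CR)."""
    return b"\r" in data


def non_lf_files(root, pattern):
    """Return the files under root matching pattern that are not LF-only."""
    return sorted(
        str(path)
        for path in Path(root).glob(pattern)
        if path.is_file() and has_crlf(path.read_bytes())
    )
```

Run from the repository root, non_lf_files("benchmark/miner/results", "2026-W26-mined.*") would be expected to return an empty list if the new corpus files are LF-only.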

🤖 Generated with Claude Code

pengfei-threemoonslab and others added 2 commits June 16, 2026 15:34
Continues the verdict-accuracy evidence base by mining agent
application/toolkit repos (the W25 lesson: framework cores yield ~0 decided).

Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each
(120 rows). Results committed under benchmark/miner/results/2026-W26-mined.*
+ a blank labeling worksheet.

Findings:
- App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows
  (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first
  real `review_required` decided rows (cores only ever yielded IE), but they
  collapse to one pattern (a repeated automated skill-sync bot PR) — decided
  *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%)
  trigger-skip / 15 decided, and ZERO of the 15 are must_block — reconfirming
  must_block positives must come from the constructed stratum.
- Validates the #223 tools_scanned capture fix on REAL data: every W26 decided
  row records the ratio denominator (W24/W25 predate the fix, still null).
  Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in
  CALIBRATION.md.
- Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI
  spec crashes scan with `Config error: Duplicate action_surface action_id`
  (4 scan_failed) — action_id is method+path without operationId, so two ops
  on GET /sessions/{session_id} collide. A third-party spec must fail soft.

Corpus guard: W26 headline test added; cross-run trigger-skip floor raised
241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aming)

Addresses the P2 review finding on PR #224. The accuracy scorer uses
`labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26
decided rows have `verify_verdict=insufficient_evidence` (only the cold-start
`head_decision` is review_required). So W26 adds NO scored review_required
cases — the effective verdict mix stays IE-dominated. The earlier "first real
review_required / non-IE decided rows" framing was misleading.

- README: table note + W26 findings now state the cold-start head_decision is
  review_required but the verify-effective (scored) verdict is IE for all 6, so
  the scored mix is unchanged; net finding = even the best app/toolkit repo's
  decided rows are effectively IE once verified.
- test_w26_headline: assert BOTH head_decision == review_required AND
  effective_verdict == insufficient_evidence, so the docs can't drift back to
  the "non-IE" framing.

No data regenerated; verdicts unchanged. Full suite + ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>