IE-threshold: fix tools_scanned capture, examine + hold the constants #223
Merged
Closes out the two Phase 2c follow-ups (extraction_coverage ratio surface + IE-threshold calibration), both with the surface-discipline gate applied rather than presumed.

IE-threshold calibration (benchmark/miner/CALIBRATION.md). Attempted to calibrate `_LOW_CONFIDENCE_TOOL_RATIO=0.5` / `_MAX_TOLERATED_SOURCE_WARNINGS=3` (shipped v0.14, never tuned) against the corpus. Finding: the data cannot justify a change.
- Real corpus: 241 PRs, 9 decided, IE-dominated, UNLABELED (the human pass is unstarted), and the threshold's ratio denominator was never captured.
- Constructed labeled corpus: only 1 of 7 fixtures (openai_agents_sdk_agent) exercises the IE threshold, and it sits at the robust extreme (every tool low-confidence, ratio 1.0), a single point that cannot distinguish 0.3 vs 0.5 vs 0.7. The source-warning constant is exercised by no labeled case.

Decision: HOLD 0.5/3, now examined + guarded instead of unexamined. Revisit conditions (labeling pass + re-mine) documented in CALIBRATION.md.

tools_scanned capture bug (real bug, found while calibrating). evaluate.py read the total tool count from `summary` (ReportSummary carries no tool count) so it was null on every mined row, blanking the ratio denominator. Fixed: read `tool_surface.total_tools` (tool_inventory length fallback), refactored the head-report parse into the pure, unit-tested `_record_head_report`. Committed corpus predates the fix and still shows null; future mines populate it.

Guards:
- tests/test_miner_constructed.py: pins the one labeled IE point to the live `_LOW_CONFIDENCE_TOOL_RATIO`, so an extraction improvement that resolves the dynamic surface, or a threshold edit, surfaces in CI.
- tests/test_miner.py: lock the tools_scanned capture fix.

extraction_coverage ratio report field: CONSIDERED AND DECLINED (gate analysis in CALIBRATION.md). Moves no headline metric (the IE rate is moved by the threshold, not by exposing the ratio); fully derivable from existing report fields (evidence_coverage.low_confidence_tool_count + tool_surface.total_tools); "legibility/completeness" is a rejected justification and a new field is a schema bump. No code added.

No product surface added; no schema bump; full suite + ruff + schema-drift clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
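The capture fix described above can be sketched as follows. This is a minimal illustration, not the actual `_record_head_report`: the helper name `tools_scanned_from_report` is hypothetical, the report is assumed to be a plain dict, and `tool_inventory` is assumed to be a top-level list.

```python
from typing import Any, Optional


def tools_scanned_from_report(report: dict[str, Any]) -> Optional[int]:
    """Return the total tool count for a head report, or None if unknown.

    Hypothetical helper: key names follow the commit message, but the real
    report schema and the signature of `_record_head_report` may differ.
    """
    # Preferred source: tool_surface.total_tools (not `summary`, which
    # carries no tool count and therefore always produced null).
    tool_surface = report.get("tool_surface") or {}
    total = tool_surface.get("total_tools")
    if isinstance(total, int):
        return total

    # Fallback: count inventory entries when total_tools is absent.
    # Assumption: tool_inventory is a top-level list.
    inventory = report.get("tool_inventory")
    if isinstance(inventory, list):
        return len(inventory)

    return None
```

Keeping the parse a pure function of the report dict is what makes it straightforward to unit-test without running a mine.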
Addresses the P3 review finding on PR #223. The fixture openai_agents_sdk_agent has low == total == 2, so `low >= threshold` holds for ratio 0.3/0.5/0.7/1.0 alike. That assertion never catches a threshold edit, so claiming (in CALIBRATION.md, the test docstring, and the constants comment) that a threshold edit "surfaces in CI" was overstated.

Fix: two guards with distinct, accurately-stated jobs:
- Threshold edits: new `test_ie_threshold_constants_are_frozen` (tests/test_release_decision.py) asserts the constants equal 0.5 / 3. Mutation-checked: changing the ratio to 0.3 fails it. This is what actually makes a threshold edit surface in CI, forcing the deliberate recalibration path (update CALIBRATION.md + the labeling/re-mine prerequisites).
- Extraction regressions: the fixture test keeps its real job, re-running the one labeled IE case so an extraction improvement that resolves the dynamic surface flips the verdict and fails. Its docstring now says it does NOT guard threshold edits (the point sits at ratio 1.0, robust across (0,1]).

Reworded the matching claims in benchmark/miner/CALIBRATION.md and the release_decision.py constants comment to point at the freeze guard for edits and the fixture guard for extraction.

Full suite + ruff clean; no schema/behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
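To make the review finding concrete, here is a small sketch of why the fixture point cannot detect a threshold edit and why a freeze test can. The comparison shape in `is_ie_by_ratio` is an assumption (the real check lives in release_decision.py and may differ), the constants are local stand-ins, and the 1-of-2 point is hypothetical, not a real fixture.

```python
# Local stand-ins for the constants in release_decision.py.
_LOW_CONFIDENCE_TOOL_RATIO = 0.5
_MAX_TOLERATED_SOURCE_WARNINGS = 3


def is_ie_by_ratio(low: int, total: int, ratio: float) -> bool:
    """Assumed shape of the ratio check: low-confidence share >= ratio."""
    return total > 0 and low / total >= ratio


candidate_ratios = (0.3, 0.5, 0.7, 1.0)

# The labeled fixture: low == total == 2, so the share is 1.0 and every
# candidate ratio yields the same verdict.
fixture_verdicts = [is_ie_by_ratio(2, 2, r) for r in candidate_ratios]

# A hypothetical mid-range point (1 of 2 tools low-confidence) would
# separate the candidates; no such labeled case exists in the corpus.
mid_verdicts = [is_ie_by_ratio(1, 2, r) for r in candidate_ratios]


def test_ie_threshold_constants_are_frozen() -> None:
    """Freeze guard: any edit to the constants fails here first."""
    assert _LOW_CONFIDENCE_TOOL_RATIO == 0.5
    assert _MAX_TOLERATED_SOURCE_WARNINGS == 3
```

Because the fixture verdict is identical for every candidate, only the direct equality check on the constants responds to a mutation such as changing the ratio to 0.3.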
pengfei-threemoonslab added a commit that referenced this pull request Jun 17, 2026
* Deepen real-history mining: 2026-W26 over agent apps/toolkits

  Continues the verdict-accuracy evidence base by mining agent application/toolkit repos (the W25 lesson: framework cores yield ~0 decided). Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each (120 rows). Results committed under benchmark/miner/results/2026-W26-mined.* + a blank labeling worksheet.

  Findings:
  - App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first real `review_required` decided rows (cores only ever yielded IE), but they collapse to one pattern (a repeated automated skill-sync bot PR), so decided *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided, and ZERO of the 15 are must_block, reconfirming that must_block positives must come from the constructed stratum.
  - Validates the #223 tools_scanned capture fix on REAL data: every W26 decided row records the ratio denominator (W24/W25 predate the fix, still null). Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md.
  - Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI spec crashes scan with `Config error: Duplicate action_surface action_id` (4 scan_failed). action_id is method+path without operationId, so two ops on GET /sessions/{session_id} collide. A third-party spec must fail soft.

  Corpus guard: W26 headline test added; cross-run trigger-skip floor raised 241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* W26 review: scored verdict is IE, not review_required (correct the framing)

  Addresses the P2 review finding on PR #224. The accuracy scorer uses `labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26 decided rows have `verify_verdict=insufficient_evidence` (only the cold-start `head_decision` is review_required). So W26 adds NO scored review_required cases: the effective verdict mix stays IE-dominated. The earlier "first real review_required / non-IE decided rows" framing was misleading.
  - README: table note + W26 findings now state the cold-start head_decision is review_required but the verify-effective (scored) verdict is IE for all 6, so the scored mix is unchanged; net finding = even the best app/toolkit repo's decided rows are effectively IE once verified.
  - test_w26_headline: assert BOTH head_decision == review_required AND effective_verdict == insufficient_evidence, so the docs can't drift back to the "non-IE" framing.

  No data regenerated; verdicts unchanged. Full suite + ruff clean.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* W26 docs: correct helper name to labels.effective_verdict

  Addresses the P3 review nit on PR #224. The README referenced `labels._effective_verdict`; the actual helper (and the test) use `effective_verdict` (no leading underscore). Doc-only.
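The scoring rule behind the corrected framing (`verify_verdict or head_decision`) can be sketched like this. The function signature and row layout are assumptions for illustration; the real `labels.effective_verdict` may take a row object instead, and the six rows below only mirror the verdict values stated in the commit message, not the committed data.

```python
from typing import Optional


def effective_verdict(
    verify_verdict: Optional[str], head_decision: Optional[str]
) -> Optional[str]:
    """Scored verdict: the verify verdict wins whenever it is present."""
    return verify_verdict or head_decision


# Illustrative stand-ins for the 6 W26 decided rows.
w26_decided = [
    {"head_decision": "review_required", "verify_verdict": "insufficient_evidence"}
    for _ in range(6)
]

scored = [
    effective_verdict(row["verify_verdict"], row["head_decision"])
    for row in w26_decided
]

# Only a row with no verify verdict falls back to the cold-start decision.
cold_start_only = effective_verdict(None, "review_required")
```

Under this rule every one of the six rows scores as `insufficient_evidence`, which is why the headline test asserts both fields rather than the head decision alone.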
Closes out the two Phase 2c follow-ups, with the surface-discipline gate applied, not presumed.

1. IE-threshold calibration → HOLD `0.5`/`3` (examined, not unexamined)

`_LOW_CONFIDENCE_TOOL_RATIO=0.5` / `_MAX_TOLERATED_SOURCE_WARNINGS=3` shipped in v0.14 and were never tuned. I tried to calibrate them against the corpus. The data cannot justify a change:

- Real corpus (`results/2026-W24/W25`): 241 PRs, 9 decided, IE-dominated, unlabeled (the `LABELING.md` human pass is unstarted), and the ratio denominator was never captured (see bug below).
- Constructed labeled corpus: only 1 of 7 fixtures (`openai_agents_sdk_agent`) exercises the IE threshold, and it sits at the robust extreme (every tool low-confidence, ratio 1.0), a single point that can't distinguish `0.3` vs `0.5` vs `0.7`. The source-warning constant is exercised by no labeled case.

Full analysis, the measured evidence table, and the revisit conditions are in the new `benchmark/miner/CALIBRATION.md`. The constants' docstring now points there ("examined and held") instead of being silent.

Why not build a calibration harness? `test_miner_constructed.py` already re-runs the fixtures and guards their verdicts; a separate `calibrate` command processing one meaningful data point would be redundant marginal infra, exactly what the discipline the repo enforces is meant to prevent. Instead I pinned the single labeled threshold point directly.

2. `tools_scanned` capture bug (found while calibrating)

`evaluate.py` read the total tool count from `summary`, but `ReportSummary` carries no tool count, so `tools_scanned` was null on every mined row, blanking the ratio denominator. Fixed to read `tool_surface.total_tools` (with `tool_inventory` length fallback) and refactored the head-report parse into the pure, unit-tested `_record_head_report`. (The committed corpus predates the fix; future mines populate it.)

3. `extraction_coverage` ratio report field → CONSIDERED AND DECLINED

Gate analysis (recorded in CALIBRATION.md): moves no headline metric (the IE rate is moved by the threshold, not by exposing the ratio); fully derivable from fields the report already carries (`evidence_coverage.low_confidence_tool_count` + `tool_surface.total_tools`, and `evidence_gaps[]` already enumerates each gap); "legibility/completeness" is a rejected justification and a new field is a schema bump. No code added.

Guards

- `test_ie_threshold_is_exercised_and_robust_on_the_labeled_coverage_fixture`: pins the one labeled IE point to the live constant; an extraction improvement that resolves the surface (or a threshold edit) surfaces in CI.
- `test_record_head_report_*`: lock the `tools_scanned` capture fix.

No product surface added; no schema bump; full suite + ruff + schema-drift clean.
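As an illustration of the "fully derivable" argument for declining the new field, a consumer could compute the ratio from the two existing fields. This is a sketch under assumptions: the function name is hypothetical, the report is treated as a plain dict, and the ratio is taken to be the low-confidence share (a coverage figure would be its complement under one plausible definition).

```python
from typing import Any, Optional


def low_confidence_ratio(report: dict[str, Any]) -> Optional[float]:
    """Derive the low-confidence tool share from existing report fields.

    Hypothetical consumer-side helper; returns None when either field is
    missing or the tool count is zero, so no division by zero occurs.
    """
    evidence = report.get("evidence_coverage") or {}
    surface = report.get("tool_surface") or {}
    low = evidence.get("low_confidence_tool_count")
    total = surface.get("total_tools")
    if low is None or not total:
        return None
    return low / total


# Values mirroring the labeled fixture described above (2 of 2 tools low).
example_report = {
    "evidence_coverage": {"low_confidence_tool_count": 2},
    "tool_surface": {"total_tools": 2},
}
example_ratio = low_confidence_ratio(example_report)
```

Since the value is a two-field division, exposing it as a report field would add a schema bump without adding information.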
🤖 Generated with Claude Code