IE-threshold: fix tools_scanned capture, examine + hold the constants by pengfei-threemoonslab · Pull Request #223 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-06-16T21:29:51Z

Closes out the two Phase 2c follow-ups — with the surface-discipline gate applied, not presumed.

1. IE-threshold calibration → HOLD `0.5` / `3` (examined, not unexamined)

_LOW_CONFIDENCE_TOOL_RATIO=0.5 / _MAX_TOLERATED_SOURCE_WARNINGS=3 shipped in v0.14 and were never tuned. I tried to calibrate them against the corpus. The data cannot justify a change:

Real corpus (results/2026-W24/W25): 241 PRs, 9 decided, IE-dominated, unlabeled (the LABELING.md human pass is unstarted), and the ratio denominator was never captured (see bug below).
Constructed labeled corpus: only 1 of 7 fixtures (openai_agents_sdk_agent) exercises the IE threshold, and it sits at the robust extreme — every tool low-confidence, ratio 1.0 — a single point that can't distinguish 0.3 vs 0.5 vs 0.7. The source-warning constant is exercised by no labeled case.

Full analysis, the measured evidence table, and the revisit conditions are in the new benchmark/miner/CALIBRATION.md. The constants' docstring now points there ("examined and held") instead of being silent.

Why not build a calibration harness? test_miner_constructed.py already re-runs the fixtures and guards their verdicts — a separate calibrate command processing one meaningful data point would be redundant marginal infra, exactly the discipline the repo enforces. Instead I pinned the single labeled threshold point directly.

2. `tools_scanned` capture bug (found while calibrating)

evaluate.py read the total tool count from summary — but ReportSummary carries no tool count, so tools_scanned was null on every mined row, blanking the ratio denominator. Fixed to read tool_surface.total_tools (with tool_inventory length fallback) and refactored the head-report parse into the pure, unit-tested _record_head_report. (The committed corpus predates the fix; future mines populate it.)

3. `extraction_coverage` ratio report field → CONSIDERED AND DECLINED

Gate analysis (recorded in CALIBRATION.md): moves no headline metric (the IE rate is moved by the threshold, not by exposing the ratio); fully derivable from fields the report already carries (evidence_coverage.low_confidence_tool_count + tool_surface.total_tools, and evidence_gaps[] already enumerates each gap); "legibility/completeness" is a rejected justification and a new field is a schema bump. No code added.

Guards

test_ie_threshold_is_exercised_and_robust_on_the_labeled_coverage_fixture — pins the one labeled IE point to the live constant; an extraction improvement that resolves the surface (or a threshold edit) surfaces in CI.
test_record_head_report_* — lock the tools_scanned capture fix.

No product surface added; no schema bump; full suite + ruff + schema-drift clean.

🤖 Generated with Claude Code

Closes out the two Phase 2c follow-ups (extraction_coverage ratio surface + IE-threshold calibration) — both with the surface-discipline gate applied rather than presumed. IE-threshold calibration (benchmark/miner/CALIBRATION.md). Attempted to calibrate `_LOW_CONFIDENCE_TOOL_RATIO=0.5` / `_MAX_TOLERATED_SOURCE_WARNINGS=3` (shipped v0.14, never tuned) against the corpus. Finding: the data cannot justify a change. - Real corpus: 241 PRs, 9 decided, IE-dominated, UNLABELED (the human pass is unstarted), and the threshold's ratio denominator was never captured. - Constructed labeled corpus: only 1 of 7 fixtures (openai_agents_sdk_agent) exercises the IE threshold, and it sits at the robust extreme (every tool low-confidence, ratio 1.0) — a single point that cannot distinguish 0.3 vs 0.5 vs 0.7. The source-warning constant is exercised by no labeled case. Decision: HOLD 0.5/3, now examined + guarded instead of unexamined. Revisit conditions (labeling pass + re-mine) documented in CALIBRATION.md. tools_scanned capture bug (real bug, found while calibrating). evaluate.py read the total tool count from `summary` (ReportSummary carries no tool count) so it was null on every mined row, blanking the ratio denominator. Fixed: read `tool_surface.total_tools` (tool_inventory length fallback), refactored the head -report parse into the pure, unit-tested `_record_head_report`. Committed corpus predates the fix and still shows null — future mines populate it. Guards: - tests/test_miner_constructed.py: pins the one labeled IE point to the live `_LOW_CONFIDENCE_TOOL_RATIO`, so an extraction improvement that resolves the dynamic surface — or a threshold edit — surfaces in CI. - tests/test_miner.py: lock the tools_scanned capture fix. extraction_coverage ratio report field — CONSIDERED AND DECLINED (gate analysis in CALIBRATION.md). Moves no headline metric (the IE rate is moved by the threshold, not by exposing the ratio); fully derivable from existing report fields (evidence_coverage.low_confidence_tool_count + tool_surface.total_tools); "legibility/completeness" is a rejected justification and a new field is a schema bump. No code added. No product surface added; no schema bump; full suite + ruff + schema-drift clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Addresses the P3 review finding on PR #223. The fixture openai_agents_sdk_agent has low == total == 2, so `low >= threshold` holds for ratio 0.3/0.5/0.7/1.0 alike — that assertion never catches a threshold edit, so claiming (in CALIBRATION.md, the test docstring, and the constants comment) that a threshold edit "surfaces in CI" was overstated. Fix — two guards with distinct, accurately-stated jobs: - Threshold edits: new `test_ie_threshold_constants_are_frozen` (tests/test_release_decision.py) asserts the constants equal 0.5 / 3. Mutation-checked: changing the ratio to 0.3 fails it. This is what actually makes a threshold edit surface in CI, forcing the deliberate recalibration path (update CALIBRATION.md + the labeling/re-mine prerequisites). - Extraction regressions: the fixture test keeps its real job — re-running the one labeled IE case so an extraction improvement that resolves the dynamic surface flips the verdict and fails. Its docstring now says it does NOT guard threshold edits (the point sits at ratio 1.0, robust across (0,1]). Reworded the matching claims in benchmark/miner/CALIBRATION.md and the release_decision.py constants comment to point at the freeze guard for edits and the fixture guard for extraction. Full suite + ruff clean; no schema/behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Deepen real-history mining: 2026-W26 over agent apps/toolkits Continues the verdict-accuracy evidence base by mining agent application/toolkit repos (the W25 lesson: framework cores yield ~0 decided). Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each (120 rows). Results committed under benchmark/miner/results/2026-W26-mined.* + a blank labeling worksheet. Findings: - App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first real `review_required` decided rows (cores only ever yielded IE), but they collapse to one pattern (a repeated automated skill-sync bot PR) — decided *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%) trigger-skip / 15 decided, and ZERO of the 15 are must_block — reconfirming must_block positives must come from the constructed stratum. - Validates the #223 tools_scanned capture fix on REAL data: every W26 decided row records the ratio denominator (W24/W25 predate the fix, still null). Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in CALIBRATION.md. - Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI spec crashes scan with `Config error: Duplicate action_surface action_id` (4 scan_failed) — action_id is method+path without operationId, so two ops on GET /sessions/{session_id} collide. A third-party spec must fail soft. Corpus guard: W26 headline test added; cross-run trigger-skip floor raised 241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * W26 review: scored verdict is IE, not review_required (correct the framing) Addresses the P2 review finding on PR #224. The accuracy scorer uses `labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26 decided rows have `verify_verdict=insufficient_evidence` (only the cold-start `head_decision` is review_required). So W26 adds NO scored review_required cases — the effective verdict mix stays IE-dominated. The earlier "first real review_required / non-IE decided rows" framing was misleading. - README: table note + W26 findings now state the cold-start head_decision is review_required but the verify-effective (scored) verdict is IE for all 6, so the scored mix is unchanged; net finding = even the best app/toolkit repo's decided rows are effectively IE once verified. - test_w26_headline: assert BOTH head_decision == review_required AND effective_verdict == insufficient_evidence, so the docs can't drift back to the "non-IE" framing. No data regenerated; verdicts unchanged. Full suite + ruff clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * W26 docs: correct helper name to labels.effective_verdict Addresses the P3 review nit on PR #224. The README referenced `labels._effective_verdict`; the actual helper (and the test) use `effective_verdict` (no leading underscore). Doc-only.

pengfei-threemoonslab and others added 2 commits June 16, 2026 14:29

pengfei-threemoonslab merged commit 5b614ab into main Jun 16, 2026
2 checks passed

pengfei-threemoonslab deleted the claude/ie-threshold-calibration branch June 16, 2026 22:28

pengfei-threemoonslab mentioned this pull request Jun 16, 2026

Deepen real-history mining: 2026-W26 over agent apps/toolkits #224

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IE-threshold: fix tools_scanned capture, examine + hold the constants#223

IE-threshold: fix tools_scanned capture, examine + hold the constants#223
pengfei-threemoonslab merged 2 commits into
mainfrom
claude/ie-threshold-calibration

pengfei-threemoonslab commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pengfei-threemoonslab commented Jun 16, 2026

1. IE-threshold calibration → HOLD 0.5 / 3 (examined, not unexamined)

2. tools_scanned capture bug (found while calibrating)

3. extraction_coverage ratio report field → CONSIDERED AND DECLINED

Guards

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. IE-threshold calibration → HOLD `0.5` / `3` (examined, not unexamined)

2. `tools_scanned` capture bug (found while calibrating)

3. `extraction_coverage` ratio report field → CONSIDERED AND DECLINED