IE-threshold: fix tools_scanned capture, examine + hold the constants #223

Merged
pengfei-threemoonslab merged 2 commits into main from claude/ie-threshold-calibration
Jun 16, 2026

@pengfei-threemoonslab

Contributor

Closes out the two Phase 2c follow-ups — with the surface-discipline gate applied, not presumed.

1. IE-threshold calibration → HOLD 0.5 / 3 (examined, not unexamined)

_LOW_CONFIDENCE_TOOL_RATIO=0.5 / _MAX_TOLERATED_SOURCE_WARNINGS=3 shipped in v0.14 and were never tuned. I tried to calibrate them against the corpus. The data cannot justify a change:

  • Real corpus (results/2026-W24/W25): 241 PRs, 9 decided, IE-dominated, unlabeled (the LABELING.md human pass is unstarted), and the ratio denominator was never captured (see bug below).
  • Constructed labeled corpus: only 1 of 7 fixtures (openai_agents_sdk_agent) exercises the IE threshold, and it sits at the robust extreme — every tool low-confidence, ratio 1.0 — a single point that can't distinguish 0.3 vs 0.5 vs 0.7. The source-warning constant is exercised by no labeled case.
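To make the "single point" argument concrete, here is a minimal sketch. It assumes the IE check has the shape `low >= ratio * total`; the helper `trips_ie` and the 2-of-5 data point are hypothetical and only illustrate why a ratio-1.0 fixture cannot separate candidate thresholds.

```python
# Hypothetical stand-in for the IE ratio check; the real comparison lives in
# the project's release-decision code and may differ in detail.
def trips_ie(low_confidence: int, total: int, ratio: float) -> bool:
    """Return True when the low-confidence share meets the ratio threshold."""
    if total == 0:
        return False  # assumption: an empty tool surface never trips the check
    return low_confidence >= ratio * total


CANDIDATES = (0.3, 0.5, 0.7)

# The labeled fixture: every tool is low-confidence (2 of 2, ratio 1.0).
fixture = [trips_ie(2, 2, r) for r in CANDIDATES]

# A made-up mid-range point (2 of 5, ratio 0.4); not taken from the corpus.
mid_range = [trips_ie(2, 5, r) for r in CANDIDATES]

print(fixture)    # [True, True, True]
print(mid_range)  # [True, False, False]
```

Only a point like the second one, with a ratio between the candidate thresholds, would give calibration any signal, and the labeled corpus contains none.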

Full analysis, the measured evidence table, and the revisit conditions are in the new benchmark/miner/CALIBRATION.md. The constants' docstring now points there ("examined and held") instead of being silent.

Why not build a calibration harness? test_miner_constructed.py already re-runs the fixtures and guards their verdicts, so a separate calibrate command processing one meaningful data point would be redundant, marginal infrastructure: exactly what the repo's surface discipline exists to reject. Instead I pinned the single labeled threshold point directly.

2. tools_scanned capture bug (found while calibrating)

evaluate.py read the total tool count from summary — but ReportSummary carries no tool count, so tools_scanned was null on every mined row, blanking the ratio denominator. Fixed to read tool_surface.total_tools (with tool_inventory length fallback) and refactored the head-report parse into the pure, unit-tested _record_head_report. (The committed corpus predates the fix; future mines populate it.)
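The lookup order described above can be sketched as a small pure function. This is an illustration only: `tools_scanned_from_report` is a hypothetical name (the real helper is `_record_head_report`, whose signature is not shown here), and placing `tool_inventory` at the top level of the report is an assumption.

```python
from typing import Any, Optional


def tools_scanned_from_report(report: dict[str, Any]) -> Optional[int]:
    """Hypothetical helper: resolve the total tool count from a head report."""
    # Preferred source: tool_surface.total_tools.
    surface = report.get("tool_surface") or {}
    total = surface.get("total_tools")
    if isinstance(total, int):
        return total

    # Fallback: the length of tool_inventory (assumed to be a top-level list).
    inventory = report.get("tool_inventory")
    if isinstance(inventory, list):
        return len(inventory)

    # `summary` carries no tool count, so there is nothing left to read.
    return None


print(tools_scanned_from_report({"tool_surface": {"total_tools": 2}}))  # 2
print(tools_scanned_from_report({"tool_inventory": ["a", "b", "c"]}))   # 3
print(tools_scanned_from_report({"summary": {}}))                       # None
```

Keeping the parse pure (dict in, value out) is what makes it straightforward to unit-test without running a scan.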

3. extraction_coverage ratio report field → CONSIDERED AND DECLINED

Gate analysis (recorded in CALIBRATION.md): moves no headline metric (the IE rate is moved by the threshold, not by exposing the ratio); fully derivable from fields the report already carries (evidence_coverage.low_confidence_tool_count + tool_surface.total_tools, and evidence_gaps[] already enumerates each gap); "legibility/completeness" is a rejected justification and a new field is a schema bump. No code added.
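As a rough illustration of "fully derivable", a consumer could compute the ratio from the two existing fields. The function name and the sample report below are hypothetical; only the field paths come from the description above.

```python
from typing import Any, Optional


def derived_low_confidence_ratio(report: dict[str, Any]) -> Optional[float]:
    """Hypothetical consumer-side derivation; not part of the report schema."""
    low = report["evidence_coverage"]["low_confidence_tool_count"]
    total = report["tool_surface"]["total_tools"]
    if not total:
        return None  # no denominator, so no ratio
    return low / total


# Made-up report mirroring the labeled fixture (every tool low-confidence).
sample = {
    "evidence_coverage": {"low_confidence_tool_count": 2},
    "tool_surface": {"total_tools": 2},
}
print(derived_low_confidence_ratio(sample))  # 1.0
```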

Guards

  • test_ie_threshold_is_exercised_and_robust_on_the_labeled_coverage_fixture — pins the one labeled IE point to the live constant; an extraction improvement that resolves the surface (or a threshold edit) surfaces in CI.
  • test_record_head_report_* — lock the tools_scanned capture fix.

No product surface added; no schema bump; full suite + ruff + schema-drift clean.

🤖 Generated with Claude Code

pengfei-threemoonslab and others added 2 commits June 16, 2026 14:29
Closes out the two Phase 2c follow-ups (extraction_coverage ratio surface +
IE-threshold calibration) — both with the surface-discipline gate applied
rather than presumed.

IE-threshold calibration (benchmark/miner/CALIBRATION.md). Attempted to
calibrate `_LOW_CONFIDENCE_TOOL_RATIO=0.5` / `_MAX_TOLERATED_SOURCE_WARNINGS=3`
(shipped v0.14, never tuned) against the corpus. Finding: the data cannot
justify a change.
- Real corpus: 241 PRs, 9 decided, IE-dominated, UNLABELED (the human pass is
  unstarted), and the threshold's ratio denominator was never captured.
- Constructed labeled corpus: only 1 of 7 fixtures (openai_agents_sdk_agent)
  exercises the IE threshold, and it sits at the robust extreme (every tool
  low-confidence, ratio 1.0) — a single point that cannot distinguish 0.3 vs
  0.5 vs 0.7. The source-warning constant is exercised by no labeled case.
Decision: HOLD 0.5/3, now examined + guarded instead of unexamined. Revisit
conditions (labeling pass + re-mine) documented in CALIBRATION.md.

tools_scanned capture bug (real bug, found while calibrating). evaluate.py read
the total tool count from `summary` (ReportSummary carries no tool count) so it
was null on every mined row, blanking the ratio denominator. Fixed: read
`tool_surface.total_tools` (tool_inventory length fallback), refactored the
head-report parse into the pure, unit-tested `_record_head_report`. Committed
corpus predates the fix and still shows null; future mines populate it.

Guards:
- tests/test_miner_constructed.py: pins the one labeled IE point to the live
  `_LOW_CONFIDENCE_TOOL_RATIO`, so an extraction improvement that resolves the
  dynamic surface — or a threshold edit — surfaces in CI.
- tests/test_miner.py: lock the tools_scanned capture fix.

extraction_coverage ratio report field — CONSIDERED AND DECLINED (gate analysis
in CALIBRATION.md). Moves no headline metric (the IE rate is moved by the
threshold, not by exposing the ratio); fully derivable from existing report
fields (evidence_coverage.low_confidence_tool_count + tool_surface.total_tools);
"legibility/completeness" is a rejected justification and a new field is a
schema bump. No code added.

No product surface added; no schema bump; full suite + ruff + schema-drift clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Addresses the P3 review finding on PR #223. The fixture
openai_agents_sdk_agent has low == total == 2, so `low >= threshold` holds for
ratio 0.3/0.5/0.7/1.0 alike — that assertion never catches a threshold edit, so
claiming (in CALIBRATION.md, the test docstring, and the constants comment)
that a threshold edit "surfaces in CI" was overstated.

Fix — two guards with distinct, accurately-stated jobs:
- Threshold edits: new `test_ie_threshold_constants_are_frozen`
  (tests/test_release_decision.py) asserts the constants equal 0.5 / 3.
  Mutation-checked: changing the ratio to 0.3 fails it. This is what actually
  makes a threshold edit surface in CI, forcing the deliberate recalibration
  path (update CALIBRATION.md + the labeling/re-mine prerequisites).
- Extraction regressions: the fixture test keeps its real job — re-running the
  one labeled IE case so an extraction improvement that resolves the dynamic
  surface flips the verdict and fails. Its docstring now says it does NOT guard
  threshold edits (the point sits at ratio 1.0, robust across (0,1]).

Reworded the matching claims in benchmark/miner/CALIBRATION.md and the
release_decision.py constants comment to point at the freeze guard for edits
and the fixture guard for extraction.

Full suite + ruff clean; no schema/behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
pengfei-threemoonslab merged commit 5b614ab into main on Jun 16, 2026
2 checks passed
pengfei-threemoonslab deleted the claude/ie-threshold-calibration branch on Jun 16, 2026 22:28
pengfei-threemoonslab added a commit that referenced this pull request on Jun 17, 2026
* Deepen real-history mining: 2026-W26 over agent apps/toolkits

Continues the verdict-accuracy evidence base by mining agent
application/toolkit repos (the W25 lesson: framework cores yield ~0 decided).

Run: stripe/agent-toolkit + block/goose + pydantic/pydantic-ai, 40 PRs each
(120 rows). Results committed under benchmark/miner/results/2026-W26-mined.*
+ a blank labeling worksheet.

Findings:
- App/toolkit > cores, but thin: stripe/agent-toolkit gave 6 decided rows
  (15%); goose + pydantic-ai (a framework core) gave 0. The 6 are the first
  real `review_required` decided rows (cores only ever yielded IE), but they
  collapse to one pattern (a repeated automated skill-sync bot PR) — decided
  *diversity* added ≈1. Aggregate is now 9 repos / 361 PRs / 336 (93%)
  trigger-skip / 15 decided, and ZERO of the 15 are must_block — reconfirming
  must_block positives must come from the constructed stratum.
- Validates the #223 tools_scanned capture fix on REAL data: every W26 decided
  row records the ratio denominator (W24/W25 predate the fix, still null).
  Pinned by test_w26_headline_numbers_reproduce_from_committed_data; noted in
  CALIBRATION.md.
- Engine-robustness bug found (chipped, not fixed here): block/goose's OpenAPI
  spec crashes scan with `Config error: Duplicate action_surface action_id`
  (4 scan_failed) — action_id is method+path without operationId, so two ops
  on GET /sessions/{session_id} collide. A third-party spec must fail soft.
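
The collision in the last finding can be reproduced with a small sketch. Everything here is illustrative: the `"METHOD path"` id format, the operation summaries, and the warn-and-skip behavior are assumptions, and the fail-soft shape shown is one possible option rather than the fix adopted by the project.

```python
def action_id(method: str, path: str) -> str:
    """Assumed id format: HTTP method plus path, with no operationId."""
    return f"{method.upper()} {path}"


def build_action_surface(operations):
    """Collect actions, recording duplicates as warnings instead of raising."""
    surface: dict[str, str] = {}
    warnings: list[str] = []
    for method, path, summary in operations:
        key = action_id(method, path)
        if key in surface:
            # Fail soft: keep the first entry and note the collision.
            warnings.append(f"duplicate action_id skipped: {key}")
            continue
        surface[key] = summary
    return surface, warnings


# Made-up operations shaped like the reported case.
ops = [
    ("get", "/sessions/{session_id}", "Fetch a session"),
    ("get", "/sessions/{session_id}", "Fetch a session's history"),
    ("post", "/sessions", "Create a session"),
]
surface, warnings = build_action_surface(ops)
print(sorted(surface))  # ['GET /sessions/{session_id}', 'POST /sessions']
print(warnings)         # ['duplicate action_id skipped: GET /sessions/{session_id}']
```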

Corpus guard: W26 headline test added; cross-run trigger-skip floor raised
241 -> 361. Files LF-only. Full suite + ruff clean; no product/schema change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* W26 review: scored verdict is IE, not review_required (correct the framing)

Addresses the P2 review finding on PR #224. The accuracy scorer uses
`labels.effective_verdict` = `verify_verdict or head_decision`, and all 6 W26
decided rows have `verify_verdict=insufficient_evidence` (only the cold-start
`head_decision` is review_required). So W26 adds NO scored review_required
cases — the effective verdict mix stays IE-dominated. The earlier "first real
review_required / non-IE decided rows" framing was misleading.
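
The precedence rule quoted above (`verify_verdict or head_decision`) can be shown in a few lines. The function below is a sketch modeled on that expression, operating on plain dicts; the real `labels.effective_verdict` signature and row type are not shown here and may differ.

```python
from typing import Any, Optional


def effective_verdict(row: dict[str, Any]) -> Optional[str]:
    """Sketch: the verify verdict wins; head_decision is only a fallback."""
    return row.get("verify_verdict") or row.get("head_decision")


# Shaped like a W26 decided row: cold-start says review_required,
# verification says insufficient_evidence.
w26_row = {
    "head_decision": "review_required",
    "verify_verdict": "insufficient_evidence",
}
# A hypothetical row with no verify verdict falls back to head_decision.
unverified_row = {"head_decision": "review_required", "verify_verdict": None}

print(effective_verdict(w26_row))         # insufficient_evidence
print(effective_verdict(unverified_row))  # review_required
```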

- README: table note + W26 findings now state the cold-start head_decision is
  review_required but the verify-effective (scored) verdict is IE for all 6, so
  the scored mix is unchanged; net finding = even the best app/toolkit repo's
  decided rows are effectively IE once verified.
- test_w26_headline: assert BOTH head_decision == review_required AND
  effective_verdict == insufficient_evidence, so the docs can't drift back to
  the "non-IE" framing.

No data regenerated; verdicts unchanged. Full suite + ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* W26 docs: correct helper name to labels.effective_verdict

Addresses the P3 review nit on PR #224. The README referenced
`labels._effective_verdict`; the actual helper (and the test) use
`effective_verdict` (no leading underscore). Doc-only.