Name the unbounded-toolkit blind spot: SHIP-SCOPE-TOOLKIT-UNBOUNDED (Phase 2a) #221
Merged
…ot (Phase 2a)

The dominant real-world insufficient_evidence cause (the 2026-06-01 Stripe pilot #232 and the W24/W25 mining, where decided real-history cases were IE-dominated) is a dynamically-loaded toolkit mounted with no configuration allowlist: the full toolkit surface (stripe_agent_toolkit refund/cancel/dispute) the static extractor can't enumerate. Today that passes silently on a plain scan; the only signal, SHIP-VERIFY-CAPABILITY-SCOPE-BROADENED, is verify-tier and fires only on a base->head weakening of a known bound.

New standard check (category `scope`) flags the unbounded bound's EXISTENCE on any scan, routing it to human review with a concrete recommendation instead of a silent pass. Complementary, not a duplicate: the verify-tier check needs a base bound to weaken from; this fires on the head's unbounded state directly.

- checks/toolkit_bounds.py: SHIP-SCOPE-TOOLKIT-UNBOUNDED (high; requires human review regardless of patch, because the allowlist is the team's call, never autofix). Reads context.toolkit_bounds for bounded=False (populated by the openai_sdk adapter; the bound model is provider-agnostic).
- registry + scope.yaml metadata + checks.md/json + llms-full + the pinned override-set test updated.

Validated end-to-end on the pilot's exact case: scanning tests/fixtures/stripe_pr232/head (full toolkit, no bound; the fixture comment literally says "mounted, silently") now surfaces the finding. Zero golden/sample regression; the constructed accuracy benchmark is unchanged (no committed sample mounts an unbounded toolkit).

Tests: unit (unbounded/bounded/none) + the e2e head scan.

Scopes to 2a (naming). Follow-up 2c moves the IE *rate*: elevate the decision out of insufficient_evidence to review_required when explained by this named finding, plus the extraction_coverage surface.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
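The check described in this commit can be pictured roughly as follows. This is a minimal sketch, not the code in checks/toolkit_bounds.py: only `context.toolkit_bounds`, the `bounded` flag, the check ID, the `scope` category, and the high severity come from the commit message. The `ToolkitScopeBound` fields other than `bounded`, the `ScanContext` and `Finding` shapes, and the function name `check_toolkit_unbounded` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

CHECK_ID = "SHIP-SCOPE-TOOLKIT-UNBOUNDED"


@dataclass(frozen=True)
class ToolkitScopeBound:
    # Hypothetical shape: only `bounded` is named in the commit message.
    toolkit: str
    bounded: bool


@dataclass
class ScanContext:
    # Hypothetical stand-in for the scan context; an adapter would populate this.
    toolkit_bounds: List[ToolkitScopeBound] = field(default_factory=list)


@dataclass(frozen=True)
class Finding:
    # Hypothetical finding record, reduced to the fields the commit mentions.
    check_id: str
    category: str
    severity: str
    requires_human_review: bool
    message: str


def check_toolkit_unbounded(context: ScanContext) -> List[Finding]:
    """Emit one finding per toolkit that is mounted without a bound."""
    findings = []
    for bound in context.toolkit_bounds:
        if bound.bounded:
            continue  # a bounded toolkit stays silent
        findings.append(
            Finding(
                check_id=CHECK_ID,
                category="scope",
                severity="high",
                # Always routed to a human: the allowlist is the team's call.
                requires_human_review=True,
                message=f"{bound.toolkit} is mounted with no configuration allowlist",
            )
        )
    return findings
```

Under these assumptions the three unit cases named in the commit map onto an unbounded entry (one finding), a bounded entry (no finding), and an empty `toolkit_bounds` list (no finding).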
…erprint (PR #221 review)

Review P2: the finding embedded `support_agent.py:<line>` in fingerprinted evidence and went through agent_finding(), whose source is shipgate.yaml. So SARIF located the finding at the manifest (the actionable constructor only buried in evidence), and a harmless line move (27→28) churned the fingerprint, thrashing baselines/accepted debt.

Fix (matches the _framework_common dynamic-surface pattern): build the Finding directly with a code-location SourceReference (path + start_line at the constructor) so SARIF/report point at support_agent.py:27, and drop the line from evidence. finding_fingerprint hashes evidence only, not source, so the fingerprint is now stable across line moves while the surfaced location still reflects the real line.

Tests: assert source.path/start_line carry the constructor location and evidence.source_ref is the bare filename; new fingerprint-stability test (line 27 vs 28 → identical fingerprint); the e2e Stripe-head scan asserts the finding's source points at support_agent.py, not shipgate.yaml.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
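The fingerprint-stability argument in this commit can be illustrated with a small sketch. It assumes a `finding_fingerprint` that hashes the evidence mapping only, as the commit message states; the dataclass shapes, the SHA-256/JSON encoding, and the `toolkit` evidence key are hypothetical and are not taken from the repository.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SourceReference:
    # Where SARIF/report output should point; not part of the fingerprint.
    path: str
    start_line: int


@dataclass(frozen=True)
class Finding:
    check_id: str
    source: SourceReference
    evidence: Dict[str, str]


def finding_fingerprint(finding: Finding) -> str:
    """Hash the evidence only (assumed encoding: sorted JSON, SHA-256)."""
    payload = json.dumps(finding.evidence, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_finding(line: int, line_in_evidence: bool) -> Finding:
    # Old behaviour: the line number leaks into evidence.
    # New behaviour: evidence carries the bare filename only.
    ref = f"support_agent.py:{line}" if line_in_evidence else "support_agent.py"
    return Finding(
        check_id="SHIP-SCOPE-TOOLKIT-UNBOUNDED",
        source=SourceReference(path="support_agent.py", start_line=line),
        evidence={"source_ref": ref, "toolkit": "stripe_agent_toolkit"},
    )


# Before the fix: a 27 -> 28 line move changes the fingerprint.
old_27 = finding_fingerprint(make_finding(27, line_in_evidence=True))
old_28 = finding_fingerprint(make_finding(28, line_in_evidence=True))

# After the fix: the fingerprint is identical, while source still tracks the line.
new_27 = finding_fingerprint(make_finding(27, line_in_evidence=False))
new_28 = finding_fingerprint(make_finding(28, line_in_evidence=False))
```

In this sketch `old_27 != old_28` while `new_27 == new_28`, which is the property the new fingerprint-stability test is described as asserting.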
Summary
Phase 2 attacks the empirically confirmed #1 real-world weakness: the dominant `insufficient_evidence` cause is a dynamically-loaded toolkit mounted with no configuration allowlist, that is, the full toolkit surface (`stripe_agent_toolkit` refund/cancel/dispute) the static extractor can't enumerate. This is exactly what the 2026-06-01 Stripe pilot (#232) hit, and what the W24/W25 mining showed dominates the decided real-history cases.

Today that passes silently on a plain scan: the only signal, `SHIP-VERIFY-CAPABILITY-SCOPE-BROADENED`, is verify-tier and fires only on a base→head weakening of a known bound.

This PR (2a) adds a standard check that flags the unbounded bound's existence on any scan, so it routes to human review with a concrete fix instead of disappearing into a silent pass / generic IE.
Validated on the pilot's exact case
Scanning `tests/fixtures/stripe_pr232/head` (the full Stripe toolkit mounted with no bound, whose own source comment reads "No configuration bound: the full Stripe toolkit is mounted, silently") now surfaces:

…on a plain scan, not just a base→head diff.
Design
- `checks/toolkit_bounds.py` → `SHIP-SCOPE-TOOLKIT-UNBOUNDED` (category `scope`, high, `requires_human_review_regardless_of_patch`: the allowlist is the team's call, never autofixable).
- Reads `context.toolkit_bounds` for `bounded=False` entries (populated by the OpenAI SDK adapter; the `ToolkitScopeBound` model is provider-agnostic, so future toolkit providers are covered for free).

Scope & follow-up
This is 2a (naming the blind spot), deliberately not touching the core decision path. The decision stays `insufficient_evidence` for cases with other low-confidence signals, but it's now actionable; a clean silent-pass case escalates to `review_required`. 2c (a follow-up) moves the IE rate: elevate the decision out of IE→review when it's explained by this named finding, plus the `extraction_coverage` surface.

Type
Verification
CI is authoritative for `python -m ruff check .`, `python -m compileall -q src tests`, and `python -m pytest`.

- `docs/checks.{md,json}` + `llms-full.txt` regenerated.
- blocked_recall=1.0 / benign_escalation=0.
- `tests/test_toolkit_bounds_check.py`: unit (unbounded fires / bounded silent / none) + an end-to-end scan of the Stripe head fixture.

Release-readiness notes
- `docs/checks.md` (added `### SHIP-SCOPE-TOOLKIT-UNBOUNDED`)
- `STABILITY.md` (new stable check ID; no schema bump)