Design: tool-def deferral coverage (detect deferral gaps, restore/tune native deferral, proxy for non-native agents) #614

Description

@AVSRPA1KR

Follow-up to the acting-layer epic (#602). This is the deferral piece called out in the epic's non-goals as a separate, evidence-gated track. Design issue first, then PR, same flow the epic used.

Motivation

Tool-def deferral is no longer a missing capability on the happy path. Claude Code ships with tool search enabled by default: MCP tool defs are deferred and discovered on demand. The API mechanism behind it (defer_loading plus a tool search tool) is GA on the Claude API, and by design it leaves the prompt prefix untouched and preserves prompt caching (per current docs at https://code.claude.com/docs/en/mcp and https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool).

The real gap is coverage, and it has three parts:

  1. Deferral silently turns off. Claude Code disables tool search on Vertex AI and whenever ANTHROPIC_BASE_URL points to a non-first-party host (docs: "most proxies don't forward tool_reference blocks"). Users behind any gateway/proxy, or with a stale ENABLE_TOOL_SEARCH=false, are paying full tool-def cost on every turn with no signal that anything changed.
  2. Deferral config has tunable waste even when on. alwaysLoad: true servers that are rarely invoked, and threshold mode (auto:N) left at defaults, both leave measurable prefix cost on the table.
  3. Non-native agents have no deferral at all. Cursor, Cline, Aider and similar load every def every turn (support matrix to be confirmed as implementation step one).

None of this is routing. No model changes, no per-turn classification, no rewriting of anything the user wrote.

Captured evidence (real usage, Opus 4.x)

Deferred vs non-deferred cost, measured on the same workloads with each configuration run in its own session:

  • 20-turn tool-heavy session: $1.31 with all defs loaded every turn vs $0.90 deferred (~31% less, no model change).
  • Two-turn "chat → one tool lookup" flow: $0.96 non-deferred vs $0.35 deferred (~64% less).
  • Cache behavior matches the documented design: deferred defs are excluded from the prefix; a discovered tool comes back as a tool_reference block appended inline, so the cached head stays byte-stable.
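
The percentages quoted above follow directly from the dollar figures. A minimal sketch of the arithmetic, where savingsPercent is an illustrative helper name and not part of the codebase:

```typescript
// Relative cost reduction of a deferred run versus a non-deferred run, in percent.
// Illustrative helper only; the name is not taken from the repository.
function savingsPercent(nonDeferredUsd: number, deferredUsd: number): number {
  if (nonDeferredUsd <= 0) {
    throw new RangeError("non-deferred cost must be positive");
  }
  return ((nonDeferredUsd - deferredUsd) / nonDeferredUsd) * 100;
}

// The two captured workloads from the list above.
const twentyTurn = savingsPercent(1.31, 0.9);
const twoTurn = savingsPercent(0.96, 0.35);

console.log(twentyTurn.toFixed(1), twoTurn.toFixed(1)); // prints "31.3 63.5"
```

Rounded to whole percentages, these are the ~31% and ~64% figures reported in the list.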

Addressing the epic's evidence gate directly:

| Gate | Answer |
| --- | --- |
| Pre-turn classification accuracy | Not applicable: deferral needs no classification. That concern belongs to routing, which stays out of scope here. |
| Cache-hit impact | Positive by documented design (prefix untouched); acceptance criteria below make it measured per user, not asserted. |
| Auth mode | Proxy is a local pass-through. Keys stay in the client's env/headers; the proxy never stores, logs, or re-signs credentials. Detail below. |

Scope: three parts

1. Detection (analyzer-side, always safe)

A new optimize finding family, mcp-deferral-gaps:

  • Deferral-off detection. Deterministic transcript signal: sessions carrying meaningful MCP tool-def overhead with zero ToolSearch tool invocations across the window. Config corroboration where readable: ENABLE_TOOL_SEARCH=false in settings/env, a non-first-party ANTHROPIC_BASE_URL, Vertex AI, or a Claude Code version predating the feature. Each cause gets its own message, because the fixes differ.
  • alwaysLoad hygiene. Servers or tools pinned with alwaysLoad: true (server-level in .mcp.json, or "anthropic/alwaysLoad": true in a tool's _meta) whose observed call rate doesn't justify permanent prefix residency. Reuses the per-server schema-token overhead the mcp-low-coverage detector already computes. Threshold starts conservative (flag only clear cases, e.g. under 1 call per 5 sessions) and tunes on the detector's own data.
  • Threshold tuning. Where auto mode is in use, report whether the observed def volume suggests a tighter auto:N.
  • Output: per-server estimated tokens/session in the prefix, observed call rate, estimated savings, and the detected cause. Pure reporting; ships regardless of what follows.
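
To make the detection rules above concrete, here is a rough sketch of the deferral-off check and the alwaysLoad hygiene check. Every type, field name, and the 2,000-token cutoff for "meaningful" overhead is a hypothetical placeholder; the real detector would reuse whatever the mcp-low-coverage code already exposes.

```typescript
// Hypothetical shapes; none of these names come from the analyzer's real code.
type DeferralCause =
  | "env-disabled"
  | "non-first-party-base-url"
  | "vertex-ai"
  | "version-too-old"
  | "unknown";

interface WindowSummary {
  mcpToolDefTokensPerSession: number; // estimated MCP tool-def prefix overhead
  toolSearchInvocations: number; // ToolSearch calls seen across the window
}

interface ConfigHints {
  enableToolSearch?: string; // raw ENABLE_TOOL_SEARCH value, if readable
  baseUrlIsFirstParty?: boolean; // undefined when ANTHROPIC_BASE_URL is not readable
  usesVertex?: boolean;
  versionSupportsToolSearch?: boolean;
}

// Assumed cutoff for "meaningful" overhead; the issue does not fix a number.
const MIN_OVERHEAD_TOKENS = 2000;

// Returns null when deferral looks active (or overhead is negligible),
// otherwise every corroborated cause, so each can get its own message.
function detectDeferralOff(w: WindowSummary, c: ConfigHints): DeferralCause[] | null {
  if (w.mcpToolDefTokensPerSession < MIN_OVERHEAD_TOKENS) return null;
  if (w.toolSearchInvocations > 0) return null;

  const causes: DeferralCause[] = [];
  if (c.enableToolSearch === "false") causes.push("env-disabled");
  if (c.baseUrlIsFirstParty === false) causes.push("non-first-party-base-url");
  if (c.usesVertex === true) causes.push("vertex-ai");
  if (c.versionSupportsToolSearch === false) causes.push("version-too-old");
  return causes.length > 0 ? causes : ["unknown"];
}

// alwaysLoad hygiene: flag only clear cases, i.e. under 1 call per 5 sessions.
function alwaysLoadUnjustified(calls: number, sessions: number): boolean {
  if (sessions === 0) return false; // no data, no finding
  return calls * 5 < sessions;
}
```

Returning a list of causes rather than the first match keeps the "each cause gets its own message" rule easy to honor when, say, a stale env var and a gateway are both present.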

2. Restore/tune native deferral (config-class, journaled)

Where the agent supports deferral natively, the fix is a config edit — exactly the class of action #603/#604 already handle. New plan kinds in src/act/plans.ts:

  • defer-enable: remove a stale ENABLE_TOOL_SEARCH=false (settings.json env or shell config), or set ENABLE_TOOL_SEARCH=true explicitly for a user behind a tool_reference-capable proxy (see part 3) who wants deferral back. Never set true blind behind an unverified proxy — the docs are explicit that requests fail on proxies that don't support tool_reference blocks, so the plan preview must state this and the finding must distinguish "proxy verified capable" from "unknown proxy".
  • defer-alwaysload: remove or add alwaysLoad: true on specific servers in .mcp.json, per the hygiene finding. Requires Claude Code v2.1.121+; the plan checks the installed version and marks itself not-appliable below it. Note alwaysLoad: true also blocks startup on that server's connection (5s cap) — the preview says so.
  • defer-threshold: set ENABLE_TOOL_SEARCH=auto:N in settings.json env.
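
One way the three plan kinds could be modeled is sketched below. The DeferPlan union, the Preview shape, and both function names are assumptions for illustration; only the kind strings, the v2.1.121 floor, the 5s startup cap, and the auto:N value format come from this issue. The sketch also assumes plain dotted numeric versions and chooses to refuse defer-enable behind an unverified proxy, which is one possible reading of "never set true blind".

```typescript
// Hypothetical plan shapes for src/act/plans.ts; the real types may differ.
type ProxyState = "none" | "verified" | "unknown";

type DeferPlan =
  | { kind: "defer-enable"; proxy: ProxyState }
  | { kind: "defer-alwaysload"; server: string; alwaysLoad: boolean }
  | { kind: "defer-threshold"; n: number };

interface Preview {
  appliable: boolean;
  notes: string[];
}

const MIN_ALWAYSLOAD_VERSION = "2.1.121";

// Compares dotted numeric versions ("2.1.121"); a leading "v" is tolerated.
function versionAtLeast(installed: string, minimum: string): boolean {
  const a = installed.replace(/^v/, "").split(".").map(Number);
  const b = minimum.replace(/^v/, "").split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}

function previewPlan(plan: DeferPlan, installedVersion: string): Preview {
  // Every defer-* preview carries the next-session note.
  const notes = ["Takes effect on the next session."];
  let appliable = true;

  if (plan.kind === "defer-enable") {
    if (plan.proxy !== "none") {
      notes.push("Requests fail on proxies that do not support tool_reference blocks.");
    }
    if (plan.proxy === "unknown") appliable = false; // assumed policy: refuse when unverified
  } else if (plan.kind === "defer-alwaysload") {
    if (!versionAtLeast(installedVersion, MIN_ALWAYSLOAD_VERSION)) appliable = false;
    if (plan.alwaysLoad) {
      notes.push(`alwaysLoad blocks startup on ${plan.server}'s connection (5s cap).`);
    }
  } else {
    notes.push(`Sets ENABLE_TOOL_SEARCH=auto:${plan.n} in settings.json env.`);
  }
  return { appliable, notes };
}
```

Keeping the version gate and the proxy check inside the preview means the dry-run output and the apply decision cannot drift apart.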

All of these go through runAction(): backup, journal, undo, dry-run preview. ENABLE_TOOL_SEARCH is read at process start, so every defer-* preview states that the change takes effect on the next session. Per #605's rule, verify the exact settings surface against live docs at implementation time; the field names above were verified against current docs, but the protocol evolves.

3. Proxy fallback (separate opt-in component, for agents without native deferral)

For agents that can't defer natively, a small local proxy provides the same behavior using the GA API mechanism:

  • Request path: the proxy includes a tool search tool in the tools array and sets defer_loading: true on qualifying MCP tool defs, leaving at least the search tool non-deferred. Constraint from the docs, encoded as a test: a deferred tool cannot carry cache_control (API returns 400) — the cache breakpoint goes on a non-deferred tool.
  • Search executor: the API supports two equivalent executors for the search itself — the server-side variants (tool_search_tool_regex_20251119 / BM25), or a client-side search tool that returns tool_reference blocks in a standard tool_result (the documented custom path; Claude Code's own ToolSearch works this way). The defer_loading + tool_reference expansion mechanism is identical in both. Implementation starts with the server-side variants (smallest proxy surface, no inner round-trip to manage) with the executor kept pluggable, since switching to a proxy-executed search requires no protocol change.
  • Response path (the honest hard part): discovered tools come back as tool_reference blocks that non-native clients may not parse. The proxy maintains the per-conversation reference bookkeeping needed to keep replayed history coherent when the client drops or mangles those blocks. This is the compatibility bar Claude Code's own fallback implies — a proxy is only deferral-safe if it round-trips tool_reference correctly, and this proxy makes that its core competency. (openai/codex #23839 — namespaces dropped on replay after deferred discovery — is the reference failure mode.)
  • Model gating: deferral requires a model that supports tool_reference blocks (Sonnet 4.5+/Opus 4.5+/Haiku 4.5 per current docs). The proxy passes requests for unsupported models through untouched.
  • What it never does: no model rewriting, no param changes, no prompt mutation beyond the def relocation, no telemetry, no network calls except forwarding to the provider the client already targets.
  • Auth: pure pass-through. The client's Authorization/x-api-key headers are forwarded untouched; the proxy holds no credentials at rest and writes none to its logs.
  • Bonus interaction with parts 1 and 2: a Claude Code user behind this proxy can set ENABLE_TOOL_SEARCH=true (the defer-enable plan) and keep native deferral, because the proxy forwards tool_reference blocks. That resolves the silent-off state that most gateways cause.
  • Packaging: a separately installed package, not a core subcommand — keeps the core's zero-proxy guarantee structurally intact, consistent with the epic's Class C rule that request-path code arrives only as a separate opt-in after design review. This issue is that design review.
  • Lifecycle: install/uninstall/status mirroring guard's conventions — journaled (kind: defer-proxy-install), prefix-tagged edits, uninstall restores byte-identical settings. Fail-open: if the proxy is down, the client's direct provider config must still work (documented fallback; no silent MITM dependency).
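
The request-path rules above (add a search tool, defer qualifying defs, keep cache_control off deferred tools, pass unsupported models through) could look roughly like the following. This is a sketch under stated assumptions, not the proxy's implementation: the ToolDef and MessagesRequest shapes are simplified, the "mcp__" name-prefix test for "qualifying" is a guess, the search tool's name field should be checked against live docs, and model support is injected as a predicate so no model IDs are invented here.

```typescript
// Simplified, assumed request shapes; the real Messages API has more fields.
interface ToolDef {
  name: string;
  type?: string;
  description?: string;
  input_schema?: unknown;
  defer_loading?: boolean;
  cache_control?: { type: "ephemeral" };
}

interface MessagesRequest {
  model: string;
  tools?: ToolDef[];
  [key: string]: unknown;
}

// Server-side regex variant named in this issue; the name value is an assumption.
const SEARCH_TOOL: ToolDef = {
  type: "tool_search_tool_regex_20251119",
  name: "tool_search_tool_regex",
};

function deferToolDefs(
  req: MessagesRequest,
  supportsToolReference: (model: string) => boolean,
  qualifies: (tool: ToolDef) => boolean = (t) => t.name.startsWith("mcp__"),
): MessagesRequest {
  const tools = req.tools ?? [];
  // Model gating: unsupported models (or nothing to defer) pass through untouched.
  if (!supportsToolReference(req.model) || !tools.some(qualifies)) return req;

  const out: ToolDef[] = [];
  let displaced: ToolDef["cache_control"];
  for (const tool of tools) {
    if (!qualifies(tool)) {
      out.push(tool);
      continue;
    }
    // A deferred tool must not carry cache_control, so strip it and remember it.
    const { cache_control, ...rest } = tool;
    if (cache_control) displaced = cache_control;
    out.push({ ...rest, defer_loading: true });
  }

  // The search tool itself always stays non-deferred.
  if (!out.some((t) => t.type === SEARCH_TOOL.type)) out.push({ ...SEARCH_TOOL });

  // Re-home a displaced cache breakpoint on the last non-deferred tool.
  if (displaced) {
    let idx = -1;
    out.forEach((t, i) => {
      if (!t.defer_loading) idx = i;
    });
    const target = out[idx];
    if (target) out[idx] = { ...target, cache_control: displaced };
  }
  return { ...req, tools: out };
}
```

The function returns the original request object unchanged on the pass-through path, which makes the "untouched" guarantee for unsupported models cheap to assert in tests. Response-path bookkeeping for tool_reference blocks is deliberately not covered by this sketch.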

Implementation will be built in-repo, following the conventions the epic established. Implementation step one is compiling the agent support matrix: which agents expose native deferral today vs need the proxy.

Measurement

act report (#606) integration: defer-* actions capture a baseline (tool-def tokens/session, cache-write tokens/session, cache-hit rate over trailing 14 days) and report realized deltas post-apply, with the same confidence markers. Deferral's whole pitch is cache economics, so cache-hit rate before/after is a first-class reported number — if a deferral change ever hurts hit rate in practice, the report should say so.
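
A minimal sketch of the before/after comparison described above. The DeferMetrics fields mirror the three baseline numbers named in this section; the type names, the function name, and the cacheHitRegressed flag are hypothetical, and confidence markers are left out because their format is defined in #606, not here.

```typescript
// Hypothetical metric shape; field names are illustrative, not from act report.
interface DeferMetrics {
  toolDefTokensPerSession: number;
  cacheWriteTokensPerSession: number;
  cacheHitRate: number; // fraction in [0, 1]
}

interface DeferDelta {
  toolDefTokensPerSession: number; // negative means fewer tokens after the change
  cacheWriteTokensPerSession: number;
  cacheHitRate: number; // positive means the hit rate improved
  cacheHitRegressed: boolean;
}

// Realized delta = post-apply window minus the trailing-14-day baseline.
function realizedDelta(baseline: DeferMetrics, after: DeferMetrics): DeferDelta {
  const cacheHitRate = after.cacheHitRate - baseline.cacheHitRate;
  return {
    toolDefTokensPerSession: after.toolDefTokensPerSession - baseline.toolDefTokensPerSession,
    cacheWriteTokensPerSession:
      after.cacheWriteTokensPerSession - baseline.cacheWriteTokensPerSession,
    cacheHitRate,
    // Surfaced explicitly so a report can say so when deferral hurts hit rate.
    cacheHitRegressed: cacheHitRate < 0,
  };
}
```

Carrying the regression flag as its own field keeps the "if it hurts hit rate, the report should say so" requirement from depending on a reader noticing the sign of a number.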

Non-goals

  • No per-turn model routing, no request classification, no param rewriting. (Separate future track, gated on its own evidence.)
  • No reimplementing deferral where it exists natively — native path always preferred; the proxy is only for agents without it (or gateways that break it).
  • Nothing on by default. Detection findings are informational; both fix paths require explicit commands.
  • No cross-provider anything. Noting for the record: OpenAI ships its own tool search (GPT-5.4+), but Codex CLI's coverage is partial — general MCP tool defs still load upfront (openai/codex #14507, #9266) and the deferred discovery path has known replay bugs (#23839, dropped function_call namespaces). Different wire protocol, different cache economics (no write premium), so it stays out of this design. The #23839 failure mode is, incidentally, exactly the class of bug the proxy's response-path state handling is designed to prevent.

Acceptance

  • optimize surfaces deferral gaps with per-server evidence, detected cause, and estimated savings, with zero behavior change for users who ignore it.
  • A user whose deferral was silently off (Claude Code's gateway/proxy-triggered disable, or a stale env setting) can see why, apply the fix where safe, and undo to a byte-identical config.
  • On a non-native agent, a user can install the proxy, run a real session with all tools still available and correct responses (tool_reference round-trip verified in tests), and act report shows measured tokens/session and cache-hit deltas after the window.
  • Uninstalling the proxy leaves the agent's provider config working with no manual repair.
