Design: tool-def deferral coverage (detect deferral gaps, restore/tune native deferral, proxy for non-native agents)

Follow-up to the acting-layer epic (#602). This is the deferral piece called out in the epic's non-goals as a separate, evidence-gated track. Design issue first, then PR, same flow the epic used.
 
## Motivation
 
Tool-def deferral is no longer a missing capability on the happy path: Claude Code ships tool search **enabled by default** — MCP tool defs are deferred and discovered on demand, and the API mechanism behind it (`defer_loading` + a tool search tool) is GA on the Claude API, with the prefix untouched and prompt caching preserved by design (per current docs at https://code.claude.com/docs/en/mcp and https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool).
 
The real gap is coverage, and it has three parts:
 
1. **Deferral silently turns off.** Claude Code disables tool search on Vertex AI and whenever `ANTHROPIC_BASE_URL` points to a non-first-party host (docs: "most proxies don't forward `tool_reference` blocks"). Users behind any gateway/proxy, or with a stale `ENABLE_TOOL_SEARCH=false`, are paying full tool-def cost on every turn with no signal that anything changed.
2. **Deferral config has tunable waste even when on.** `alwaysLoad: true` servers that are rarely invoked, and threshold mode (`auto:N`) left at defaults, both leave measurable prefix cost on the table.
3. **Non-native agents have no deferral at all.** Cursor, Cline, Aider and similar load every def every turn (support matrix to be confirmed as implementation step one).

None of this is routing. No model changes, no per-turn classification, no rewriting of anything the user wrote.
 
## Captured evidence (real usage, Opus 4.x)
 
Measured deferred vs non-deferred on the same workloads, in separate sessions each:
 
- 20-turn tool-heavy session: all defs loaded every turn **$1.31** vs deferred **$0.90** (~31% less, no model change).
- Two-turn "chat → one tool lookup" flow: **$0.96** non-deferred vs **$0.35** deferred (~64% less).
- Cache behavior matches the documented design: deferred defs are excluded from the prefix; a discovered tool comes back as a `tool_reference` block appended inline, so the cached head stays byte-stable.

Addressing the epic's evidence gate directly:

| Gate | Answer |
|---|---|
| Pre-turn classification accuracy | Not applicable — deferral needs no classification. That concern belongs to routing, which stays out of scope here. |
| Cache-hit impact | Positive by documented design (prefix untouched); acceptance criteria below make it measured per user, not asserted. |
| Auth mode | Proxy is a local pass-through. Keys stay in the client's env/headers; the proxy never stores, logs, or re-signs credentials. Detail below. |
 
## Scope: three parts
 
### 1. Detection (analyzer-side, always safe)
 
A new optimize finding family, `mcp-deferral-gaps`:
 
- **Deferral-off detection.** Deterministic transcript signal: sessions carrying meaningful MCP tool-def overhead with zero `ToolSearch` tool invocations across the window. Config corroboration where readable: `ENABLE_TOOL_SEARCH=false` in settings/env, a non-first-party `ANTHROPIC_BASE_URL`, Vertex AI, or a Claude Code version predating the feature. Each cause gets its own message, because the fixes differ.
- **`alwaysLoad` hygiene.** Servers or tools pinned with `alwaysLoad: true` (server-level in `.mcp.json`, or `"anthropic/alwaysLoad": true` in a tool's `_meta`) whose observed call rate doesn't justify permanent prefix residency. Reuses the per-server schema-token overhead the `mcp-low-coverage` detector already computes. Threshold starts conservative (flag only clear cases, e.g. under 1 call per 5 sessions) and tunes on the detector's own data.
- **Threshold tuning.** Where `auto` mode is in use, report whether the observed def volume suggests a tighter `auto:N`.
- Output: per-server estimated tokens/session in the prefix, observed call rate, estimated savings, and the detected cause. Pure reporting; ships regardless of what follows.
### 2. Restore/tune native deferral (config-class, journaled)
 
Where the agent supports deferral natively, the fix is a config edit — exactly the class of action #603/#604 already handle. New plan kinds in `src/act/plans.ts`:
 
- **`defer-enable`**: remove a stale `ENABLE_TOOL_SEARCH=false` (settings.json `env` or shell config), or set `ENABLE_TOOL_SEARCH=true` explicitly for a user behind a tool_reference-capable proxy (see part 3) who wants deferral back. Never set `true` blind behind an unverified proxy — the docs are explicit that requests fail on proxies that don't support `tool_reference` blocks, so the plan preview must state this and the finding must distinguish "proxy verified capable" from "unknown proxy".
- **`defer-alwaysload`**: remove or add `alwaysLoad: true` on specific servers in `.mcp.json`, per the hygiene finding. Requires Claude Code v2.1.121+; the plan checks the installed version and marks itself not-appliable below it. Note `alwaysLoad: true` also blocks startup on that server's connection (5s cap) — the preview says so.
- **`defer-threshold`**: set `ENABLE_TOOL_SEARCH=auto:N` in settings.json `env`.

All through `runAction()`: backup, journal, undo, dry-run preview. `ENABLE_TOOL_SEARCH` is read at process start, so every `defer-*` preview states that the change takes effect on the next session. Per #605's rule, verify the exact settings surface against live docs at implementation time; the field names above were verified against current docs but the protocol evolves.
 
### 3. Proxy fallback (separate opt-in component, for agents without native deferral)
 
For agents that can't defer natively, a small local proxy provides the same behavior using the GA API mechanism:
 
- **Request path:** the proxy includes a tool search tool in the `tools` array and sets `defer_loading: true` on qualifying MCP tool defs, leaving at least the search tool non-deferred. Constraint from the docs, encoded as a test: a deferred tool cannot carry `cache_control` (API returns 400) — the cache breakpoint goes on a non-deferred tool.
- **Search executor:** the API supports two equivalent executors for the search itself — the server-side variants (`tool_search_tool_regex_20251119` / BM25), or a client-side search tool that returns `tool_reference` blocks in a standard `tool_result` (the documented custom path; Claude Code's own `ToolSearch` works this way). The `defer_loading` + `tool_reference` expansion mechanism is identical in both. Implementation starts with the server-side variants (smallest proxy surface, no inner round-trip to manage) with the executor kept pluggable, since switching to a proxy-executed search requires no protocol change.
- **Response path (the honest hard part):** discovered tools come back as `tool_reference` blocks that non-native clients may not parse. The proxy maintains the per-conversation reference bookkeeping needed to keep replayed history coherent when the client drops or mangles those blocks. This is the compatibility bar Claude Code's own fallback implies — a proxy is only deferral-safe if it round-trips `tool_reference` correctly, and this proxy makes that its core competency. (openai/codex #23839 — namespaces dropped on replay after deferred discovery — is the reference failure mode.)
- **Model gating:** deferral requires a model that supports `tool_reference` blocks (Sonnet 4.5+/Opus 4.5+/Haiku 4.5 per current docs). The proxy passes requests for unsupported models through untouched.
- **What it never does:** no model rewriting, no param changes, no prompt mutation beyond the def relocation, no telemetry, no network calls except forwarding to the provider the client already targets.
- **Auth:** pure pass-through. The client's `Authorization`/`x-api-key` headers are forwarded untouched; the proxy holds no credentials at rest and writes none to its logs.
- **Bonus interaction with part 1:** a Claude Code user behind this proxy can set `ENABLE_TOOL_SEARCH=true` and keep native deferral, because the proxy forwards `tool_reference` — resolving the silent-off state that most gateways cause.
- **Packaging:** a separately installed package, not a core subcommand — keeps the core's zero-proxy guarantee structurally intact, consistent with the epic's Class C rule that request-path code arrives only as a separate opt-in after design review. This issue is that design review.
- **Lifecycle:** `install`/`uninstall`/`status` mirroring `guard`'s conventions — journaled (kind: `defer-proxy-install`), prefix-tagged edits, uninstall restores byte-identical settings. Fail-open: if the proxy is down, the client's direct provider config must still work (documented fallback; no silent MITM dependency).

Implementation will be built in-repo, following the conventions the epic established. Implementation step one is compiling the agent support matrix: which agents expose native deferral today vs need the proxy.
 
## Measurement
 
`act report` (#606) integration: `defer-*` actions capture a baseline (tool-def tokens/session, cache-write tokens/session, cache-hit rate over trailing 14 days) and report realized deltas post-apply, with the same confidence markers. Deferral's whole pitch is cache economics, so cache-hit rate before/after is a first-class reported number — if a deferral change ever *hurts* hit rate in practice, the report should say so.
 
## Non-goals
 
- No per-turn model routing, no request classification, no param rewriting. (Separate future track, gated on its own evidence.)
- No reimplementing deferral where it exists natively — native path always preferred; the proxy is only for agents without it (or gateways that break it).
- Nothing on by default. Detection findings are informational; both fix paths require explicit commands.
- No cross-provider anything. Noting for the record: OpenAI ships its own tool search (GPT-5.4+), but Codex CLI's coverage is partial — general MCP tool defs still load upfront (openai/codex #14507, #9266) and the deferred discovery path has known replay bugs (#23839, dropped function_call namespaces). Different wire protocol, different cache economics (no write premium), so it stays out of this design. The #23839 failure mode is, incidentally, exactly the class of bug the proxy's response-path state handling is designed to prevent.
## Acceptance
 
- `optimize` surfaces deferral gaps with per-server evidence, detected cause, and estimated savings, with zero behavior change for users who ignore it.
- A user whose deferral was silently off (proxy fallback / stale env) can see why, apply the fix where safe, and undo to a byte-identical config.
- On a non-native agent, a user can install the proxy, run a real session with all tools still available and correct responses (tool_reference round-trip verified in tests), and `act report` shows measured tokens/session and cache-hit deltas after the window.
- Uninstalling the proxy leaves the agent's provider config working with no manual repair.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Design: tool-def deferral coverage (detect deferral gaps, restore/tune native deferral, proxy for non-native agents) #614

Motivation

Captured evidence (real usage, Opus 4.x)

Scope: three parts

1. Detection (analyzer-side, always safe)

2. Restore/tune native deferral (config-class, journaled)

3. Proxy fallback (separate opt-in component, for agents without native deferral)

Measurement

Non-goals

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Gate	Answer
Pre-turn classification accuracy	Not applicable — deferral needs no classification. That concern belongs to routing, which stays out of scope here.
Cache-hit impact	Positive by documented design (prefix untouched); acceptance criteria below make it measured per user, not asserted.
Auth mode	Proxy is a local pass-through. Keys stay in the client's env/headers; the proxy never stores, logs, or re-signs credentials. Detail below.

Uh oh!

Design: tool-def deferral coverage (detect deferral gaps, restore/tune native deferral, proxy for non-native agents) #614

Description

Motivation

Captured evidence (real usage, Opus 4.x)

Scope: three parts

1. Detection (analyzer-side, always safe)

2. Restore/tune native deferral (config-class, journaled)

3. Proxy fallback (separate opt-in component, for agents without native deferral)

Measurement

Non-goals

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions