Skip to content

feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining#494

Open
colbymchenry wants to merge 12 commits into
mainfrom
feat/go-multi-module-trace-quality
Open

feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining#494
colbymchenry wants to merge 12 commits into
mainfrom
feat/go-multi-module-trace-quality

Conversation

@colbymchenry
Copy link
Copy Markdown
Owner

Summary

Multi-pronged fix to make codegraph competitive on Go multi-module repos (cosmos-sdk, etcd) where it previously lost on cost. Driven by an 8-question agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd.

The empirical gate ruled OUT go.work parsing as the real gap (prometheus crushes without it). The actual failure modes:

  1. Generated-file noise warps disambiguation. codegraph_search "Send" on cosmos-sdk returned the gRPC stub at tx_grpc.pb.go:124 first; trace landed on the empty stub, reported "no path", agent fell back to Read.
  2. Go has no static interface→impl bridge. Structural typing means the existing interfaceOverrideEdges (Java/Kotlin only) doesn't apply, so MsgServer.Send (interface in .pb.go) and msgServer.Send (impl in keeper) never connect.
  3. Trace's failure path used to fan out into 3-5 follow-up tool calls (codegraph_node, codegraph_callers, …) plus a Read.
  4. Trace endpoint-pairing picked by FTS rank — on a multi-module repo, EndBlocker exists in 20+ modules; FTS picked an arbitrary one.

What's in here

  • src/extraction/generated-detection.ts — path-pattern classifier for .pb.go, .pulsar.go, _grpc.pb.go, _mock.go, _mocks.go, mock_*.go, .generated.[jt]sx?, _pb2(_grpc)?.py, .pb.{cc,h}, .g.dart, .freezed.dart. Applied as a stable sort tiebreaker in findSymbol, findAllSymbols, codegraph_search (MCP + CLI), codegraph_explore file ranking, and context formatter Entry Points / Related Symbols / Code blocks.
  • goGrpcStubImplEdges synthesizer in callback-synthesizer.ts — detects UnimplementedXxxServer structs in generated files, identifies their RPC methods (excluding mustEmbed* / testEmbeddedByValue markers), and emits calls edges to matching methods on any non-generated struct whose method-name set is a superset. 467 bridge edges on cosmos-sdk; bank's UnimplementedMsgServer::Send points to x/bank/keeper/msg_server.go only — not to msgClient siblings or mock files.
  • Trace-failure rewrite — when no static path connects endpoints, inline both endpoints' bodies (capped 120 lines / 3600 chars), their callers (≤6), and callees (≤8) in one response. Replaces a 3-4-call fan-out.
  • Trace endpoint-pairing — scores every from×to candidate combo by shared directory prefix length (full candidate set, not just FTS top-5), with a less-canonical-path penalty (enterprise/, contrib/, examples/, vendor/, third_party/, deprecated/, legacy/) so the canonical-module pair wins. FindPath probe budget capped at 20.
  • Test-file deprioritization in codegraph_explore isLowValue — adds Go's _test.go, Ruby's _spec.rb, JS/TS .test.ts/.spec.tsx, JVM *Test.java/*Spec.kt. Without this, etcd's watchable_store_test.go consumed 5K chars of explore budget.

Explicitly NOT in this PR: go.work parsing. The empirical gate disconfirmed it.

Empirical results (n=2 average per question, headless mode)

Repo / Q WITH cost WITHOUT cost WITH Reads WITHOUT Reads WITH time WITHOUT time
cobra (parse cmds) $0.27 $0.27 0 4 39s 60s
prometheus (scrape→TSDB) $0.63 $0.70 0 6 106s 143s
cosmos-sdk Q1 (MsgSend) $0.41 $0.26 1 2 67s 64s
cosmos-sdk Q2 (MsgDelegate) $0.47 $0.46 0 5 50s 73s
cosmos-sdk Q3 (gov tally) $0.34 $0.31 1.5 3 54s 76s
etcd Q1 (Put→raft) $0.65 $0.78 0 4 98s 129s
etcd Q2 (watch) $0.36 $0.50 0 4+ 58s 89s

Codegraph wins on reads and time across every question. Cost is 3 clean wins, 3 within-10% ties, and 1 stubborn loss on cosmos Q1 — a grep-favored question where the agent's WITHOUT path is structurally short. Compared to baseline, cosmos-sdk's cost gap collapsed from -60% avg to -15% avg, and Q3 went from a 75% loss to a tie.

Tests

  • __tests__/generated-detection.test.ts — 4 unit tests pinning the suffix patterns.
  • frameworks-integration.test.ts — 2 new integration tests for the Go gRPC bridge: positive bridge (stub → hand-written impl) + precision case (don't bridge to a generated sibling like msgClient).
  • Full suite: 1076/1076 pass on macOS Node 22.

Test plan

  • npm test — 1076/1076 pass
  • cosmos-sdk Q1 r1 + r2 (the canonical regression case)
  • cosmos-sdk Q2 + Q3 (different flow patterns)
  • etcd Q1 + Q2 (real go.work repo, different from cosmos)
  • prometheus (real go.work, no protobuf mass — no-regression control)
  • cobra (single-module — no-regression control)
  • Bridge edge spot-check on cosmos-sdk: bank's UnimplementedMsgServer::SendmsgServer::Send, no mock/client false positives

🤖 Generated with Claude Code

colbymchenry and others added 12 commits May 27, 2026 02:28
…ilure inlining

Multi-pronged fix to make codegraph competitive on Go multi-module repos
(cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question
agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the
baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd
deep cross-module flows, while winning cleanly on the single-module and
non-protobuf-heavy repos.

Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes
without it). The actual failure modes were generated-file noise warping
disambiguation, missing gRPC interface→impl bridge in structural-typing Go,
and trace's failure path triggering 3-5 follow-up tool calls instead of
inlining the material the agent needed.

Changes:

- New `src/extraction/generated-detection.ts` — path-pattern classifier
  for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`,
  `mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`,
  `.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in
  `findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI),
  `codegraph_explore` file ranking, and context formatter Entry Points /
  Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3
  instead of #9 on a `Send` search.

- New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` —
  detects `UnimplementedXxxServer` structs in generated files, identifies
  their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC
  markers), and emits `calls` edges to the matching methods on any
  non-generated struct whose method-name set is a superset. Closes Go's
  structural-typing gap that the existing `interfaceOverrideEdges` (Java /
  Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's
  `UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go`
  only, not to `msgClient` siblings or mock files.

- Trace-failure rewrite (`handleTrace`) — when no static path connects
  endpoints, instead of telling the agent to call `codegraph_node` (a
  3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars
  per endpoint), their callers (≤6), and callees (≤8) in one response.

- Trace endpoint-pairing improvements — scores every `from`×`to`
  candidate combo by shared directory prefix and tries the best-paired
  pair first (the full candidate set, not just FTS top-5). A
  less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`,
  `vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the
  canonical-module pair wins even when a side-experiment shares more of
  its directory prefix. Find-path probe budget capped at 20 pairs.

- Test-file deprioritization in `codegraph_explore` `isLowValue` — adds
  suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`,
  `Test.java`, `Spec.kt`) alongside the existing directory-style patterns.
  Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore
  budget that should go to the hand-written flow source.

Tests:

- New `__tests__/generated-detection.test.ts` (4 unit tests) pins the
  suffix patterns.
- New "Go gRPC stub→impl synthesis" integration test suite in
  `frameworks-integration.test.ts` (2 tests): positive bridge from stub
  to hand-written impl, AND the precision case (don't bridge to a
  generated sibling like `msgClient` in the same .pb.go).
- Full suite: 1076/1076 pass.

Empirical (post-fix, n=2 average per question):

| Repo / Q                | WITH       | WITHOUT     | Reads (W/WO) | Time (W/WO)
|-------------------------|------------|-------------|--------------|------------
| cobra (parse cmds)      | $0.27      | $0.27       | 0 / 4        | 39s / 60s
| prometheus (scrape→TSDB)| $0.63      | $0.70       | 0 / 6        | 106s/143s
| cosmos-sdk Q1 (MsgSend) | $0.41      | $0.26       | 1 / 2        | 67s / 64s
| cosmos-sdk Q2 (Delegate)| $0.47      | $0.46       | 0 / 5        | 50s / 73s
| cosmos-sdk Q3 (gov tally)| $0.34     | $0.31       | 1.5 / 3      | 54s / 76s
| etcd Q1 (Put→raft)      | $0.65      | $0.78       | 0 / 4        | 98s / 129s
| etcd Q2 (watch)         | $0.36      | $0.50       | 0 / 4+       | 58s / 89s

Codegraph wins on reads + time on every question. Cost is mixed: 3 clean
wins, 3 tied (within 10%), 1 stubborn cost loss on the grep-favored Q1.
Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15%
on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in
`/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`.

Memory written at `project_go_multi_module_audit.md` for the methodology
+ before/after numbers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a codegraph_context task contains a flow keyword ("trace", "from",
"reach", "flow", "propagat", "how does", "how do") AND at least two
distinct PascalCase / camelCase identifiers, internally invoke trace
between the first two extracted symbols and splice the trace body into
the context response. Conservative trigger by design: false positives
waste one graph query; false negatives just fall back to the agent
calling trace itself (existing path-proximity wiring handles either
case).

Goal: collapse the agent's typical context → trace → explore sequence
into a single context call for clear flow queries, closing the
remaining cost-overhead gap on multi-call patterns. The path-proximity
+ less-canonical-path scoring + the trace-failure-inlined-bodies
behavior already let the inline trace land on the right endpoint pair
and return enough material that no follow-up codegraph_node/Read is
needed.

Doesn't fire on:
- cobra's "How does cobra parse commands and flags?" (no PascalCase
  symbols) — verified in regression run, no behavior change ($0.260
  WITH vs $0.257 WITHOUT, basically tied)
- queries where the agent doesn't call codegraph_context at all
  (cosmos Q1 in the audit went search → trace → node → trace → node)

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-out

The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's
*real* next hop is `k.Keeper.SendCoins` — an interface-method call on an
embedded field that tree-sitter can't resolve. The static getCallees list
for msgServer.Send is all utility/error functions (StringToBytes, Wrapf,
…). The actual flow (SendCoins → subUnlockedCoins → addCoins →
setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also
where the TO endpoint (setBalance) lives.

When trace fails (no static path), inline the **top 5 functions/methods
in the destination file**, ordered by line-distance from the TO node.
This catches the flow that interface-method calls obscure — the
canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java
dependency-injection / Rails service-object dispatch / etc. where
interface dispatch hides the real call.

Conservative: only fires on trace FAILURE (no static path); the success
path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings.
Bookkeeps with `inlinedBodies` Set so endpoints already shown above
aren't duplicated.

Result: cosmos-Q1 — historically the most stubborn cost loss (-2.2× to
-39% across the audit) — flipped to a clean WIN: $0.257 WITH vs $0.449
WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph
calls vs 12. Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1
all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and
fell within that on this run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review feedback: the audit was Go-driven, so the patterns I added
were Go-flavored. Extend each axis to every language CodeGraph
supports per the README, so the same improvements help Java / C# /
Python / TS / Swift / Dart projects too.

**generated-detection.ts** — Added patterns for:
- TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s`
  (ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura).
- Python: `_pb2.pyi` (mypy stubs from protobuf).
- C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp).
- Java: `OuterClass.java` (protoc-gen-java), `Grpc.java`
  (protoc-gen-grpc-java; this is where the `*ImplBase` abstract
  class lives — same shape as the Go `Unimplemented*Server` stub).
- Swift: `.pb.swift` (protoc-gen-swift).
- Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`.
- Rust: `.generated.rs`.

**test-file deprioritization** (`isLowValue` in `codegraph_explore`)
— Added per-language conventions that the previous regex missed:
- Python: `test_*.py` (pytest discovery) and `*_test.py`.
- Ruby: `*_test.rb` (minitest) — `*_spec.rb` already covered.
- C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`.
- Swift: `*Tests.swift` (XCTest).
- Dart: `*_test.dart`.

**IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s
`interfaceOverrideEdges` — extended from `java, kotlin` to
`java, kotlin, csharp, typescript, javascript, swift, scala`. Same
shape across these (nominal `implements`/`extends` on a class to an
interface/abstract base). Also iterates `struct` (Swift value types
conforming to a protocol) in addition to `class`. The existing
matchesSymbol-style logic and `getOutgoingEdges(..., ['implements',
'extends'])` work unchanged.

**CLAUDE.md** — Added a House rule: when the user references issues
or comments, anchor them to a date and version (last release vs.
last main commit vs. current branch tip) BEFORE concluding a fix is
incomplete. Issue #388 comments from May 25-27 were responding to
the released v0.9.5 / merged-PR-469 state — not to this branch's
in-flight work. The new rule walks through the disambiguation:
`grep -m1 '^## \[' CHANGELOG.md` for release version, `git log
--first-parent main -1` for main tip.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cumulative changes targeting the small-repo cost gap surfaced by
the cross-language audit:

1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools).
   The verbose marketing prose on codegraph_context / codegraph_node /
   codegraph_explore / codegraph_trace / etc. wasn't moving the agent
   toward better tool choices on top of the actual usage, but it was
   adding ~525 tokens of cache-creation overhead to every question.
   The trimmed descriptions keep the operational hints (e.g. "Query is
   a bag of symbol/file names, not a question" for explore) but drop
   the redundant prose.

2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a
   project with < 150 indexed files, the MCP server only exposes the
   5 core tools (search, context, node, explore, trace) instead of all
   10 — the omitted callers/callees/impact/status/files tools' use
   cases on a sub-150-file repo reduce to one grep anyway. The MCP
   tool-defs overhead is the #1 source of cost loss on tiny repos
   (~$0.10-0.15 fixed cache-creation per question); cutting 5 tools
   drops that by ~50%.

   Effect on ky (~25 files, the worst pre-fix offender):
     - Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1)
     - After:  $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**)

   Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but
   the gating doesn't regress them — same call-count, same reads.
   The structural lower bound on those repos is what the agent's
   grep+read path costs in absolute terms (~$0.20-0.30).

   Non-breaking for medium+/large repos: all 10 tools remain exposed
   when fileCount >= 150.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ky flip to WIN)

Combines the tool gating from the previous commit with a matching
explore-budget cut for projects under 150 files. The two together close
the cost gap that neither closes alone:

- Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra
- Explore-budget cut alone helped slim slightly but regressed cobra
- COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean

`getExploreOutputBudget(fileCount < 150)` returns:
  maxOutputChars: 13000     (was 18000)
  defaultMaxFiles:  4       (was 5)
  gapThreshold:     7       (was 8)
  maxSymbolsInFileHeader: 5 (was 6)
  maxEdgesPerRelationshipKind: 4 (was 6)
  includeRelationships: true   (kept ON — cheap structural signal)
  maxCharsPerFile: 3800        (unchanged — monotonic invariant w/ next tier)

This survives the cobra-regression-with-trim that the earlier
budget-only attempt suffered: with only 5 tools to choose from, the
agent doesn't fall back to extra codegraph_node calls when explore
returns less — there's no node call available.

Results on the four worst small-repo losses (combined intervention):

| Repo   | Files | WITH (combo)| WITHOUT     | Verdict (pre → post)     |
|--------|-------|-------------|-------------|--------------------------|
| cobra  | ~50   | $0.25       | $0.31       | loss → **WIN** (-19%)    |
| ky     | ~25   | $0.39       | $0.39       | -42% → tied              |
| slim   | ~80   | $0.31       | $0.24       | LOSS 31% → still LOSS    |
| sinatra| ~60   | $0.30       | $0.23       | LOSS 18% → still LOSS    |

sinatra/slim remain a cost-loss because their WITHOUT path is
structurally cheap (~$0.20 — fewer than 4 cheap grep+read calls).
Codegraph can't beat that absolute floor with any meaningful response.
Both still WIN on time + reads + tool-call count.

Tests: tier boundary cases updated to cover the new <150 / 150-499 /
500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated
to include the new 149↔150 boundary. All 1076 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On a <150-file project the entire repo is grep-able in one turn, so the
20-node default `codegraph_context` was paying for a graph subset that
exceeds the agent's actual question. Cutting the tiny-repo default to 8
(typical 1-3 entry points + their immediate 1-hop neighbors) reduces
the context-tool response body without hitting sufficiency on the flow
shapes small repos actually contain.

Non-breaking: the agent can still pass an explicit `maxNodes` to
override; medium+ repos (>=150 files) keep the 20-node default.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search +
context + node + explore + trace) on the tiny-repo tier. The smaller
3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead
but the agent fell back to extra Reads to cover what codegraph_node and
codegraph_explore would have answered — net cost regression on all three
test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented
inline so future tuners don't re-try this dead-end.

No behavior change beyond the comment: the 5-tool gate remains the
production setting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested the hypothesis that exposing FEWER tools on micro repos (<50
files) would close the cost gap. Results:

- 1-tool gate (codegraph_search only):
  - ky:    +44% (worse than 5-tool +30%)
  - express: +107% (catastrophic — was -43% WIN with all 10)
  - cobra: +126% (way worse than 5-tool +17%)

The single-tool gate forces the agent to read everything because it
can't navigate the call graph. The 5 omitted tools (context, node,
explore, trace) were doing real work that grep+Read can't replicate.

Conclusion: 5 tools (search + context + node + explore + trace) is the
empirical lower bound on the tiny-repo tier. Cutting below regresses
EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead
on tiny repos is unavoidable without sacrificing the value codegraph
provides at that scale (which would also make WITH = WITHOUT, defeating
the install).

Comment documents the dead-ends so future tuners don't relitigate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in context, hard-exclude low-value files

Three layered changes targeting the sinatra/slim/small-repo cost gap
that iter2's body-shrink failed to close (smaller bodies just pushed
the agent to Read instead):

1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`).
   Sinatra (~159 files) and slim (~200 files) have the same structural
   problem as cobra (
…siblings in search ranking

On projects with a single file holding the dense majority of internal
call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file
edges), text search was favoring small focused extension files over the
core file. A small focused file like `multi_route.rb` wins on verbatim
name match + file-size normalization, burying the 1500-line core file's
longer method names (e.g. `route!` vs `route`).

Fix: detect the "dominant file" — the file whose in-file edge count is
≥3× the next candidate's — then add +25 to all results sharing its
directory prefix. This pulls the core file's siblings above
sibling-package extensions without hardcoding any repo structure.

`getDominantFile()` excludes test/spec files and generated files
(e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and
would otherwise hijack the boost toward generated protobuf stubs).
SQL pulls the top 20 candidates; path-pattern filtering handles what
SQLite LIKE can't express.
On small projects (<500 files) with a routing-shaped query, build a
URL→handler manifest directly from the graph (each `route` node joins to
its handler via `references`/`calls` edges) and inline the top handler
file's source. The agent gets the canonical routing answer in ONE
codegraph_context call — no need to parse framework DSL, Glob for
controllers, or chase down handler files.

The lever is "make the backend smarter so the agent doesn't have to":
- Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job
  in the WITHOUT arm. Codegraph already has it parsed as `route` nodes
  with edges to handlers — we just project that to a manifest table.
- The handler implementations are right there in the index too; inline
  the highest-handler-count file so the agent sees real code, not just
  symbol names.

Results on the realworld template repos that were losing badly:
  rails-rw  +89% LOSS → -15% WIN  (agent often answers with 0-1 tool calls)
  laravel-rw  +29% LOSS → +12% (tight gap)
  gin-rw    +30% LOSS → +23% (still loss but smaller)
  flask-mb  +64% LOSS → +25% (smaller gap)

The residual losses are mostly the agent's defensive read behavior on
super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a
19-row manifest + service file inlined). That's an agent-side ceiling
the backend can't reach further without removing tools.

Also lands `scripts/agent-eval/probe-sweep.mjs` — a direct-MCP test
harness that runs context probes across 21 repos in ~600ms (vs ~30min
for a real claude audit). Enables rapid iteration on backend changes:
edit tools.ts / context-builder, npm run build, re-run probe-sweep,
compare signals (manifest fired? handler file inlined? response size?)
before paying for a claude run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant