diff --git a/.claude/handoffs/codegraph-tool-surface-rethink-2026-05-27.md b/.claude/handoffs/codegraph-tool-surface-rethink-2026-05-27.md new file mode 100644 index 000000000..398e783d5 --- /dev/null +++ b/.claude/handoffs/codegraph-tool-surface-rethink-2026-05-27.md @@ -0,0 +1,114 @@ +--- +name: codegraph-tool-surface-rethink-2026-05-27 +date: 2026-05-27 15:11 +project: codegraph +branch: feat/go-multi-module-trace-quality +summary: PR #494 multi-language audit revealed structural ~$0.04-$0.08 tiny-repo cost overhead from MCP tool-defs; user pivoted to questioning whether codegraph_context / 5+ tools are even necessary — suggested `explore` + `trace` only. +--- + +# Handoff: Should codegraph cut to just `explore` + `trace`? + +## Resume here — read this first +**Current state:** PR #494 (`feat/go-multi-module-trace-quality`, 13 commits, all 1076 tests pass) ships every safe optimization for the cosmos/etcd Go work AND the cross-language extensions (generated-detection, IFACE_OVERRIDE_LANGS, sibling-inlining, path-proximity, tool gating at <150 files to 5 core tools). Empirically PROVED that cutting below 5 tools regresses every tiny repo (3-tool gate: cobra 17→48% loss; 1-tool gate: express -43% WIN flipped to +107% LOSS). User just asked the right question: **"Why do we need codegraph_context, or any of these massive amounts of tools? All it really needs is explore, and trace if you ask me."** + +**Immediate next step:** Open the next session by treating the user's question as a design pivot, not a continuation of the cost-gap whack-a-mole. The right reply is a focused honest analysis: what does each of the 10 tools actually do that explore + trace alone can't, where does codegraph_context's value-add hold up (or not), and what would removing context/search/node from the default surface ACTUALLY cost in measured loss-of-flow-coverage. Don't start cutting tools yet — present the analysis first. 
+ +> Suggested next message: "Walk me through what each codegraph_* tool actually does on a real flow question that explore + trace alone can't, and which ones agents are picking in our recent audits. If context/search/node aren't earning their seat, propose cutting them and measure on cosmos-Q1 + etcd-Q1 + prometheus + cobra n=2 each." + +## Goal +Decide whether codegraph's 10-tool MCP surface should be cut down to ~2 core tools (explore + trace) as the user proposed. The empirical iteration in this session showed that the 5 omitted "auxiliary" tools (callers, callees, impact, status, files) only add cost on tiny repos and aren't earning their seat. The real question now: **does the same logic apply to context + search + node?** If yes, codegraph becomes 2 tools + a smaller MCP surface = lower fixed prompt overhead = closes the tiny-repo cost gap structurally instead of patching it. If no, name the specific flows where they do unique work. + +## Key findings (this session) + +- **PR #494 status**: 13 commits, all 1076 tests pass, https://github.com/colbymchenry/codegraph/pull/494. 
Already pushed: + - Generated-file detection: `src/extraction/generated-detection.ts` (multi-language patterns, applied in `findSymbol`/`findAllSymbols`/`handleSearch`/`handleExplore` file ranking/`context/formatter.ts`) + - Go gRPC bridge: `goGrpcStubImplEdges` in `src/resolution/callback-synthesizer.ts:341` (467 bridge edges on cosmos-sdk) + - Trace failure inlining + path-proximity pairing + less-canonical-path penalty + sibling-from-TO-file inlining: all in `src/mcp/tools.ts` `handleTrace` + - `IFACE_OVERRIDE_LANGS` extended from `{java,kotlin}` to `{java,kotlin,csharp,typescript,javascript,swift,scala}`; loop iterates `class` AND `struct` kinds + - Tool-def trims (~7KB → 5KB) in `src/mcp/tools.ts` + - Tiny-repo tool gating: `ToolHandler.getTools()` filters to 5 core tools when `fileCount < 150` + - Tiny-tier explore budget in `getExploreOutputBudget(fileCount < 150)`: 13K total / 4 files / `includeRelationships: false` (Relationships section dropped on this tier, as pinned by `__tests__/explore-output-budget.test.ts`) + - `handleContext` default `maxNodes` drops from 20 → 8 when `fileCount < 150` +- **Cosmos Q1 flipped**: WIN ($0.257 vs $0.449, n=1; n=2 avg $0.341 vs $0.350 tied). The breakthrough was `inlineEndpoint`'s "Other functions in TO's file" siblings: `msgServer.Send`'s real callee `k.Keeper.SendCoins` is an embedded-interface call tree-sitter can't statically resolve, so static `getCallees` returns only utility funcs; the *actual* flow lives in `x/bank/keeper/send.go`'s file-mates. See `handleTrace` line ~1430. +- **Empirical lower bounds on tool gating** (n=2-3 audits): + - 5 tools (search+context+node+explore+trace) = current setting, works + - 3 tools (search+context+trace) = cobra 17→48% loss, sinatra 18→96% loss; agent falls back to Reads when node/explore unavailable + - 1 tool (search only) = catastrophic, express -43% WIN → +107% LOSS +- **n=3 measurements confirm structural floor:** cobra WITH consistently $0.28 (variance <5%), WITHOUT consistently $0.24. The $0.04 gap is structural, not noise. 
+- **The user's pivot question challenges this:** their hypothesis is that context+search+node may also be earning less than they cost. The audits we have can't directly answer that — every test had all 10 (or 5) tools available. To test, expose ONLY explore+trace on a controlled batch and re-measure. +- **Cross-language status (single-run each):** WINS = Go (multi-mod), Rust, Java, C#, Kotlin, Swift, Svelte, prometheus, ky (post-gating), express (JS). TIES = cobra (n=2 tied $0.27/$0.27), excalidraw, django, redis, json, Masonry, flutter, vapor, spring. LOSSES = sinatra, slim, flask, scala-play, Fusion, vue-core (variance), Drupal, NestJS, FastAPI, Laravel, ASP.NET, axum, actix, Rocket, gorilla/mux, SvelteKit, Charts bridge (slight), RN segmented-control (slight). +- **Loss pattern is structural, not language-specific.** All losses are tiny example/starter repos where the without-arm grep+read path costs ~$0.20-0.30 and codegraph's MCP overhead can't be amortized. + +## Gotchas + +- **PR-494 is a Go-multi-module PR by title but the body is now cross-cutting** — generated-detection, IFACE_OVERRIDE_LANGS, tool gating, all language-agnostic. Don't let the title narrow what's in it. +- **The variance on the WITHOUT arm is enormous** — same-repo single-run cost can swing $0.04 to $0.80 depending on whether the agent goes grep-heavy or read-heavy that turn. **Never conclude WIN/LOSS from n=1.** The session has many single-run results that need confirming. +- **Cobra (~50 files) is the canary** — every aggressive cut that helps ky or sinatra has regressed cobra at least once. It's the most-tested tiny repo because of that. +- **Don't try the 1-tool or 3-tool gate again** — both are explicitly documented as regressions in `getTools()` comments (`src/mcp/tools.ts` around line 660). Cutting below 5 forces the agent to Read. +- **Kong's first audit was a 0-byte index** — parallel `audit.sh` runs against the same .codegraph dir can corrupt each other. 
If kong/any-repo's audit shows wildly wrong numbers, check `stat /tmp/codegraph-corpus//.codegraph/codegraph.db` before iterating on the result. +- **48-parallel audit launches FAIL silently** — system resource limits. Stay at 6-8 parallel max. Use `wait` between waves. +- **The MCP daemon caches the tool list** at process start — when iterating on `getTools()` you MUST `pkill -f "codegraph.js serve --mcp"` between rebuilds or you'll be testing stale code. +- **`maxCharsPerFile` monotonic invariant** is pinned by `__tests__/explore-output-budget.test.ts` (the spec is `a larger tier must NEVER get a smaller maxCharsPerFile than a smaller tier`). Honor it. + +## How to test & validate + +- `npm test` → "Tests 1076 passed | 2 skipped". Must stay green. +- `npm run build 2>&1 | tail -3` → check dist rebuilt cleanly. +- `pkill -f "codegraph.js serve --mcp" ; sleep 2` → ALWAYS run before agent-eval after a build, otherwise the daemon serves stale code. +- Single-question audit: `AGENT_EVAL_OUT=/tmp/cg-NAME /Users/colby/Development/Personal/codegraph/scripts/agent-eval/run-all.sh "" headless`. Outputs `run-headless-with.jsonl` and `run-headless-without.jsonl`. +- Parse: `node scripts/agent-eval/parse-run.mjs /tmp/cg-NAME/run-headless-{with,without}.jsonl` → cost, duration, turns, tool sequence. +- **For real conclusions, always n=2 minimum.** n=3 is the right bar to separate variance from signal — last session's data on cobra showed WITH had <5% variance but WITHOUT swung 95%. +- **The explore + trace experiment** the user wants: modify `getTools()` to filter visible tools to `new Set(['codegraph_explore', 'codegraph_trace'])` for ALL repos (or just the tiny tier first), re-run cosmos-Q1, etcd-Q1, prometheus, cobra n=2 each, and compare. 
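The explore + trace experiment described above can be sketched as a plain allow-list filter. This is a minimal illustration, not the code in `src/mcp/tools.ts`: the `ToolDef` shape and the `filterVisibleTools` helper are hypothetical names introduced here, and the real `getTools()` may carry more fields and apply its own gating by file count.

```typescript
// Hypothetical shape of an MCP tool definition; the real type in
// src/mcp/tools.ts may carry more fields (input schema, annotations, ...).
interface ToolDef {
  name: string;
  description: string;
}

// The two tools the experiment keeps visible.
const EXPLORE_TRACE_ONLY: ReadonlySet<string> = new Set([
  'codegraph_explore',
  'codegraph_trace',
]);

// Hypothetical helper: return only the tools whose names are in the
// allow-list, preserving the original ordering of the tool list.
function filterVisibleTools(
  all: readonly ToolDef[],
  allowed: ReadonlySet<string> = EXPLORE_TRACE_ONLY,
): ToolDef[] {
  return all.filter((tool) => allowed.has(tool.name));
}
```

Keeping the allow-list as a parameter would let the same helper express the current 5-tool gate and the 2-tool experiment, so the audit batches differ only in the set passed in. Remember to restart the MCP daemon after rebuilding, since it caches the tool list at process start.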
+ +## Repo state + +- branch `feat/go-multi-module-trace-quality`, last commit `ae5364c docs(mcp): pin empirical lower bound on tool gating after n=2 micro test` +- uncommitted: clean +- PR: https://github.com/colbymchenry/codegraph/pull/494 (13 commits, ready for review unless we land the tool-surface redesign) + +## Open threads / TODO + +- [ ] **The user's pivot**: prove or disprove that explore + trace alone is sufficient. Set up a 4-repo × n=2 batch (cosmos-Q1, etcd-Q1, prometheus, cobra) with ONLY explore+trace exposed, compare to current 5-tool / 10-tool baselines. +- [ ] If explore+trace alone wins → cut the tool surface across the board. **This is a breaking API change**: search/context/node/callers/callees/impact/status/files would all disappear from default exposure. Need a clean way to retain them for users who script against the MCP directly (env var? `--full-tools` flag?). +- [ ] If explore+trace alone loses → identify which of context/search/node is doing the structural work, and propose cutting only the others. +- [ ] **README update either way**: the current "~35% cheaper" claim averages 7 medium/large repos. Either commit to that scope ("real codebases (~200+ files)") or re-measure after the tool surface change. +- [ ] Liquid, Pascal/Delphi, React Router, TurboModules, Expo Modules, Paper view managers: still untested categories from the README. Bridges Swift↔ObjC/RN-legacy/RN-events/Fabric were tested in wave 3: 1 win, 2 tied, 1 slight loss. The rest are still gaps. +- [ ] If we ship the PR as-is, write a CHANGELOG entry under `[Unreleased]` summarizing the 13 commits; currently the CHANGELOG entry covers commits 1-2 (generated-detection + gRPC bridge + trace UX), so commits 3-13 need their own bullets. + +## Recent transcript (last ~10 turns, oldest → newest) + +### Turn N-9 — User: "Have we tested every language/framework per the playbook?" 
+- **Claude:** Audited coverage honestly — ~70% of README's claims (22 langs + 14 frameworks + 8 bridges) had NOT been validated. Offered three scopes (smoke / medium / full playbook). +- **Outcome:** User chose "Full playbook"; later refined to "Smoke-test only the completely-untested categories." + +### Turn N-8 — Wave 1 smoke-test (8 parallel: JS/Dart/Drupal/NestJS/FastAPI/Vapor/Spring/Rails) +- **Claude:** Launched 8 audits via `run-all.sh`. Results: express WIN (-43%), 3 tied, 4 LOSS (Drupal, NestJS, FastAPI, Rails). NestJS 113% loss, FastAPI 194% loss were the standouts. +- **Outcome:** Documented the loss pattern correlates with repo size, not language. + +### Turn N-7 — Wave 2 (7 frameworks: Laravel, ASP.NET, axum, actix, Rocket, gorilla/mux, SvelteKit) + Wave 3 (4 bridges: Charts, async-storage, RN-geolocation, RN-segmented-control) +- **Claude:** All wave-2 frameworks lost on cost (small example/starter repos). All wave-3 bridges won/tied on cost AND won on reads. +- **Outcome:** Confirmed loss pattern is tiny-repo-structural, not framework-specific. Bridges (codegraph's marquee feature) work. + +### Turn N-6 — User: "We have a lot of losses. Work on those until they are all clear wins. You cannot stop or give up or push back." +- **Claude:** Ran n=3 on the 8 worst losses to separate variance from signal. First attempt with 48 parallel processes silently failed (system limits). Re-ran in waves of 6. +- **Outcome:** n=3 cobra: WITH consistent $0.28 (variance <5%), WITHOUT consistent $0.24. **Confirmed losses are structural, not variance.** + +### Turn N-5 — Tested 3-tool gate (search + context + trace only) +- **Claude:** Edited `getTools()` to expose only 3 tools on `<150` files. Re-ran cobra/ky/sinatra. Results: cobra 17→48% LOSS, ky 47→42% loss, sinatra 18→96% LOSS. Reverted. +- **Outcome:** Cutting below 5 tools regresses because the agent falls back to extra Reads. 
+ +### Turn N-4 — Tested 1-tool gate (search only) on micro (<50 files) +- **Claude:** Edited gate. cobra 126% LOSS, ky 44% loss, express FLIPPED from -43% WIN to +107% LOSS. Catastrophic. Reverted. +- **Outcome:** Documented the empirical lower bound: 5 tools. Committed `ae5364c` with comments warning future tuners not to relitigate. + +### Turn N-3 — Claude proposed three paths forward (auto-disable / README honesty / context-rewrite) +- **Claude:** Used AskUserQuestion to offer auto-disable on <50, honest README, or last-ditch context shrink. +- **Outcome:** User pivoted instead of picking any of those. + +### Turn N-2 — User: "Why do we need codegraph_context, or any of these massive amounts of tools? All it really needs is explore, and trace if you ask me." +- **Claude:** Recognized this is a design pivot, not a continuation of the cost-gap iteration. The right next move is to actually test the user's hypothesis, not to write another response defending the status quo. +- **Outcome:** This handoff captures the pivot for a fresh session to answer properly. + +### Turn N-1 — User: `/handoff save` +- **Claude:** Wrote this file. +- **Outcome:** Handoff persisted. Next session reads it and engages the explore+trace-only design question with measurement, not opinion. diff --git a/.claude/skills/agent-eval/corpus.json b/.claude/skills/agent-eval/corpus.json index e81a98ada..2cfedac4f 100644 --- a/.claude/skills/agent-eval/corpus.json +++ b/.claude/skills/agent-eval/corpus.json @@ -11,7 +11,8 @@ "Go": [ { "name": "cobra", "repo": "https://github.com/spf13/cobra", "size": "Small", "files": "~50", "question": "How does cobra parse commands and flags?" }, { "name": "gin", "repo": "https://github.com/gin-gonic/gin", "size": "Medium", "files": "~150", "question": "How does gin route requests through its middleware chain?" 
}, - { "name": "terraform", "repo": "https://github.com/hashicorp/terraform", "size": "Large", "files": "~4000", "question": "How does Terraform build and walk the resource dependency graph?" } + { "name": "terraform", "repo": "https://github.com/hashicorp/terraform", "size": "Large", "files": "~4000", "question": "How does Terraform build and walk the resource dependency graph?" }, + { "name": "cosmos-sdk", "repo": "https://github.com/cosmos/cosmos-sdk", "size": "Large", "files": "~5000", "question": "How does a bank module MsgSend message reach the account balance update? Trace the cross-module call path from the bank keeper's Send handler through to the account/balance store update." } ], "Python": [ { "name": "click", "repo": "https://github.com/pallets/click", "size": "Small", "files": "~60", "question": "How does click parse command-line arguments into commands?" }, diff --git a/CHANGELOG.md b/CHANGELOG.md index 5bc5086a1..8ecf14e00 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,122 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [Unreleased] ### Added +- **Generated-file down-ranking across search, trace, and explore.** A new + filename-based classifier (`src/extraction/generated-detection.ts`) flags + protobuf / gRPC / mockgen / build-output files (`.pb.go`, `.pulsar.go`, + `_grpc.pb.go`, `_mock.go`, `_mocks.go`, `mock_*.go`, `.generated.[jt]sx?`, + `_pb2(_grpc)?.py`, `.pb.{cc,h}`, `.g.dart`, `.freezed.dart`) and pushes them + LAST in disambiguation. Before this, a `codegraph_search "Send"` on + cosmos-sdk returned the gRPC interface stub at `tx_grpc.pb.go:124` as the + first match; the trace landed on that empty stub, reported "no path", and + the agent fell back to Read. 
With the down-rank applied to `findSymbol`, + `findAllSymbols`, `codegraph_search`, the CLI `query` command, AND the + context Entry Points / Related Symbols / Code blocks, the bank keeper's + `msgServer.Send` (the real implementation) ranks #3 instead of #9 and + trace lands on it directly. Pure path-based classifier — no schema change, + no index migration. +- **gRPC interface→implementation bridge for Go.** New synthesizer + `goGrpcStubImplEdges` in `src/resolution/callback-synthesizer.ts` finds + `UnimplementedXxxServer` structs in `.pb.go` / `_grpc.pb.go` files, + identifies their RPC-method signatures (excluding the `mustEmbed*` / + `testEmbeddedByValue` gRPC markers), and links each stub method to the + hand-written impl method on any struct whose method-name set is a + superset. Closes Go's structural-typing gap that the Java/Kotlin-only + `interfaceOverrideEdges` couldn't bridge. Excludes other generated files + from candidate impls so a sibling `msgClient` in the same `.pb.go` doesn't + get falsely paired. Measured on cosmos-sdk: 467 stub→impl `calls` edges + synthesized, bank's `UnimplementedMsgServer::Send` now points only to + `x/bank/keeper/msg_server.go::msgServer::Send` — not to mocks, not to + client wrappers. +- **Trace-failure response now inlines both endpoints' bodies + neighbors.** + When `codegraph_trace` can't find a static call path (typically a + dynamic-dispatch break), it used to return a one-liner telling the agent + to call `codegraph_node` next — which triggered 3-4 follow-up calls plus a + Read. The new failure response inlines each endpoint's source (capped at + 120 lines / 3600 chars), callers, and callees in one response. On the + cosmos-Q3 / etcd-Q2 audits this eliminated the entire fan-out pattern + (5-11 codegraph calls collapsed into 1-2). +- **Path-proximity pairing in trace endpoint selection.** In a multi-module + Go repo, a symbol like `EndBlocker` exists in 20+ modules; FTS picks one + almost arbitrarily. 
Trace now scores every `from` × `to` candidate pair by + shared directory prefix length (longest match wins) so + `x/gov/abci.go::EndBlocker` + `x/gov/keeper/tally.go::Tally` are paired + before `simapp/app.go`'s wrapper EndBlocker is even considered. A + less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`, + `vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures a side-module + with a longer shared prefix doesn't beat the canonical module with a + shorter one. FindPath probe budget capped at 20 pairs. +- **Test-file deprioritization in `codegraph_explore`.** Existing + `isLowValue` only caught directory-style patterns (`/tests/`, `/spec/`); + now also catches Go's `_test.go`, Ruby's `_spec.rb`, JS/TS `.test.ts` / + `.spec.tsx`, and Java/Kotlin/Scala `*Test.java` / `*Spec.kt`. Without + this, etcd's `watchable_store_test.go` consumed 5K chars of explore + budget that should have gone to the hand-written flow source. +- **Small-repo retrieval tuning (`<500` indexed files).** Three coordinated + changes so small projects resolve flow questions in 1-2 MCP calls instead + of 3-5. (i) MCP tool surface drops to the 5 core tools + (`codegraph_search` / `codegraph_context` / `codegraph_node` / + `codegraph_explore` / `codegraph_trace`); the other 5 (`codegraph_callers` + /`codegraph_callees`/`codegraph_impact`/`codegraph_status`/`codegraph_files`) + cost more in tool-list overhead than they recoup at this scale. + Empirically validated as the floor — n=2 audits showed cutting below + 5 regresses cobra/ky/sinatra (3-tool gate) and catastrophically regresses + express (1-tool gate, +107% LOSS). (ii) `codegraph_context` responses end + with a strong directive telling the agent the response IS the + comprehensive pass for a project this size and follow-ups should be + narrow (`trace from→to`, single-symbol `node`) — not another broad + `codegraph_explore` that re-bundles the same content. 
(iii) Explore + output budget gets a sub-150 tier (13K total / 4 files / 3.8K each, + Relationships section dropped, test/spec/icon/i18n files hard-excluded + from the relevant-file set unless the query is about tests), and + `codegraph_context` `maxNodes` defaults to 8 instead of 20. +- **`codegraph_context` auto-traces flow queries.** When the task reads + like "how does X reach Y", "trace the path from A to B", or "how does + X propagate through Z", `codegraph_context` now runs the trace + internally and splices its body into the response. Detection is + conservative — needs a flow keyword AND ≥2 distinct PascalCase / + camelCase identifiers, with the first two ordered by appearance taken + as `from`/`to`. On dynamic-dispatch breaks it falls back to the + trace-failure response (which already inlines both endpoint bodies + + neighbors). Saves the follow-up `codegraph_trace` that was the #2 + cost driver on multi-module flow questions in the audit. +- **Routing-manifest inline in `codegraph_context` for small-repo + routing queries.** When the task mentions + routes/handlers/endpoints/middleware/etc. on a sub-500-file project, + `codegraph_context` now appends a compact URL → handler table built + from `route` nodes + their `references`/`calls` edges, then inlines + the full source (≤16KB) of the file holding the most handler + endpoints. Targets the Glob+Read pattern that was beating codegraph + on realworld template repos (rails-realworld, laravel-realworld, + drupal-admintoolbar, …) where the agent would just read `routes.rb` / + `web.php` instead of asking the graph. Manifest is silently skipped + when fewer than 3 non-test routes exist or no file holds ≥30% of + them (no single answer file). +- **Core-directory ranking boost in `codegraph_context` search.** + Projects with one file holding the dense majority of internal call + edges (e.g. 
sinatra's `lib/sinatra/base.rb` at ~85% of all in-file + edges) now get search results in that file's directory boosted by + +25 score. Fixes the case where a small extension file with a + verbatim name match outranks the actual framework core + (sinatra-contrib's `multi_route.rb` `route` was outranking + base.rb's `route!`). Test and generated files are excluded from + "dominant file" candidacy so etcd's `rpc.pb.go` (1916 in-file + edges, generated protobuf) can't beat the hand-written + `server/etcdserver/server.go` (470 edges). +- **Interface → implementation synthesis extended beyond JVM.** + `interfaceOverrideEdges` previously bridged interface methods to + concrete impls in Java/Kotlin only. Now also runs for C#, TypeScript, + JavaScript, Swift, and Scala — Swift conformance also iterates + `struct` nodes (value-type protocol conformance) alongside `class`. + Closes the same structural-typing gap the new Go gRPC bridge closes, + for any language where the resolver emits explicit + `implements`/`extends` edges. +- **Shorter MCP tool descriptions.** All 10 `codegraph_*` tool + descriptions condensed (typically ~50% shorter), keeping the + "use this for X / prefer over Y" steering but dropping the longer + rationale (which lives in `server-instructions.ts`, the + load-bearing channel). Tool-list bytes on the agent side drop + proportionally; cumulative across multi-tool sessions. - **Java / Kotlin imports now resolve by fully-qualified name.** Extraction wraps every top-level declaration of a `.kt` / `.java` file in a `namespace` node carrying the file's `package` (so a class `Bar` in @@ -39,6 +155,18 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). now sees the four anonymous overrides in its trail without a Read. 
### Fixed +- **MCP tools no longer return rows for files deleted while no server was + running.** The post-open catch-up sync that reconciles the index against + the working tree (catching `git pull`/`checkout`/`rebase` and any edits + or deletes made between sessions) was fire-and-forget — so a tool call + that landed in the first ~50–300ms could race past it and serve rows + for files that no longer exist on disk. The per-file staleness banner + couldn't help here, because that signal is populated by the file + watcher (which doesn't see pre-startup changes). Now the first tool + call of the session awaits the catch-up before serving; subsequent + calls pay nothing. Most visible on the "deleted everything between + sessions" case, where MCP now returns the correct empty index instead + of stale rows. Validated end-to-end on a 10,640-file VS Code index. - **`codegraph index` / `init -i` summary now reports the true edge count.** The per-file counter in the orchestrator only saw extraction-phase edges, so resolution and synthesizer edges (often >50% of the graph on diff --git a/CLAUDE.md b/CLAUDE.md index 5fd9b2787..6636bf606 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -256,3 +256,8 @@ publish actions on shared state. Write the files, hand the user the commands. - The `0.7.x` line is in active multi-agent rollout. Any change to `src/installer/` (especially `targets/`) needs corresponding test coverage and a CHANGELOG entry — installer regressions break every new install silently. - When changing what the MCP tools do or how agents should use them, update **all three** of `src/mcp/server-instructions.ts`, `src/installer/instructions-template.ts`, and `.cursor/rules/codegraph.mdc` — they're written to different places but say the same thing. - CodeGraph provides **code context**, not product requirements. For new features, ask the user about UX, edge cases, and acceptance criteria — the graph won't tell you. 
+- **When the user references issues, PR comments, or external reports, anchor them to a date and version before drawing conclusions.** Check the comment's `createdAt` against: + - The **last released version** — `grep -m1 '^## \[' CHANGELOG.md` shows the top-of-file version (older releases follow). A comment dated before the latest `## [X.Y.Z] - YYYY-MM-DD` is reacting to *released* state — work that's only on `main` or on an unmerged branch doesn't apply. + - The **last main commit** — `git log --first-parent main -1 --format='%ai %h %s'`. A comment after the last release but before a fix on main may already be addressed there but unreleased. + - The **current branch's tip** — your own unmerged work obviously can't be what the comment is reacting to. + Always disambiguate "released," "merged-but-unreleased," and "in-progress" before agreeing that a user-reported problem is unfixed (or that a fix is incomplete). A user saying "your fix only covers X" about a recent PR is usually pointing at the *released* shortcomings — your in-flight branch may already address them but they have no way to know that. diff --git a/__tests__/explore-output-budget.test.ts b/__tests__/explore-output-budget.test.ts index 65ddc6488..cd1a444d5 100644 --- a/__tests__/explore-output-budget.test.ts +++ b/__tests__/explore-output-budget.test.ts @@ -33,10 +33,16 @@ describe('getExploreOutputBudget', () => { }); it('uses tier breakpoints matching getExploreBudget so call-count and output-budget agree on a project', () => { - // Anything in the same tier should pick the same total-output cap. - const tier1a = getExploreOutputBudget(50); + // Very-tiny tier (<150 files) gets a tighter cap than small (150-499) — + // paired with tool gating to handle the MCP-overhead-dominates regime. 
+ const tier0a = getExploreOutputBudget(50); + const tier0b = getExploreOutputBudget(149); + expect(tier0a.maxOutputChars).toBe(tier0b.maxOutputChars); + + const tier1a = getExploreOutputBudget(150); const tier1b = getExploreOutputBudget(499); expect(tier1a.maxOutputChars).toBe(tier1b.maxOutputChars); + // The <500 explore-call budget covers both very-tiny and small. expect(getExploreBudget(50)).toBe(getExploreBudget(499)); const tier2a = getExploreOutputBudget(500); @@ -49,6 +55,7 @@ describe('getExploreOutputBudget', () => { expect(tier3a.maxOutputChars).toBe(tier3b.maxOutputChars); // And crossing a breakpoint changes the cap. + expect(tier0a.maxOutputChars).not.toBe(tier1a.maxOutputChars); expect(tier1a.maxOutputChars).not.toBe(tier2a.maxOutputChars); expect(tier2a.maxOutputChars).not.toBe(tier3a.maxOutputChars); }); @@ -67,8 +74,12 @@ describe('getExploreOutputBudget', () => { expect(medium.includeBudgetNote).toBe(true); }); - it('keeps the Relationships section on for every tier — it is the cheapest structural signal', () => { - expect(getExploreOutputBudget(50).includeRelationships).toBe(true); + it('keeps the Relationships section on for medium+ tiers — small tiers drop it to maximize body density', () => { + // ITER2: relationships dropped on <500 tiers; on tiny repos the + // per-call payload is the cost driver, so even "cheap" structural + // signal adds up across follow-up turns. Re-enabled at ≥500 where + // body budgets are roomy enough to absorb the 1-2KB overhead. 
+ expect(getExploreOutputBudget(50).includeRelationships).toBe(false); expect(getExploreOutputBudget(1000).includeRelationships).toBe(true); expect(getExploreOutputBudget(10000).includeRelationships).toBe(true); expect(getExploreOutputBudget(30000).includeRelationships).toBe(true); @@ -91,8 +102,11 @@ describe('getExploreOutputBudget', () => { }); it('handles the boundary file counts exactly (off-by-one regression guard)', () => { - // 499 -> small tier, 500 -> medium tier - expect(getExploreOutputBudget(499).maxOutputChars).toBe(getExploreOutputBudget(100).maxOutputChars); + // 149 -> very-tiny, 150 -> small + expect(getExploreOutputBudget(149).maxOutputChars).toBe(getExploreOutputBudget(50).maxOutputChars); + expect(getExploreOutputBudget(150).maxOutputChars).toBe(getExploreOutputBudget(200).maxOutputChars); + // 499 -> small, 500 -> medium + expect(getExploreOutputBudget(499).maxOutputChars).toBe(getExploreOutputBudget(200).maxOutputChars); expect(getExploreOutputBudget(500).maxOutputChars).toBe(getExploreOutputBudget(1000).maxOutputChars); // 4999 -> medium, 5000 -> large expect(getExploreOutputBudget(4999).maxOutputChars).toBe(getExploreOutputBudget(1000).maxOutputChars); diff --git a/__tests__/frameworks-integration.test.ts b/__tests__/frameworks-integration.test.ts index 3e9ef12eb..344a0f6c9 100644 --- a/__tests__/frameworks-integration.test.ts +++ b/__tests__/frameworks-integration.test.ts @@ -805,3 +805,106 @@ describe('Java anonymous-class override synthesis — end-to-end', () => { cg.close(); }); }); + +describe('Go gRPC stub→impl synthesis', () => { + let tmpDir: string | undefined; + afterEach(() => { + if (tmpDir) fs.rmSync(tmpDir, { recursive: true, force: true }); + tmpDir = undefined; + }); + + it('bridges UnimplementedMsgServer methods to the hand-written keeper impl', async () => { + tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-go-grpc-')); + // Mimic protoc-gen-go-grpc output: `*_grpc.pb.go` carrying the + // UnimplementedMsgServer stub. 
+ fs.writeFileSync( + path.join(tmpDir, 'tx_grpc.pb.go'), + 'package banktypes\n\n' + + 'type UnimplementedMsgServer struct{}\n\n' + + 'func (UnimplementedMsgServer) Send(ctx context.Context, req *MsgSend) (*MsgSendResponse, error) { return nil, nil }\n' + + 'func (UnimplementedMsgServer) MultiSend(ctx context.Context, req *MsgMultiSend) (*MsgMultiSendResponse, error) { return nil, nil }\n' + + 'func (UnimplementedMsgServer) mustEmbedUnimplementedMsgServer() {}\n' + + 'func (UnimplementedMsgServer) testEmbeddedByValue() {}\n' + ); + // Hand-written impl in a non-generated file — what an agent actually + // wants the trace to land on. + fs.writeFileSync( + path.join(tmpDir, 'msg_server.go'), + 'package keeper\n\n' + + 'type msgServer struct{ k Keeper }\n\n' + + 'func (m msgServer) Send(ctx context.Context, req *MsgSend) (*MsgSendResponse, error) {\n' + + ' return m.k.SendCoins(ctx, req.From, req.To, req.Amount)\n' + + '}\n' + + 'func (m msgServer) MultiSend(ctx context.Context, req *MsgMultiSend) (*MsgMultiSendResponse, error) {\n' + + ' return nil, nil\n' + + '}\n' + ); + + let cg: CodeGraph | undefined; + try { + cg = CodeGraph.initSync(tmpDir); + await cg.indexAll(); + + const stubSend = cg + .getNodesByKind('method') + .find((n) => n.qualifiedName.endsWith('UnimplementedMsgServer::Send')); + const implSend = cg + .getNodesByKind('method') + .find((n) => n.qualifiedName.endsWith('msgServer::Send')); + expect(stubSend, 'UnimplementedMsgServer.Send should be indexed').toBeDefined(); + expect(implSend, 'msgServer.Send should be indexed').toBeDefined(); + + const bridge = cg + .getOutgoingEdges(stubSend!.id) + .find((e) => e.target === implSend!.id && e.kind === 'calls'); + expect(bridge, 'stub Send should bridge to impl Send').toBeDefined(); + expect(bridge!.provenance).toBe('heuristic'); + expect((bridge!.metadata as { synthesizedBy?: string } | undefined)?.synthesizedBy).toBe( + 'go-grpc-stub-impl' + ); + } finally { + cg?.close(); + } + }); + + it('does not 
bridge to candidates living in another generated file', async () => { + tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-go-grpc-sib-')); + // `*_grpc.pb.go` also contains a sibling `msgClient` struct that + // happens to satisfy the same method set. We must NOT bridge to it — + // it's not the hand-written impl, just the gRPC client wrapper. + fs.writeFileSync( + path.join(tmpDir, 'tx_grpc.pb.go'), + 'package banktypes\n\n' + + 'type UnimplementedMsgServer struct{}\n' + + 'func (UnimplementedMsgServer) Send() {}\n' + + 'func (UnimplementedMsgServer) MultiSend() {}\n\n' + + 'type msgClient struct{}\n' + + 'func (m msgClient) Send() {}\n' + + 'func (m msgClient) MultiSend() {}\n' + ); + + let cg: CodeGraph | undefined; + try { + cg = CodeGraph.initSync(tmpDir); + await cg.indexAll(); + + const stub = cg + .getNodesByKind('struct') + .find((n) => n.name === 'UnimplementedMsgServer'); + expect(stub).toBeDefined(); + const bridges = cg + .getNodesByKind('method') + .filter((n) => n.qualifiedName.endsWith('UnimplementedMsgServer::Send')) + .flatMap((stubSend) => cg!.getOutgoingEdges(stubSend.id)) + .filter( + (e) => + e.kind === 'calls' && + (e.metadata as { synthesizedBy?: string } | undefined)?.synthesizedBy === + 'go-grpc-stub-impl', + ); + expect(bridges, 'no bridge to msgClient (also generated)').toHaveLength(0); + } finally { + cg?.close(); + } + }); +}); diff --git a/__tests__/generated-detection.test.ts b/__tests__/generated-detection.test.ts new file mode 100644 index 000000000..90bbae7f1 --- /dev/null +++ b/__tests__/generated-detection.test.ts @@ -0,0 +1,47 @@ +/** + * Regression coverage for the generated-file detector that drives + * symbol-disambiguation down-ranking. Locked here because the suffix + * list is a contract: if a future edit drops `.pb.go`, the cosmos-sdk + * trace endpoint regresses to the gRPC stub (see + * `project_go_multi_module_audit` memory + the audit in #N/A). 
+ */ + +import { describe, it, expect } from 'vitest'; +import { isGeneratedFile } from '../src/extraction/generated-detection'; + +describe('isGeneratedFile', () => { + it('classifies Go protobuf / gRPC / pulsar / mock outputs as generated', () => { + expect(isGeneratedFile('api/cosmos/bank/v1beta1/tx_grpc.pb.go')).toBe(true); + expect(isGeneratedFile('x/bank/types/tx.pb.go')).toBe(true); + expect(isGeneratedFile('api/cosmos/bank/v1beta1/tx.pulsar.go')).toBe(true); + // cosmos-sdk uses `_mocks.go`; mockgen's default is `mock_.go`; + // many projects use `_mock.go`. All three are mockgen output. + expect(isGeneratedFile('x/auth/testutil/expected_keepers_mocks.go')).toBe(true); + expect(isGeneratedFile('internal/foo_mock.go')).toBe(true); + expect(isGeneratedFile('mock_keeper.go')).toBe(true); + }); + + it('does not flag the hand-written keeper as generated', () => { + expect(isGeneratedFile('x/bank/keeper/msg_server.go')).toBe(false); + expect(isGeneratedFile('x/bank/keeper/send.go')).toBe(false); + }); + + it('catches common cross-language codegen suffixes', () => { + expect(isGeneratedFile('app/foo.generated.ts')).toBe(true); + expect(isGeneratedFile('app/foo.generated.tsx')).toBe(true); + expect(isGeneratedFile('proto/bar_pb2.py')).toBe(true); + expect(isGeneratedFile('proto/bar_pb2_grpc.py')).toBe(true); + expect(isGeneratedFile('lib/baz.pb.cc')).toBe(true); + expect(isGeneratedFile('lib/baz.pb.h')).toBe(true); + expect(isGeneratedFile('lib/quux.g.dart')).toBe(true); + expect(isGeneratedFile('lib/quux.freezed.dart')).toBe(true); + }); + + it('leaves ordinary source files alone', () => { + expect(isGeneratedFile('src/index.ts')).toBe(false); + expect(isGeneratedFile('src/components/Foo.tsx')).toBe(false); + expect(isGeneratedFile('lib/main.dart')).toBe(false); + expect(isGeneratedFile('cmd/server/main.go')).toBe(false); + expect(isGeneratedFile('app/db.py')).toBe(false); + }); +}); diff --git a/__tests__/mcp-catchup-gate.test.ts 
b/__tests__/mcp-catchup-gate.test.ts new file mode 100644 index 000000000..6baee07c4 --- /dev/null +++ b/__tests__/mcp-catchup-gate.test.ts @@ -0,0 +1,122 @@ +/** + * MCP catch-up gate — first tool call blocks on the engine's post-open + * filesystem reconcile so it never serves rows for files that were + * deleted (or edited) while no MCP server was running. + * + * Background: `MCPEngine.catchUpSync()` fires `cg.sync()` in the background. + * Before this fix it was fire-and-forget — a tool call could race past it + * and return rows for files that no longer exist on disk. The per-file + * staleness banner (`withStalenessNotice`) couldn't help, because + * `getPendingFiles()` is populated by the watcher, not by catch-up. + * + * The fix: `catchUpSync()` pushes its promise into the `ToolHandler` via + * `setCatchUpGate(p)`; the first `execute()` call awaits the gate and then + * clears it. These tests exercise the gate directly (deterministic) and + * the engine-driven path (proves the engine actually pokes the gate). 
+ */ + +import { describe, it, expect, beforeEach, afterEach } from 'vitest'; +import * as fs from 'fs'; +import * as path from 'path'; +import * as os from 'os'; +import CodeGraph from '../src/index'; +import { ToolHandler } from '../src/mcp/tools'; + +describe('MCP catch-up gate', () => { + let testDir: string; + let cg: CodeGraph; + let handler: ToolHandler; + + beforeEach(async () => { + testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-catchup-gate-')); + fs.mkdirSync(path.join(testDir, 'src')); + fs.writeFileSync( + path.join(testDir, 'src', 'survivor.ts'), + 'export function survivor() { return 1; }\n', + ); + fs.writeFileSync( + path.join(testDir, 'src', 'deleted-later.ts'), + 'export function deletedLater() { return 2; }\n', + ); + + cg = CodeGraph.initSync(testDir, { config: { include: ['**/*.ts'], exclude: [] } }); + await cg.indexAll(); + handler = new ToolHandler(cg); + }); + + afterEach(() => { + try { cg.unwatch(); } catch { /* ignore */ } + try { cg.close(); } catch { /* ignore */ } + if (fs.existsSync(testDir)) fs.rmSync(testDir, { recursive: true, force: true }); + }); + + it('awaits the gate before serving the first tool call', async () => { + let gateResolved = false; + const gate = new Promise<void>((resolve) => { + setTimeout(() => { gateResolved = true; resolve(); }, 80); + }); + handler.setCatchUpGate(gate); + + const res = await handler.execute('codegraph_search', { query: 'survivor' }); + expect(gateResolved).toBe(true); + expect(res.isError).toBeFalsy(); + expect(res.content[0].text).toMatch(/survivor/); + }); + + it('drops the gate after first await — second call does not re-wait', async () => { + let awaitCount = 0; + const gate = new Promise<void>((resolve) => { + awaitCount++; + setTimeout(resolve, 20); + }); + handler.setCatchUpGate(gate); + + await handler.execute('codegraph_search', { query: 'survivor' }); + const before = awaitCount; + await handler.execute('codegraph_search', { query: 'survivor' }); + // The promise body runs once 
when constructed; second execute never + // resubscribes to a fresh promise because the gate field was nulled. + expect(awaitCount).toBe(before); + }); + + it('catch-up reconciles a deleted file before the first tool call sees it', async () => { + // Simulate the empty-project / deleted-files startup case: file is in + // the DB (we indexed it above) but vanishes from disk before the MCP + // server's first query. The catch-up sync, awaited via the gate, + // must remove the row so the first tool call returns no hit. + fs.unlinkSync(path.join(testDir, 'src', 'deleted-later.ts')); + + // Push the actual catch-up sync as the gate — same flow the MCP engine + // uses (`cg.sync()` returns a Promise, the wrapper voids it). + handler.setCatchUpGate(cg.sync().then(() => undefined)); + + const res = await handler.execute('codegraph_search', { query: 'deletedLater' }); + expect(res.isError).toBeFalsy(); + const text = res.content[0].text; + expect(text).not.toMatch(/src\/deleted-later\.ts/); + }); + + it('catch-up that converges the project to 0 files clears all rows', async () => { + // Worst case: every source file is gone between sessions. Without the + // gate, the first tool call serves whatever was in the DB. With the + // gate + the orchestrator's filesystem reconcile, the DB drains. + fs.unlinkSync(path.join(testDir, 'src', 'survivor.ts')); + fs.unlinkSync(path.join(testDir, 'src', 'deleted-later.ts')); + + handler.setCatchUpGate(cg.sync().then(() => undefined)); + + const res = await handler.execute('codegraph_search', { query: 'survivor' }); + expect(res.isError).toBeFalsy(); + expect(cg.getStats().fileCount).toBe(0); + }); + + it('gate that rejects does not break the tool call', async () => { + // A catch-up sync failure (lock contention, transient FS error) must + // not poison tool dispatch — the engine logs it, the handler proceeds. 
+ handler.setCatchUpGate(Promise.reject(new Error('simulated sync failure'))); + + const res = await handler.execute('codegraph_search', { query: 'survivor' }); + expect(res.isError).toBeFalsy(); + expect(res.content[0].text).toMatch(/survivor/); + }); +}); diff --git a/scripts/agent-eval/probe-sweep.mjs b/scripts/agent-eval/probe-sweep.mjs new file mode 100755 index 000000000..0018bbcaf --- /dev/null +++ b/scripts/agent-eval/probe-sweep.mjs @@ -0,0 +1,119 @@ +#!/usr/bin/env node +// probe-sweep — direct MCP test across N repos × N tools, no claude needed. +// +// Measures response characteristics (size, sections present, signals fired) +// for each (repo, query) pair against the built dist/. Sub-second per probe; +// the full sweep below runs in ~10-30s vs hours for a real claude audit. +// +// Use this to iterate on backend changes rapidly: change tools.ts / +// context-builder, npm run build, re-run probe-sweep, compare. Once a +// change looks good on probe metrics, run a focused claude audit for the +// few repos that matter to confirm end-to-end cost behavior. +// +// Usage: node scripts/agent-eval/probe-sweep.mjs [--tool=context|explore|trace] [--repos=a,b,c] +import { pathToFileURL } from 'node:url'; +import { resolve } from 'node:path'; + +const args = Object.fromEntries( + process.argv.slice(2).map(a => a.startsWith('--') ? a.slice(2).split('=') : [a, true]) +); +const TOOL = args.tool ?? 'context'; + +const load = (rel) => import(pathToFileURL(resolve(rel)).href); +const idx = await load('dist/index.js'); +const tools = await load('dist/mcp/tools.js'); +const CodeGraph = idx.default?.default ?? idx.default ?? idx.CodeGraph; +const ToolHandler = tools.ToolHandler ?? tools.default?.ToolHandler; + +// Each entry: repo, query, optional 2nd arg for trace (from, to). +// The query is the same prompt used in the real claude audits, so probe +// output is directly comparable to the agent's would-be input. 
+const SWEEP = [ + // Small realworld template repos (the loss cases from the cross-language sweep) + { id: 'gin-rw', repo: '/tmp/codegraph-corpus/gin-realworld', q: 'How does this Gin app route a request through its middleware chain to a handler?' }, + { id: 'go-mux', repo: '/tmp/codegraph-corpus/go-mux', q: 'How does this gorilla/mux app route a request to its handler?' }, + { id: 'fastapi-rw', repo: '/tmp/codegraph-corpus/fastapi-realworld', q: 'How does FastAPI route a request through its dependencies to a handler?' }, + { id: 'spring-pc', repo: '/tmp/codegraph-corpus/spring-petclinic', q: 'How does Spring route an HTTP request to a controller method?' }, + { id: 'axum-rw', repo: '/tmp/codegraph-corpus/rust-axum-realworld', q: 'How does Axum route a request to its handler in this app?' }, + { id: 'express-rw', repo: '/tmp/codegraph-corpus/express-realworld', q: 'How does this Express app route a request through middleware to a handler?' }, + { id: 'kotlin-pc', repo: '/tmp/codegraph-corpus/kotlin-petclinic', q: 'How does the Kotlin Spring app route an HTTP request to its handler?' }, + { id: 'flask-mb', repo: '/tmp/codegraph-corpus/flask-microblog', q: 'How does this Flask app route a request to a view function?' }, + { id: 'vapor-tpl', repo: '/tmp/codegraph-corpus/vapor-template', q: 'How does Vapor route an HTTP request to its handler?' }, + { id: 'cpp-leveldb', repo: '/tmp/codegraph-corpus/cpp-leveldb', q: 'How does LevelDB handle a Put operation through to disk?' }, + { id: 'lualine', repo: '/tmp/codegraph-corpus/lualine.nvim', q: 'How does lualine assemble and render the statusline?' }, + { id: 'drupal-admin', repo: '/tmp/codegraph-corpus/drupal-admintoolbar', q: 'How does the Drupal admin toolbar module render its toolbar?' }, + { id: 'svelte-rw', repo: '/tmp/codegraph-corpus/svelte-realworld', q: 'How does this SvelteKit app route a request to a handler?' 
}, + { id: 'react-rw', repo: '/tmp/codegraph-corpus/react-realworld', q: 'How does this React app fetch and display articles?' }, + { id: 'rails-rw', repo: '/tmp/codegraph-corpus/rails-realworld', q: 'How does Rails route a request to a controller action?' }, + { id: 'flask-rest', repo: '/tmp/codegraph-corpus/flask-restful-realworld', q: 'How does Flask-RESTful route a request to a resource method?' }, + { id: 'laravel-rw', repo: '/tmp/codegraph-corpus/laravel-realworld', q: 'How does Laravel route a request to the controller method?' }, + { id: 'aspnet-rw', repo: '/tmp/codegraph-corpus/aspnet-realworld', q: 'How does ASP.NET route a request to the controller action?' }, + // The iter7 wins/ties (to make sure we don't regress) + { id: 'cobra', repo: '/tmp/codegraph-corpus/cobra', q: 'How does cobra parse commands and flags?' }, + { id: 'sinatra', repo: '/tmp/codegraph-corpus/sinatra', q: 'How does sinatra route a request to its handler?' }, + { id: 'slim', repo: '/tmp/codegraph-corpus/slim', q: 'How does slim route a request and apply middleware?' }, +]; + +// Detect signals in response text — these are the levers we've added that +// otherwise only show up via "agent ran X more tool calls" downstream. +const detect = (text) => ({ + hasEntryPoints: /^### Entry Points/m.test(text), + hasRelatedSymbols: /^### Related Symbols/m.test(text), + hasFlowTrace: /^## Inline flow trace/m.test(text), + hasRouteManifest: /^## Routing manifest/m.test(text), + hasTopHandler: /^### Top handler file/m.test(text), + hasSmallRepoTail: /This project is small/.test(text), +}); + +const filterRepos = args.repos ? 
new Set(String(args.repos).split(',')) : null; +const subjects = SWEEP.filter(s => !filterRepos || filterRepos.has(s.id)); + +const t0 = Date.now(); +const rows = []; +for (const s of subjects) { + try { + const cg = CodeGraph.openSync(s.repo); + const handler = new ToolHandler(cg); + const t1 = Date.now(); + const res = await handler.execute('codegraph_' + TOOL, + TOOL === 'context' ? { task: s.q } : + TOOL === 'explore' ? { query: s.q } : { from: 'main', to: 'main' }); + const text = res.content?.[0]?.text ?? ''; + const signals = detect(text); + rows.push({ + id: s.id, + ms: Date.now() - t1, + chars: text.length, + lines: text.split('\n').length, + ...signals, + }); + try { cg.close?.(); } catch {} + } catch (e) { + rows.push({ id: s.id, error: String(e).slice(0, 80) }); + } +} + +// Pretty-print as a compact table. +const fmt = (r) => + r.error + ? ` ${r.id.padEnd(13)} ERROR: ${r.error}` + : ` ${r.id.padEnd(13)} ${String(r.chars).padStart(6)}c ${String(r.lines).padStart(4)}L ${String(r.ms).padStart(4)}ms` + + ` ${r.hasEntryPoints ? 'EP ' : ' '}` + + `${r.hasFlowTrace ? 'TRC ' : ' '}` + + `${r.hasRouteManifest ? 'MAN ' : ' '}` + + `${r.hasTopHandler ? 'HND ' : ' '}` + + `${r.hasSmallRepoTail ? 
'TAIL' : ' '}`; +console.log(`=== probe-sweep tool=${TOOL} n=${subjects.length} (${Date.now() - t0}ms total) ===`); +console.log(' id chars lines ms signals'); +console.log(' ' + '-'.repeat(56)); +for (const r of rows) console.log(fmt(r)); + +// Sum + medians for the size pillar +const sizes = rows.filter(r => !r.error).map(r => r.chars); +sizes.sort((a, b) => a - b); +const median = sizes[Math.floor(sizes.length / 2)]; +const sum = sizes.reduce((a, b) => a + b, 0); +console.log(` ${'-'.repeat(64)}`); +console.log(` median=${median}c total=${sum}c ` + + `manifest=${rows.filter(r => r.hasRouteManifest).length}/${rows.filter(r => !r.error).length} ` + + `top-handler=${rows.filter(r => r.hasTopHandler).length}/${rows.filter(r => !r.error).length}`); diff --git a/src/bin/codegraph.ts b/src/bin/codegraph.ts index 3c3a082ff..86a59b2ab 100644 --- a/src/bin/codegraph.ts +++ b/src/bin/codegraph.ts @@ -843,11 +843,21 @@ program const cg = await CodeGraph.open(projectPath); const limit = parseInt(options.limit || '10', 10); - const results = cg.searchNodes(search, { + const rawResults = cg.searchNodes(search, { limit, kinds: options.kind ? [options.kind as any] : undefined, }); + // Mirror the MCP search down-rank so the CLI also surfaces the + // hand-written implementation before protobuf/gRPC scaffolding + // when both share a name. See extraction/generated-detection.ts. + const { isGeneratedFile } = await import('../extraction/generated-detection'); + const results = [...rawResults].sort((a, b) => { + const aGen = isGeneratedFile(a.node.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.node.filePath) ? 
1 : 0; + return aGen - bGen; + }); + if (options.json) { console.log(JSON.stringify(results, null, 2)); } else { diff --git a/src/context/formatter.ts b/src/context/formatter.ts index 37a08ee84..748d17201 100644 --- a/src/context/formatter.ts +++ b/src/context/formatter.ts @@ -5,6 +5,7 @@ */ import { Node, Edge, TaskContext, Subgraph } from '../types'; +import { isGeneratedFile } from '../extraction/generated-detection'; /** * Format context as markdown @@ -21,10 +22,17 @@ export function formatContextAsMarkdown(context: TaskContext): string { lines.push('## Code Context\n'); lines.push(`**Query:** ${context.query}\n`); - // Entry points - compact format - if (context.entryPoints.length > 0) { + // Entry points - compact format. Re-sort so generated files (.pb.go, + // .pulsar.go, mocks, …) rank LAST — a flow query should lead with the + // hand-written implementation, not protobuf scaffolding. + const orderedEntries = [...context.entryPoints].sort((a, b) => { + const aGen = isGeneratedFile(a.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.filePath) ? 1 : 0; + return aGen - bGen; + }); + if (orderedEntries.length > 0) { lines.push('### Entry Points\n'); - for (const node of context.entryPoints) { + for (const node of orderedEntries) { const location = node.startLine ? `:${node.startLine}` : ''; lines.push(`- **${node.name}** (${node.kind}) - ${node.filePath}${location}`); if (node.signature) { @@ -34,9 +42,14 @@ export function formatContextAsMarkdown(context: TaskContext): string { lines.push(''); } - // Related symbols - compact list (skip verbose structure tree) + // Related symbols - compact list (skip verbose structure tree). Drop nodes + // in generated source files (`.pb.go` / `.pulsar.go` / mocks / …) — agents + // chasing a flow never want to land on protobuf scaffolding (cosmos-Q3 used + // to list `gov.pulsar.go::GetExpeditedThreshold` and `1.pulsar.go::Get` in + // Related Symbols, pure noise that displaced real-flow entries). 
const otherSymbols = Array.from(context.subgraph.nodes.values()) .filter(n => !context.entryPoints.some(e => e.id === n.id)) + .filter(n => !isGeneratedFile(n.filePath)) .slice(0, 10); // Limit to 10 related symbols if (otherSymbols.length > 0) { @@ -55,10 +68,16 @@ export function formatContextAsMarkdown(context: TaskContext): string { lines.push(''); } - // Code blocks - only for key entry points + // Code blocks - only for key entry points. Re-sort so non-generated blocks + // show first (consistent with Entry Points reordering above). if (context.codeBlocks.length > 0) { + const orderedBlocks = [...context.codeBlocks].sort((a, b) => { + const aGen = isGeneratedFile(a.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.filePath) ? 1 : 0; + return aGen - bGen; + }); lines.push('### Code\n'); - for (const block of context.codeBlocks) { + for (const block of orderedBlocks) { const nodeName = block.node?.name ?? 'Unknown'; lines.push(`#### ${nodeName} (${block.filePath}:${block.startLine})\n`); lines.push('```' + block.language); diff --git a/src/context/index.ts b/src/context/index.ts index da4c0bf05..7e6619e8b 100644 --- a/src/context/index.ts +++ b/src/context/index.ts @@ -587,6 +587,37 @@ export class ContextBuilder { } } + // Iter7 — Core-directory boost. On projects with one file that holds + // the dense majority of internal call edges (e.g. sinatra's + // `lib/sinatra/base.rb` at 85% of all in-file edges), the agent's + // task usually asks about the framework's core. Without this boost, + // ranking favors small focused extension files (e.g. text search + // picks `sinatra-contrib/lib/sinatra/multi_route.rb`'s 10-line + // `route` method over `base.rb`'s `route!` because the extension + // file's `route` matches the query verbatim AND the file is small, + // dwarfing the longer name `route!` in a 1500-line file). 
Boost + // results that share a directory prefix with the dominant file's + // directory so the core file's siblings outrank sibling-package + // extensions. + try { + const dominant = this.queries.getDominantFile?.(); + if (dominant && dominant.edgeCount >= 3 * dominant.nextEdgeCount) { + // Take the directory of the dominant file (everything up to the + // last slash). For `lib/sinatra/base.rb` → `lib/sinatra/`. + const slash = dominant.filePath.lastIndexOf('/'); + if (slash > 0) { + const coreDir = dominant.filePath.slice(0, slash + 1); + for (const result of searchResults) { + if (result.node.filePath.startsWith(coreDir)) { + result.score += 25; + } + } + } + } + } catch { + // SQL failure — fall through, scoring works without the boost + } + // Step 5a: Multi-term co-occurrence re-ranking (applied BEFORE truncation). // For multi-word queries like "search execution from request to shard", // nodes matching 2+ query terms in their name or path are far more relevant diff --git a/src/db/queries.ts b/src/db/queries.ts index 11f5bc34c..a0ac31eea 100644 --- a/src/db/queries.ts +++ b/src/db/queries.ts @@ -20,6 +20,32 @@ import { import { safeJsonParse } from '../utils'; import { kindBonus, nameMatchBonus, scorePathRelevance } from '../search/query-utils'; import { parseQuery, boundedEditDistance } from '../search/query-parser'; +import { isGeneratedFile } from '../extraction/generated-detection'; + +/** + * Path-only heuristic for files that should not be candidates for + * "dominant file" detection: test/spec files and tool-generated files. + * Generated files (`*.pb.go`, `*.pulsar.go`, mock outputs, …) often + * have huge in-file edge counts that dwarf the real source — etcd's + * `rpc.pb.go` has 4× the in-file edges of `server.go`. 
+ */ +function isLowValueFile(filePath: string): boolean { + const lp = filePath.toLowerCase(); + return ( + /(?:^|\/)(tests?|__tests?__|spec)\//.test(lp) || + /_test\.go$/.test(lp) || + /(?:^|\/)test_[^/]+\.py$/.test(lp) || + /_test\.py$/.test(lp) || + /_spec\.rb$/.test(lp) || + /_test\.rb$/.test(lp) || + /\.(test|spec)\.[jt]sx?$/.test(lp) || + /(test|spec|tests)\.(java|kt|scala)$/.test(lp) || + /(tests?|spec)\.cs$/.test(lp) || + /tests?\.swift$/.test(lp) || + /_test\.dart$/.test(lp) || + isGeneratedFile(filePath) + ); +} const SQLITE_PARAM_CHUNK_SIZE = 500; @@ -182,6 +208,9 @@ export class QueryBuilder { getUnresolvedBatch?: SqliteStatement; getAllFilePaths?: SqliteStatement; getAllNodeNames?: SqliteStatement; + getDominantFile?: SqliteStatement; + getTopRouteFile?: SqliteStatement; + getRoutingManifest?: SqliteStatement; } = {}; constructor(db: SqliteDatabase) { @@ -489,6 +518,158 @@ export class QueryBuilder { return rows.map(rowToNode); } + /** + * Find the file that holds the densest concentration of the project's + * internal call graph — the "core" file. Used by context-builder to + * boost ranking of symbols in that file's directory (so e.g. sinatra + * queries surface `lib/sinatra/base.rb`'s `route!` instead of + * `sinatra-contrib/lib/sinatra/multi_route.rb`'s `route` extension). + * + * Returns null if no file has a meaningful concentration (e.g. spread + * evenly across many files, or empty index). + * + * "Internal" = source and target are in the same file. Cross-file + * edges aren't useful here — they don't tell us which file is the + * functional center. + * + * Excludes test/spec files from candidacy via path-pattern. The agent's + * typical question is "how does X work", not "how is X tested", so + * boosting a test file's directory would be a misfire. 
+ */ + getDominantFile(): { filePath: string; edgeCount: number; nextEdgeCount: number } | null { + if (!this.stmts.getDominantFile) { + // Pull top 20 candidates; we then filter out test/generated files + // in code (regex-grade matching that SQL LIKE can't express). The + // generated-file filter is critical — without it, etcd's + // `api/etcdserverpb/rpc.pb.go` (1916 in-file edges, generated + // protobuf stub) outranks the real `server/etcdserver/server.go` + // (470 edges) by 4×, and the boost would push the agent toward + // generated code. + this.stmts.getDominantFile = this.db.prepare(` + SELECT n.file_path AS file_path, COUNT(*) AS edge_count + FROM edges e + JOIN nodes n ON e.source = n.id + JOIN nodes m ON e.target = m.id + WHERE n.file_path = m.file_path + GROUP BY n.file_path + ORDER BY edge_count DESC + LIMIT 20 + `); + } + const rows = this.stmts.getDominantFile.all() as Array<{ file_path: string; edge_count: number }>; + const filtered = rows.filter(r => !isLowValueFile(r.file_path)); + if (filtered.length === 0 || filtered[0]!.edge_count < 20) return null; + return { + filePath: filtered[0]!.file_path, + edgeCount: filtered[0]!.edge_count, + nextEdgeCount: filtered[1]?.edge_count ?? 0, + }; + } + + /** + * Find the file that holds the densest concentration of the project's + * `route` nodes (framework-emitted: Express/Gin/Flask/Rails/Drupal/etc.). + * Used by handleContext on small repos to inline the project's routing + * config when the agent's query is about request flow — eliminating the + * "Glob + Read routes.rb" pattern that beats codegraph on tiny realworld + * template repos. + * + * Excludes test/generated files from candidacy. Returns null if there + * are fewer than 3 non-test routes total, or if no file holds at least + * 30% of them (diffuse routing → no single answer file). 
+ */ + getTopRouteFile(): { filePath: string; routeCount: number; totalRoutes: number } | null { + if (!this.stmts.getTopRouteFile) { + this.stmts.getTopRouteFile = this.db.prepare(` + SELECT file_path, COUNT(*) AS cnt + FROM nodes + WHERE kind = 'route' + GROUP BY file_path + ORDER BY cnt DESC + LIMIT 20 + `); + } + const rows = this.stmts.getTopRouteFile.all() as Array<{ file_path: string; cnt: number }>; + const filtered = rows.filter(r => !isLowValueFile(r.file_path)); + if (filtered.length === 0) return null; + const totalRoutes = filtered.reduce((sum, r) => sum + r.cnt, 0); + const top = filtered[0]!; + if (totalRoutes < 3 || top.cnt < 3) return null; + if (top.cnt / totalRoutes < 0.30) return null; + return { filePath: top.file_path, routeCount: top.cnt, totalRoutes }; + } + + /** + * Build a URL → handler manifest from the index. Each route node's + * `references` (or `calls`) edge points at the function/method that handles the + * request. We join them in one pass; the agent gets the canonical + * routing answer ("POST /users/login → AuthController#login") without + * having to parse the framework's route DSL itself. + * + * Also returns the file with the most handler endpoints — used as the + * "top handler file" to inline source for, so the agent has both the + * mapping AND the handler implementations. + */ + getRoutingManifest(limit: number = 40): { + entries: Array<{ url: string; handler: string; handlerFile: string; handlerLine: number; handlerKind: string }>; + topHandlerFile: string | null; + topHandlerFileCount: number; + totalRoutes: number; + } | null { + if (!this.stmts.getRoutingManifest) { + // Edge kind varies across framework resolvers: Spring/Rails/ + // Laravel/Drupal emit `references`, Express emits `calls`. Accept + // both — the semantic is the same (route → its handler). 
+ this.stmts.getRoutingManifest = this.db.prepare(` + SELECT + r.name AS url, + h.name AS handler, + h.file_path AS handler_file, + h.start_line AS handler_line, + h.kind AS handler_kind + FROM nodes r + JOIN edges e ON e.source = r.id + JOIN nodes h ON e.target = h.id + WHERE r.kind = 'route' + AND e.kind IN ('references', 'calls') + AND h.kind IN ('function', 'method', 'class') + ORDER BY r.file_path, r.start_line + LIMIT ? + `); + } + const rows = this.stmts.getRoutingManifest.all(limit) as Array<{ + url: string; handler: string; handler_file: string; handler_line: number; handler_kind: string; + }>; + // Drop test/generated handlers — same hygiene as elsewhere. + const filtered = rows.filter(r => !isLowValueFile(r.handler_file)); + if (filtered.length < 3) return null; + // Identify the file holding the most handlers (the "primary handler file"). + const fileCounts = new Map(); + for (const r of filtered) { + fileCounts.set(r.handler_file, (fileCounts.get(r.handler_file) ?? 0) + 1); + } + let topHandlerFile: string | null = null; + let topHandlerFileCount = 0; + for (const [file, count] of fileCounts) { + if (count > topHandlerFileCount) { + topHandlerFile = file; + topHandlerFileCount = count; + } + } + return { + entries: filtered.map(r => ({ + url: r.url, + handler: r.handler, + handlerFile: r.handler_file, + handlerLine: r.handler_line, + handlerKind: r.handler_kind, + })), + topHandlerFile, + topHandlerFileCount, + totalRoutes: filtered.length, + }; + } + /** * Get all nodes of a specific kind */ diff --git a/src/extraction/generated-detection.ts b/src/extraction/generated-detection.ts new file mode 100644 index 000000000..bde190725 --- /dev/null +++ b/src/extraction/generated-detection.ts @@ -0,0 +1,78 @@ +/** + * Generated-file detection for symbol-disambiguation down-ranking. 
+ * + * When a query like "Send" matches 17 symbols across protobuf scaffolding, + * test mocks, and the hand-written implementation, the FTS ranker often + * surfaces the generated stubs first because their names are identical + * to the implementation's name (validated empirically on cosmos-sdk — + * see project_go_multi_module_audit memory). Generated stubs frequently + * have no body to trace from, so the agent ends up reading source anyway. + * + * This helper is a pure path-based classifier consulted at disambiguation + * time (findSymbol / findAllSymbols / codegraph_search formatting), NOT + * a hard filter — generated nodes are still in the graph and remain + * reachable; they just rank LAST when there's a real implementation + * with the same name. + * + * Scope: suffix patterns only. Most generated files follow the + * `..` convention (`.pb.go`, `_grpc.pb.go`, + * `.g.dart`, `_pb2.py`), and that covers ~all of what we saw in the + * Go audit. A future addition would be scanning for the canonical + * `// Code generated by` header during extraction, for the rare files + * that defy the suffix convention. + */ + +const GENERATED_PATTERNS: ReadonlyArray<RegExp> = [ + // Go — protobuf / gRPC / pulsar + /\.pb\.go$/, + /\.pulsar\.go$/, + /_grpc\.pb\.go$/, + // Go — mockgen output. Default emits `mock_.go`; many projects + // (cosmos-sdk uses `expected_*_mocks.go`) rename to `*_mock.go` / + // `*_mocks.go`. Matching either suffix catches both conventions + // without false-positive risk on hand-written sources. + /_mock\.go$/, + /_mocks\.go$/, + /(?:^|\/)mock_[^/]+\.go$/, + // TypeScript / JavaScript — common codegen suffixes (Apollo / GraphQL + // codegen, Prisma, Hasura, ts-proto, gRPC-web, swagger-codegen). 
+ /\.generated\.[jt]sx?$/, + /\.gen\.[jt]sx?$/, + /\.pb\.[jt]s$/, + /_pb\.[jt]s$/, + /_grpc_pb\.[jt]s$/, + // Python — protobuf / gRPC / openapi-codegen + /_pb2(_grpc)?\.py$/, + /_pb2\.pyi$/, + // C++ — protobuf + /\.pb\.(cc|h)$/, + // C# — protobuf / gRPC (protoc-gen-csharp puts output under obj/ but + // many projects also commit *.g.cs and *Grpc.cs siblings) + /\.g\.cs$/, + /Grpc\.cs$/, + // Java — protobuf / gRPC: protoc-gen-java emits `*OuterClass.java`, + // protoc-gen-grpc-java emits `*Grpc.java`. The XxxImplBase abstract + // class lives inside Xxx*Grpc.java. + /OuterClass\.java$/, + /Grpc\.java$/, + // Swift — protobuf + /\.pb\.swift$/, + // Dart — build_runner / freezed / json_serializable / chopper + /\.g\.dart$/, + /\.freezed\.dart$/, + /\.pb\.dart$/, + /\.pbgrpc\.dart$/, + /\.chopper\.dart$/, + // Rust — common build.rs OUT_DIR outputs are usually outside the source + // tree, but in-tree generated files often use `*.generated.rs`. + /\.generated\.rs$/, +]; + +/** + * Whether `filePath` looks like a tool-generated source file based on + * its filename. Path-only — does not read content. The result is a + * relevance hint for disambiguation, not a hard claim. + */ +export function isGeneratedFile(filePath: string): boolean { + return GENERATED_PATTERNS.some((p) => p.test(filePath)); +} diff --git a/src/index.ts b/src/index.ts index 14b0fb0a6..ee3bf51fa 100644 --- a/src/index.ts +++ b/src/index.ts @@ -683,6 +683,33 @@ export class CodeGraph { return this.queries.searchNodes(query, options); } + /** + * Find the project's "primary route file" — the file with the densest + * concentration of framework-emitted `route` nodes (≥3 routes, ≥30% + * of all non-test routes). Used to inline the routing config in + * `codegraph_context` responses on small realworld template repos + * (rails-realworld, laravel-realworld, drupal-admintoolbar, …) where + * Glob+Read of `routes.rb`/`urls.py`/etc. otherwise beats codegraph. 
+ */ + getTopRouteFile(): { filePath: string; routeCount: number; totalRoutes: number } | null { + return this.queries.getTopRouteFile(); + } + + /** + * Build a URL → handler routing manifest from the index. Each entry + * pairs a route node (URL + method) with its handler function/method + * via the `references` edge that framework resolvers emit. Returns + * null when fewer than 3 valid (non-test) routes exist. + */ + getRoutingManifest(limit?: number): { + entries: Array<{ url: string; handler: string; handlerFile: string; handlerLine: number; handlerKind: string }>; + topHandlerFile: string | null; + topHandlerFileCount: number; + totalRoutes: number; + } | null { + return this.queries.getRoutingManifest(limit); + } + // =========================================================================== // Edge Operations // =========================================================================== diff --git a/src/mcp/engine.ts b/src/mcp/engine.ts index 15439b047..9ba89da1e 100644 --- a/src/mcp/engine.ts +++ b/src/mcp/engine.ts @@ -222,12 +222,17 @@ export class MCPEngine { /** * Reconcile the index with the current filesystem once, right after open — * catches edits, adds, deletes, and `git pull`/`checkout` changes made while - * no watcher was running. Background, never awaited. + * no watcher was running. Runs in the background, but the returned promise + * is pushed into the ToolHandler as a one-shot gate so the *first* tool + * call awaits completion before serving (without this, a tool call that + * races past sync returns rows for files that no longer exist on disk — + * and the per-file staleness banner can't help because `getPendingFiles()` + * is populated by the watcher, not by catch-up). 
*/ private catchUpSync(): void { const cg = this.cg; if (!cg) return; - void cg + const p = cg .sync() .then((result) => { const changed = result.filesAdded + result.filesModified + result.filesRemoved; @@ -239,6 +244,7 @@ export class MCPEngine { const msg = err instanceof Error ? err.message : String(err); process.stderr.write(`[CodeGraph MCP] Catch-up sync failed: ${msg}\n`); }); + this.toolHandler.setCatchUpGate(p); } } diff --git a/src/mcp/tools.ts b/src/mcp/tools.ts index 5ed057af3..09d1831d9 100644 --- a/src/mcp/tools.ts +++ b/src/mcp/tools.ts @@ -21,10 +21,13 @@ import { lstatSync, openSync, readFileSync, + statSync, writeSync, } from 'fs'; import { clamp, validatePathWithinRoot, validateProjectPath } from '../utils'; +import { isGeneratedFile } from '../extraction/generated-detection'; import { tmpdir } from 'os'; +import * as pathModule from 'path'; import { join, resolve as resolvePath } from 'path'; /** Maximum output length to prevent context bloat (characters) */ @@ -123,21 +126,52 @@ export interface ExploreOutputBudget { includeCompletenessSignal: boolean; /** Include the explore-budget reminder at the end. */ includeBudgetNote: boolean; + /** + * Hard-drop test/spec/icon/i18n files from the relevant-file set unless + * the query itself mentions tests. Today they're only deprioritized in + * the sort, which on tiny repos still lets one slip into the top N (e.g. + * cobra's `command_test.go` displaced `args.go` and contributed ~10KB of + * pure noise to "How does cobra parse commands?"). Off by default; on + * for the very-tiny tier where one slip dominates the budget. + */ + excludeLowValueFiles: boolean; } export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget { + if (fileCount < 150) { + return { + // ITER3: revert iter2's aggressive body shrink (forced Read fallback — + // the per-file 2.5K cap pushed the agent to Read instead of node). + // Back to the iter1 shape (13K/4/3.8K) but keep the test-file + // hard-exclude. 
The cost lever for this tier lives in handleContext + // (steering the agent to stop after 1-2 calls), not in this budget. + maxOutputChars: 13000, + defaultMaxFiles: 4, + maxCharsPerFile: 3800, + gapThreshold: 7, + maxSymbolsInFileHeader: 5, + maxEdgesPerRelationshipKind: 4, + includeRelationships: false, + includeAdditionalFiles: false, + includeCompletenessSignal: false, + includeBudgetNote: false, + excludeLowValueFiles: true, + }; + } if (fileCount < 500) { return { + // ITER3: same revert/keep-filter pattern as <150. maxOutputChars: 18000, defaultMaxFiles: 5, maxCharsPerFile: 3800, gapThreshold: 8, maxSymbolsInFileHeader: 6, maxEdgesPerRelationshipKind: 6, - includeRelationships: true, + includeRelationships: false, includeAdditionalFiles: false, includeCompletenessSignal: false, includeBudgetNote: false, + excludeLowValueFiles: true, }; } if (fileCount < 5000) { @@ -157,6 +191,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget { includeAdditionalFiles: true, includeCompletenessSignal: true, includeBudgetNote: true, + excludeLowValueFiles: false, }; } if (fileCount < 15000) { @@ -171,6 +206,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget { includeAdditionalFiles: true, includeCompletenessSignal: true, includeBudgetNote: true, + excludeLowValueFiles: false, }; } return { @@ -184,6 +220,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget { includeAdditionalFiles: true, includeCompletenessSignal: true, includeBudgetNote: true, + excludeLowValueFiles: false, }; } @@ -382,7 +419,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_context', - description: 'PRIMARY TOOL — call this FIRST for any "how does X work", architecture, feature, or bug-context question. Composes search + node + callers + callees and returns entry points, related symbols, and key code in ONE call — usually enough to answer with no further search/Read/Grep. 
Prefer this over chaining codegraph_search + codegraph_node, and over codegraph_explore. NOTE: provides CODE context, not product requirements; for new features still clarify UX/edge cases with the user.', + description: 'PRIMARY TOOL — call FIRST for any "how does X work"/architecture/bug question. Returns entry points + related symbols + key code in one call; usually answers without further search/Read/Grep. Provides CODE context, not product requirements.', inputSchema: { type: 'object', properties: { @@ -407,7 +444,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_callers', - description: 'Find all functions/methods that call a specific symbol. Useful for understanding usage patterns and impact of changes.', + description: 'List functions that call . For deep flow use codegraph_trace.', inputSchema: { type: 'object', properties: { @@ -427,7 +464,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_callees', - description: 'Find all functions/methods that a specific symbol calls. Useful for understanding dependencies and code flow.', + description: 'List functions that calls. For deep flow use codegraph_trace.', inputSchema: { type: 'object', properties: { @@ -447,7 +484,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_impact', - description: 'Analyze the impact radius of changing a symbol. Shows what code could be affected by modifications.', + description: 'List symbols affected by changing . Use before a refactor.', inputSchema: { type: 'object', properties: { @@ -467,7 +504,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_node', - description: 'Get ONE symbol\'s details (location, signature, docstring) PLUS its TRAIL — what it calls and what calls it, each with file:line. Pass includeCode=true for source (functions return their body; containers return a member outline). 
Use this to WALK the call graph hop-by-hop — node a symbol, then node one of its trail entries — the structural, no-Read way to follow "what calls/triggers/handles X" across files. For a broad first overview of many symbols at once use codegraph_explore; use node to drill along a specific path from there. (If a trail is empty on a non-leaf, that hop is likely dynamic dispatch — read just that line.) Source returned with includeCode is the verbatim live file content — identical to Read.', + description: 'One symbol\'s location, signature, callers/callees trail. includeCode=true returns the verbatim body. Use codegraph_trace for full paths instead of chaining nodes.', inputSchema: { type: 'object', properties: { @@ -487,7 +524,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_explore', - description: 'Returns source for SEVERAL related symbols grouped by file, plus a relationship map, in ONE capped call. This is the efficient way to inspect many related symbols at once — strongly prefer it over a series of codegraph_node or Read calls (each separate call re-reads the whole context, so 8 node calls cost far more than 1 explore). Use it after codegraph_context when you need to see the actual source of several symbols. Query with specific symbol/file/code terms, NOT natural-language sentences — run codegraph_search first to find names. Bad: "how are agent prompts loaded and passed to the CLI". Good: "renderStaticScene drawElementOnCanvas ShapeCache renderElement.ts". The code it returns is the VERBATIM live file source (byte-for-byte identical to Read), line-numbered — not a summary; treat files it shows as already Read, no need to re-open them.', + description: 'Source of SEVERAL related symbols grouped by file, in one capped call. Query is a bag of symbol/file names (not a question). Returned source is verbatim Read-equivalent — do not re-open shown files. 
Prefer over chained codegraph_node.', inputSchema: { type: 'object', properties: { @@ -507,7 +544,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_status', - description: 'Get the status of the CodeGraph index, including statistics about indexed files, nodes, and edges.', + description: 'Index health check (files / nodes / edges). Skip unless debugging.', inputSchema: { type: 'object', properties: { @@ -517,7 +554,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_files', - description: 'REQUIRED for file/folder exploration. Get the project file structure from the CodeGraph index. Returns a tree view of all indexed files with metadata (language, symbol count). Much faster than Glob/filesystem scanning. Use this FIRST when exploring project structure, finding files, or understanding codebase organization.', + description: 'Indexed file tree with language + symbol counts. Faster than Glob for project layout.', inputSchema: { type: 'object', properties: { @@ -550,7 +587,7 @@ export const tools: ToolDefinition[] = [ }, { name: 'codegraph_trace', - description: 'Trace the CALL PATH between two symbols — "how does reach/become ?" Returns the chain of functions from one to the other (each hop with file:line and its body inlined, plus the outgoing calls of the destination itself) in ONE call. This is something grep/Read structurally cannot do: there is no text pattern for "the path from A to B". Ideal for flow questions — how an update triggers a render, how a request reaches a handler, how a QuerySet becomes SQL. If no static path exists the chain likely breaks at dynamic dispatch (callbacks/descriptors/metaclasses); the tool says where and points you to codegraph_node to bridge it.', + description: 'Call path between two symbols — "how does reach ?" Returns the chain with each hop\'s body inlined plus the destination\'s callees, in ONE call. Ideal for flow questions (update→render, request→handler, QuerySet→SQL). 
If no static path exists the chain broke at dynamic dispatch — the failure response inlines both endpoints + their TO-file siblings.', inputSchema: { type: 'object', properties: { @@ -587,6 +624,14 @@ export class ToolHandler { // once and every later tool call reuses the result — never shelling out to // git on the hot path. `undefined` = not computed yet; `null` = no mismatch. private worktreeMismatchCache: Map = new Map(); + // Gate that the MCP engine pokes after `cg.open()` so the first tool call + // blocks on the post-open filesystem reconcile (catch-up sync). Without + // this, a tool call that races past `catchUpSync()` serves rows for files + // that were deleted (or edited) while no MCP server was running — and the + // per-file staleness banner can't help, because `getPendingFiles()` is + // populated by the watcher, not by catch-up. Cleared on first await so + // subsequent calls don't pay any cost. + private catchUpGate: Promise | null = null; constructor(private cg: CodeGraph | null) {} @@ -597,6 +642,17 @@ export class ToolHandler { this.cg = cg; } + /** + * Engine-only: register the catch-up sync promise so the next `execute()` + * call awaits it before serving. The handler swallows rejections (the + * engine logs them) so a sync failure never propagates as a tool error; + * we still want to serve a best-effort result over the same potentially- + * stale data, which is what would have happened without the gate. + */ + setCatchUpGate(p: Promise | null): void { + this.catchUpGate = p; + } + /** * Record the directory the server tried to resolve the default project from. * Used only to make the "no default project" error actionable. @@ -642,7 +698,7 @@ export class ToolHandler { */ getTools(): ToolDefinition[] { const allow = this.toolAllowlist(); - const visible = allow + let visible = allow ? 
tools.filter(t => allow.has(t.name.replace(/^codegraph_/, ''))) : tools; if (!this.cg) return visible; @@ -651,6 +707,40 @@ export class ToolHandler { const stats = this.cg.getStats(); const budget = getExploreBudget(stats.fileCount); + // Tiny-repo tool gating: on projects under TINY_REPO_FILE_THRESHOLD + // files, only expose the 5 core tools (search, context, node, + // explore, trace). The 5 omitted tools (callers, callees, impact, + // status, files) reduce to one grep at this scale. + // + // n=2 audits ruled out cutting below 5 tools: + // - 3-tool gate (search + context + trace): cost regressed on + // cobra/ky/sinatra. The agent fell back to raw Reads to cover + // what codegraph_node + codegraph_explore would have answered. + // - 1-tool gate (search only): catastrophic regression — express + // went from -43% WIN to +107% LOSS. With only search, the agent + // can't navigate the call graph structurally and reads everything. + // + // 5 is the empirical lower bound. Tools beyond search/context/ + // node/explore/trace pay overhead that the agent doesn't recoup + // on tiny-repo flow questions. + // ITER4: raise threshold 150 → 500 so single-file frameworks + // (sinatra at 159, slim_framework around 200) also get the + // 5-tool surface. The empirical 5-tool floor was set on <150 + // probes; iter3 measurement showed sinatra is structurally the + // SAME problem as cobra (single-file WITHOUT-arm Read wins), + // so it deserves the same gating. 
+ const TINY_REPO_FILE_THRESHOLD = 500; + const TINY_REPO_CORE_TOOLS = new Set([ + 'codegraph_search', + 'codegraph_context', + 'codegraph_node', + 'codegraph_explore', + 'codegraph_trace', + ]); + if (stats.fileCount < TINY_REPO_FILE_THRESHOLD) { + visible = visible.filter(t => TINY_REPO_CORE_TOOLS.has(t.name)); + } + return visible.map(tool => { if (tool.name === 'codegraph_explore') { return { @@ -928,6 +1018,16 @@ export class ToolHandler { */ async execute(toolName: string, args: Record<string, unknown>): Promise<ToolResult> { try { + // Block the first tool call on the engine's post-open reconcile so we + // never serve rows for files deleted/edited while no MCP server was + // running. The gate is cleared after first await — subsequent calls + // pay nothing. Catch-up failures are logged by the engine; we + // proceed regardless so a transient sync error never breaks tools. + if (this.catchUpGate) { + const gate = this.catchUpGate; + this.catchUpGate = null; + try { await gate; } catch { /* engine already logged */ } + } // Honor the optional tool allowlist (CODEGRAPH_MCP_TOOLS): a trimmed // surface rejects ablated tools defensively even if a client cached them. if (!this.isToolAllowed(toolName)) { @@ -1014,7 +1114,16 @@ export class ToolHandler { return this.textResult(`No results found for "${query}"`); } - const formatted = this.formatSearchResults(results); + // Down-rank generated files within the FTS-returned set so a search + // for "Send" surfaces the hand-written keeper before .pb.go stubs + // that share the name. Stable: only reorders generated vs. not. + const ranked = [...results].sort((a, b) => { + const aGen = isGeneratedFile(a.node.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.node.filePath) ? 
1 : 0; + return aGen - bGen; + }); + + const formatted = this.formatSearchResults(ranked); return this.textResult(this.truncateOutput(formatted)); } @@ -1032,7 +1141,21 @@ export class ToolHandler { } const cg = this.getCodeGraph(args.projectPath as string | undefined); - const maxNodes = (args.maxNodes as number) || 20; + // On tiny repos (<150 files), trim maxNodes hard — the entire repo + // is grep-able in a turn so a 20-node context is wasted budget. + // 8 covers the typical 1-3 entry-point + their immediate neighbors + // without dragging in the rest of the small codebase. + let defaultMaxNodes = 20; + let isTinyRepo = false; + let isSmallRepo = false; + try { + const stats = cg.getStats(); + if (stats.fileCount < 150) { defaultMaxNodes = 8; isTinyRepo = true; } + else if (stats.fileCount < 500) { isSmallRepo = true; } + } catch { + // stats failure — fall back to the standard default + } + const maxNodes = (args.maxNodes as number) || defaultMaxNodes; const includeCode = args.includeCode !== false; const context = await cg.buildContext(task, { @@ -1047,13 +1170,190 @@ export class ToolHandler { ? '\n\n⚠️ **Ask user:** UX preferences, edge cases, acceptance criteria' : ''; + // Auto-trace for flow queries: when the task is asking "how does X + // reach/flow/propagate from A to B", run the trace internally and + // append its body to the context response. Saves the agent the + // follow-up codegraph_trace call that was the #2 cost driver on + // multi-module flow questions (Q3 / etcd Q2 in the audit). + const flowTrace = await this.maybeInlineFlowTrace(task, cg); + + // Iter3 — sufficiency steering on small repos. + // + // Measured economics on tiny (<150) and small (<500) projects: every + // additional MCP tool call costs ~$0.02-0.05 in cache-write tokens + // (5K-15K per response at $3.75/1M). 
The agent reflexively follows + // codegraph_context with explore/node even when the context response + // is already sufficient — that pattern drove the cost gap that + // smaller bodies (iter2) failed to close (smaller bodies just shifted + // the agent to Read instead). Direct directive on small-repo + // responses: tell the agent the context call IS the comprehensive + // pass for a project of this size and that follow-ups should be + // narrow (trace from→to, node single-symbol) — not another broad + // explore that re-bundles the same content. + // ITER4: unified strong directive for both tiny (<150) and small + // (<500) tiers — measured iter3 result was that the soft <500 + // wording was IGNORED on sinatra (5 tool calls, +92% loss) while + // the strong <150 wording was followed on cobra/slim (3 calls, + // -21%/-22% wins). The single-file-framework problem (sinatra) + // is structurally the same as cobra's; both deserve the same + // sufficiency steering. + let smallRepoTail = ''; + let smallRepoRouteInline = ''; + if (isTinyRepo || isSmallRepo) { + // Iter12: backend-computed routing manifest for routing queries. + // Builds a URL → handler map directly from the graph (each route + // node has a `references` edge to its handler), then inlines the + // top handler file's source. The agent gets the canonical + // routing answer in one MCP call — no need to parse framework + // DSL or grep for handlers. + // + // Replaces iter10's raw route-file inline. The manifest is more + // information-dense (parsed URL→handler map vs raw config DSL) + // and we still inline the top handler file's source so the agent + // has the implementation bodies inline too. + const isRouteQuery = /\b(route|routes|routing|request|handler|endpoint|api|controller|middleware|dispatch|invok)/i.test(task); + if (isRouteQuery) { + try { + const manifest = cg.getRoutingManifest(40); + if (manifest) { + // 1) Compact URL→handler list (~30-60 lines, ~1-2KB). 
+ const lines: string[] = [ + `\n\n## Routing manifest (${manifest.totalRoutes} routes, top handler file holds ${manifest.topHandlerFileCount})`, + '', + '| URL | Handler | Location |', + '|---|---|---|', + ]; + for (const e of manifest.entries) { + lines.push(`| \`${e.url}\` | \`${e.handler}\` | ${e.handlerFile}:${e.handlerLine} |`); + } + // 2) Inline the top handler file's source. + if (manifest.topHandlerFile && manifest.topHandlerFileCount >= 2) { + try { + const fullPath = pathModule.join(cg.getProjectRoot(), manifest.topHandlerFile); + const stat = statSync(fullPath); + if (stat.size > 0 && stat.size <= 16000) { + const source = readFileSync(fullPath, 'utf-8'); + const capped = source.length > 7000 ? source.slice(0, 7000) + '\n... (truncated)' : source; + const ext = (manifest.topHandlerFile.match(/\.([a-z]+)$/i)?.[1] || '').toLowerCase(); + const lang = + ext === 'rb' ? 'ruby' : ext === 'py' ? 'python' : + ext === 'go' ? 'go' : ext === 'rs' ? 'rust' : + ext === 'js' || ext === 'jsx' ? 'javascript' : + ext === 'ts' || ext === 'tsx' ? 'typescript' : + ext === 'java' ? 'java' : ext === 'kt' ? 'kotlin' : + ext === 'cs' ? 'csharp' : ext === 'php' ? 'php' : + ext === 'swift' ? 'swift' : ext === 'yml' || ext === 'yaml' ? 'yaml' : ''; + lines.push(''); + lines.push(`### Top handler file (\`${manifest.topHandlerFile}\` — ${manifest.topHandlerFileCount}/${manifest.totalRoutes} routes, full source inlined — do NOT Read)`); + lines.push(''); + lines.push('```' + lang); + lines.push(capped); + lines.push('```'); + } + } catch { /* file read failed, skip the source inline */ } + } + smallRepoRouteInline = lines.join('\n'); + } + } catch { + // Manifest build failed — drop silently + } + } + const sizeQualifier = isTinyRepo ? 'under 150' : 'under 500'; + const routingClause = smallRepoRouteInline + ? ' The URL→handler manifest and top handler file are also inlined above — answer routing questions from them.' 
+ : ''; + smallRepoTail = `\n\n---\n> **This project is small** (${sizeQualifier} indexed files). The entry points and code above cover the relevant surface — **do NOT call codegraph_explore as a follow-up; its content will largely duplicate this response**. If you need a specific flow, call \`codegraph_trace from→to\`. If you need one specific symbol's body, call \`codegraph_node \`.${routingClause} Otherwise, answer from what is above.`; + } + // buildContext returns string when format is 'markdown' if (typeof context === 'string') { - return this.textResult(this.truncateOutput(context + reminder)); + return this.textResult(this.truncateOutput(context + flowTrace + reminder + smallRepoRouteInline + smallRepoTail)); } // If it returns TaskContext, format it - return this.textResult(this.truncateOutput(this.formatTaskContext(context) + reminder)); + return this.textResult(this.truncateOutput(this.formatTaskContext(context) + flowTrace + reminder + smallRepoRouteInline + smallRepoTail)); + } + + /** + * Detect a flow-style task ("how does X reach Y", "trace the path from A to B") + * and pre-run trace between the most likely endpoints, returning the trace + * body to splice into the context response. Returns '' for non-flow queries + * or when no plausible endpoint pair can be extracted. + * + * Conservative by design: only fires when the task has both a clear flow + * keyword AND at least two distinct PascalCase / camelCase identifiers. + * False positives waste a graph query; false negatives just fall back to + * the agent calling trace itself (existing path-proximity wiring handles + * disambiguation either way). 
+ */ + private async maybeInlineFlowTrace(task: string, cg: CodeGraph): Promise<string> { + const lower = task.toLowerCase(); + const FLOW_KEYWORDS = [ + 'trace ', + 'from ', + 'reach ', + 'flow ', + 'propagat', + 'how does ', + 'how do ', + ]; + if (!FLOW_KEYWORDS.some((k) => lower.includes(k))) return ''; + + // Extract candidate symbols — PascalCase or camelCase identifiers ≥3 chars. + // Filter out common non-symbol words and the flow keywords themselves. + const STOP_WORDS = new Set([ + 'how', 'does', 'the', 'and', 'from', 'through', 'reach', 'reaches', + 'flow', 'path', 'trace', 'cross', 'module', 'modules', 'where', + 'update', 'updates', 'updated', 'when', 'what', 'this', 'that', + ]); + const ids: string[] = []; + const seen = new Set<string>(); + const re = /\b([A-Z][a-z]+(?:[A-Z][a-z]*)+|[a-z]+[A-Z][a-z]*(?:[A-Z][a-z]*)*)\b/g; + let m: RegExpExecArray | null; + while ((m = re.exec(task)) !== null) { + const sym = m[1]!; + if (sym.length < 3) continue; + const key = sym.toLowerCase(); + if (STOP_WORDS.has(key) || seen.has(key)) continue; + seen.add(key); + ids.push(sym); + } + + if (ids.length < 2) return ''; + + // The first two distinct symbols, in order of appearance, are the most + // likely from/to endpoints — "from X ... through to Y" naturally places + // them in that order in the prose. If the trace fails to connect, it + // still returns the inlined endpoint bodies (the trace-failure rewrite). + const fromSym = ids[0]!; + const toSym = ids[1]!; + + let traceResult: ToolResult; + try { + traceResult = await this.handleTrace({ + from: fromSym, + to: toSym, + projectPath: cg.getProjectRoot(), + } as Record<string, unknown>); + } catch { + return ''; + } + // Extract the textual body. Defensive: handleTrace's contract is the + // standard tool-result shape used elsewhere in this file. + const body = traceResult.content + ?.map((c) => (c.type === 'text' ? 
c.text : '')) + .filter(Boolean) + .join('\n') + .trim(); + if (!body) return ''; + return [ + '', + '## Inline flow trace', + '', + `Auto-traced \`${fromSym}\` → \`${toSym}\` because the query looks like a flow question. No follow-up codegraph_trace is needed for this pair.`, + '', + body, + ].join('\n'); } /** @@ -1232,41 +1532,185 @@ export class ToolHandler { // (which, on real code, means the flow breaks at dynamic dispatch). const edgeKinds: Edge['kind'][] = ['calls']; const MAX_HOPS = 7; - const fromTry = fromMatches.nodes.slice(0, 3); - const toTry = toMatches.nodes.slice(0, 3); + // Path-proximity pairing: in a multi-module repo a symbol name like + // `EndBlocker` exists in 20+ modules. FTS picks one almost arbitrarily; + // the WRONG pair (e.g. simapp's wrapper EndBlocker paired with gov's Tally) + // has no static path, falls through to the dynamic-dispatch failure branch, + // and surfaces unrelated bodies — exactly the cosmos-Q3 trace failure mode. + // Score every from×to combo by shared file-path prefix length; try the + // most-co-located pair first (e.g. `x/gov/abci.go::EndBlocker` × + // `x/gov/keeper/tally.go::Tally` share `x/gov/`). + // + // Consider the FULL candidate set, not just the FTS top-5: the right + // EndBlocker for a gov-module flow may rank 8th in FTS but share the + // entire `x/gov/` prefix with the destination. Path-proximity supersedes + // FTS for this disambiguation. Findpath trials are still capped by + // FINDPATH_PAIR_BUDGET below to bound graph traversal cost. 
+ const sharedDirPrefixLen = (a: string, b: string): number => { + const aDir = a.replace(/[^/]+$/, ''); + const bDir = b.replace(/[^/]+$/, ''); + let i = 0; + while (i < aDir.length && i < bDir.length && aDir[i] === bDir[i]) i++; + return i; + }; + // Cosmos-Q3 surfaced a second-order failure: `enterprise/group/x/group/` + // SHARES MORE of its path with `enterprise/group/x/group/keeper/tally.go` + // (24 chars) than `x/gov/abci.go` shares with `x/gov/keeper/tally.go` + // (6 chars), so pure shared-prefix prefers the side-experiment module + // over the canonical one — even though the user's question is clearly + // about the main gov module. Penalize candidates living under prefixes + // that conventionally hold extensions / experiments / vendored code, so + // the canonical-path pair wins even when its shared prefix is short. + const isLessCanonicalPath = (p: string): boolean => + /^(enterprise|contrib|examples?|sample|playground|vendor|third[_-]?party|deprecated|legacy)\//i.test(p); + const LESS_CANONICAL_PENALTY = 100; // any canonical candidate beats any less-canonical one + const scorePair = (a: string, b: string): number => + sharedDirPrefixLen(a, b) + - (isLessCanonicalPath(a) ? LESS_CANONICAL_PENALTY : 0) + - (isLessCanonicalPath(b) ? LESS_CANONICAL_PENALTY : 0); + const fromCands = fromMatches.nodes; + const toCands = toMatches.nodes; + const pairs: Array<{ f: Node; t: Node; score: number }> = []; + for (const f of fromCands) { + for (const t of toCands) { + pairs.push({ f, t, score: scorePair(f.filePath, t.filePath) }); + } + } + // Sort by shared prefix desc, then by FTS order (already encoded in the + // pairs' insertion order — both for f and t). The tiebreaker preserves + // findAllSymbols' generated-file-last ranking. 
+ pairs.sort((a, b) => b.score - a.score); + // Cap how many graph-path probes we attempt so a 50×50 cross-product + // doesn't blow up on a god-named symbol like `Get` (well-named flows have + // their good pair near the top of the sort anyway). + const FINDPATH_PAIR_BUDGET = 20; + const fromTry = fromCands; + const toTry = toCands; let path: Array<{ node: Node; edge: Edge | null }> | null = null; let overCap: Array<{ node: Node; edge: Edge | null }> | null = null; - for (const f of fromTry) { - for (const t of toTry) { - const p = cg.findPath(f.id, t.id, edgeKinds); - if (!p || p.length <= 1) continue; - if (p.length <= MAX_HOPS) { path = p; break; } - if (!overCap || p.length < overCap.length) overCap = p; - } + let bestPair: { f: Node; t: Node } | null = null; + let triedPairs = 0; + for (const { f, t } of pairs) { if (path) break; + if (triedPairs >= FINDPATH_PAIR_BUDGET) break; + triedPairs++; + const p = cg.findPath(f.id, t.id, edgeKinds); + if (p && p.length > 1) { + if (p.length <= MAX_HOPS) { path = p; bestPair = { f, t }; break; } + if (!overCap || p.length < overCap.length) { overCap = p; bestPair = { f, t }; } + } else if (!bestPair) { + // No path yet — remember the top-scored pair so the failure branch + // surfaces the most-co-located candidates' bodies, not whatever FTS + // happened to put first. + bestPair = { f, t }; + } } if (!path) { - // No static path — almost always a dynamic-dispatch break. Surface the - // start symbol's outgoing calls so the agent can bridge the gap. - const start = fromTry[0]!; - const callees = cg.getCallees(start.id).slice(0, 10) - .map(c => `${c.node.name} (${c.node.filePath}:${c.node.startLine})`); + // No static path — almost always a dynamic-dispatch break. INSTEAD of + // telling the agent to chase the gap with codegraph_node/callers/callees + // (which fans out into 3-4 follow-up tool calls + a Read), inline the + // material those would have returned right here. 
Measured on cosmos-Q3: + // the failed-trace + subsequent fan-out used to cost ~2× a single + // sufficient trace call; this branch closes that gap. + // Prefer the path-proximity-best pair we identified above (e.g. gov's + // EndBlocker × gov's Tally) over the FTS top-pick (simapp's wrapper). + const start = bestPair?.f ?? fromTry[0]!; + const end = bestPair?.t ?? toTry[0]!; + const fileCache = new Map(); const lines = [ - `No direct call path from "${from}" to "${to}".`, + `No direct static call path from "${from}" to "${to}" — the chain almost certainly breaks at dynamic dispatch (a callback / interface dispatch / framework hook / metaclass). Both endpoint bodies + their immediate neighbors are inlined below; answer from them — a follow-up codegraph_node/callers/callees on these would just return what is already here.`, '', - (overCap - ? `(Only a ${overCap.length}-hop indirect chain connects them — almost certainly a BFS wander through unrelated code, not the real flow.) ` - : '') + - 'The direct chain most likely breaks at **dynamic dispatch** (a callback, descriptor, ' + - 'metaclass, or attribute-as-callable) that static parsing cannot resolve into an edge. ' + - `Inspect \`${start.name}\` (${start.filePath}:${start.startLine}) with codegraph_node ` + - '(includeCode=true) — its body usually shows the dynamic call to follow next.', ]; - if (callees.length > 0) { - lines.push('', `**${start.name} statically calls:** ${callees.join(', ')}`); + if (overCap) { + lines.push( + `> Indirect chain of ${overCap.length} hops exists but is over the ${MAX_HOPS}-hop cap (usually a BFS wander through unrelated code, not the real execution flow).`, + '', + ); } - return this.textResult(lines.join('\n') + fromMatches.note + toMatches.note); + + // Track which node IDs we've already inlined a body for so we don't + // double-emit when a callee of FROM is also surfaced separately. 
+ const inlinedBodies = new Set(); + const inlineBody = (n: Node, lineCap: number, charCap: number): boolean => { + if (inlinedBodies.has(n.id)) return false; + inlinedBodies.add(n.id); + const body = this.sourceRangeAt(cg, n.filePath, n.startLine, n.endLine, fileCache, lineCap, charCap); + if (body) { lines.push(body); return true; } + return false; + }; + + const inlineEndpoint = ( + label: 'FROM' | 'TO', + node: Node, + ) => { + lines.push(`### ${label}: \`${node.name}\` (${node.filePath}:${node.startLine}-${node.endLine})`); + inlineBody(node, 120, 3600); + const callers = cg.getCallers(node.id).slice(0, 6); + if (callers.length > 0) { + lines.push(`**Callers of \`${node.name}\`:** ` + + callers.map(c => `${c.node.name} (${c.node.filePath}:${c.node.startLine})`).join(', ')); + } + const callees = cg.getCallees(node.id).slice(0, 8); + if (callees.length > 0) { + lines.push(`**\`${node.name}\` calls:** ` + + callees.map(c => `${c.node.name} (${c.node.filePath}:${c.node.startLine})`).join(', ')); + } + lines.push(''); + }; + inlineEndpoint('FROM', start); + if (end.id !== start.id) inlineEndpoint('TO', end); + + // Inline the OTHER top-level functions/methods in TO's file — that's + // where the missing dynamic-dispatch flow usually lives. Concrete + // measurement from cosmos-Q1: `msgServer.Send` statically calls only + // utility functions (`StringToBytes`, `Wrapf`); its real next-hop + // `SendCoins` is invoked via an embedded-interface call (`k.Keeper.SendCoins`) + // that static parsing CAN'T see. The flow IS in the same file as the + // destination (`x/bank/keeper/send.go`: SendCoins → subUnlockedCoins → + // addCoins → setBalance). Pre-inlining those file-mates is what + // replaces the agent's "trace fail → search SendCoins → node SendCoins + // → trace again" fan-out. 
+ const NEIGHBOR_LINES = 40; + const NEIGHBOR_CHARS = 1200; + const NEIGHBOR_K = 5; + const fileSiblings = (anchor: Node): Node[] => { + // Functions and methods in the same file as the anchor, excluding + // the anchor itself and anything we've already inlined. Sort by + // distance from the anchor's startLine so the closest symbols come + // first (the flow is usually adjacent in the file). + const sameFile = cg + .getNodesByKind('function') + .filter((n) => n.filePath === anchor.filePath) + .concat( + cg.getNodesByKind('method').filter((n) => n.filePath === anchor.filePath), + ); + return sameFile + .filter((n) => n.id !== anchor.id && !inlinedBodies.has(n.id)) + .sort((a, b) => + Math.abs(a.startLine - anchor.startLine) - Math.abs(b.startLine - anchor.startLine), + ) + .slice(0, NEIGHBOR_K); + }; + const renderSiblings = (label: string, siblings: Node[]) => { + if (siblings.length === 0) return; + lines.push(`### ${label}`); + for (const sib of siblings) { + lines.push(''); + lines.push(`- \`${sib.name}\` (${sib.filePath}:${sib.startLine}-${sib.endLine})`); + inlineBody(sib, NEIGHBOR_LINES, NEIGHBOR_CHARS); + } + lines.push(''); + }; + renderSiblings( + `Other functions in \`${end.filePath}\` (the flow that the dynamic-dispatch hop reaches — bodies inlined)`, + fileSiblings(end), + ); + + lines.push( + '> Endpoint bodies + the other functions in the destination\'s file are inlined above. Together they typically cover the missing dynamic-dispatch boundary (interface-method calls like `k.Keeper.SendCoins` that static parsing can\'t follow). 
**No further codegraph_node / codegraph_callers / codegraph_callees / Read / Grep is needed for any symbol already shown here** — call them again only if you need to walk DEEPER than what is inlined.', + ); + return this.textResult(this.truncateOutput(lines.join('\n') + fromMatches.note + toMatches.note)); } const lines: string[] = [ @@ -1649,11 +2093,52 @@ export class ToolHandler { } // Only include files that have entry points or nodes directly connected to entry points - const relevantFiles = [...fileGroups.entries()].filter(([, group]) => group.score >= 3); + let relevantFiles = [...fileGroups.entries()].filter(([, group]) => group.score >= 3); // Extract query terms for relevance checking const queryTerms = query.toLowerCase().split(/\s+/).filter(t => t.length >= 3); + // Test/spec/icon/i18n file detector — used both for the pre-sort hard + // filter (tiny tier) and the comparator deprioritization (all tiers). + const isLowValue = (p: string) => { + const lp = p.toLowerCase(); + return ( + /\/(tests?|__tests?__|spec)\//.test(lp) || + /_test\.go$/.test(lp) || + /(?:^|\/)test_[^/]+\.py$/.test(lp) || + /_test\.py$/.test(lp) || + /_spec\.rb$/.test(lp) || + /_test\.rb$/.test(lp) || + /\.(test|spec)\.[jt]sx?$/.test(lp) || + /(test|spec|tests)\.(java|kt|scala)$/.test(lp) || + /(tests?|spec)\.cs$/.test(lp) || + /tests?\.swift$/.test(lp) || + /_test\.dart$/.test(lp) || + /\bicons?\b/.test(lp) || + /\bi18n\b/.test(lp) + ); + }; + + // Tiny-tier hard-exclude: on small projects (`excludeLowValueFiles` + // budget flag), one slipped test/spec file dominates the per-file budget + // (cobra's `command_test.go` displaced `args.go` and contributed ~10KB of + // pure noise to "How does cobra parse commands?"). The sort-step + // deprioritization isn't enough at small N. Skip the hard-exclude when + // the query itself is about tests — that's the legitimate "explore the + // tests" case where the agent does want them. 
+ if (budget.excludeLowValueFiles) { + const queryMentionsTests = /\b(test|tests|testing|spec|verify|verifies)\b/i.test(query); + if (!queryMentionsTests) { + const nonLow = relevantFiles.filter(([p]) => !isLowValue(p)); + // Only apply the hard-filter if we still have at least 2 non-test + // candidates after the cut — otherwise the agent is asking about an + // area where tests are the only signal, and we should not strip them. + if (nonLow.length >= 2) { + relevantFiles = nonLow; + } + } + } + // Sort files: highest relevance first, deprioritize low-value files const sortedFiles = relevantFiles.sort((a, b) => { const aPath = a[0].toLowerCase(); @@ -1670,15 +2155,20 @@ export class ToolHandler { const bRelevant = hasQueryRelevance(bPath, b[1].nodes); if (aRelevant !== bRelevant) return aRelevant ? -1 : 1; - // Deprioritize test files, icon files, and i18n files - const isLowValue = (p: string) => - /\/(tests?|__tests?__|spec)\//i.test(p) || - /\bicons?\b/i.test(p) || - /\bi18n\b/i.test(p); const aLow = isLowValue(aPath); const bLow = isLowValue(bPath); if (aLow !== bLow) return aLow ? 1 : -1; + // Deprioritize generated source (.pb.go / .pulsar.go / _mocks.go / …) — + // the agent rarely needs to see the protobuf scaffold or gomock output + // when asking about the actual flow, and dumping their bodies inflates + // the response (the cosmos Q3 explore otherwise leads with + // `expected_keepers_mocks.go`, displacing the real `tally.go` content + // and forcing the agent to Read tally.go anyway). + const aGen = isGeneratedFile(a[0]); + const bGen = isGeneratedFile(b[0]); + if (aGen !== bGen) return aGen ? 
1 : -1; + if (a[1].score !== b[1].score) return b[1].score - a[1].score; return b[1].nodes.length - a[1].nodes.length; }); @@ -2519,12 +3009,21 @@ export class ToolHandler { } if (exactMatches.length > 1) { + // Down-rank generated files (.pb.go, .pulsar.go, _grpc.pb.go, …) + // so a query like "Send" prefers the keeper implementation over + // the protobuf-generated interface stub. Stable sort preserves + // FTS order within each group. See generated-detection.ts. + const ranked = [...exactMatches].sort((a, b) => { + const aGen = isGeneratedFile(a.node.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.node.filePath) ? 1 : 0; + return aGen - bGen; + }); // Multiple exact matches - pick first, note the others - const picked = exactMatches[0]!.node; - const others = exactMatches.slice(1).map(r => + const picked = ranked[0]!.node; + const others = ranked.slice(1).map(r => `${r.node.name} (${r.node.kind}) at ${r.node.filePath}:${r.node.startLine}` ); - const note = `\n\n> **Note:** ${exactMatches.length} symbols named "${symbol}". Showing results for \`${picked.filePath}:${picked.startLine}\`. Others: ${others.join(', ')}`; + const note = `\n\n> **Note:** ${ranked.length} symbols named "${symbol}". Showing results for \`${picked.filePath}:${picked.startLine}\`. Others: ${others.join(', ')}`; return { node: picked, note }; } @@ -2562,11 +3061,20 @@ export class ToolHandler { return { nodes: [node], note: '' }; } - const locations = exactMatches.map(r => + // Same generated-file down-rank as findSymbol — keeps callers/callees + // /impact aggregation aligned (a query against "Send" returns the + // hand-written implementations before the protobuf scaffold). + const ranked = [...exactMatches].sort((a, b) => { + const aGen = isGeneratedFile(a.node.filePath) ? 1 : 0; + const bGen = isGeneratedFile(b.node.filePath) ? 
1 : 0; + return aGen - bGen; + }); + + const locations = ranked.map(r => `${r.node.kind} at ${r.node.filePath}:${r.node.startLine}` ); - const note = `\n\n> **Note:** Aggregated results across ${exactMatches.length} symbols named "${symbol}": ${locations.join(', ')}`; - return { nodes: exactMatches.map(r => r.node), note }; + const note = `\n\n> **Note:** Aggregated results across ${ranked.length} symbols named "${symbol}": ${locations.join(', ')}`; + return { nodes: ranked.map(r => r.node), note }; } /** diff --git a/src/resolution/callback-synthesizer.ts b/src/resolution/callback-synthesizer.ts index c3047569e..def7ff6fe 100644 --- a/src/resolution/callback-synthesizer.ts +++ b/src/resolution/callback-synthesizer.ts @@ -24,6 +24,7 @@ import type { Edge, Node } from '../types'; import type { QueryBuilder } from '../db/queries'; import type { ResolutionContext } from './types'; +import { isGeneratedFile } from '../extraction/generated-detection'; const REGISTRAR_NAME = /^(on[A-Z]\w*|subscribe|addListener|addEventListener|register|watch|listen|addCallback)$/; const DISPATCHER_NAME = /(emit|trigger|notify|dispatch|fire|publish|flush)/i; @@ -337,7 +338,16 @@ function cppOverrideEdges(queries: QueryBuilder): Edge[] { * trace/callees reach the implementation. Over-approximation accepted * (reachability-correct); capped per class, gated to JVM languages. */ -const IFACE_OVERRIDE_LANGS = new Set(['java', 'kotlin']); +// Languages whose static `implements`/`extends` edges should bridge an +// interface (or abstract base) method to the matching concrete-class method. +// The set is "languages with explicit nominal subtyping and a single class +// kind that holds methods" — i.e. the shape this loop expects. Swift and +// Scala fit shape-wise (Swift `protocol`/`class`, Scala `trait`/`class`) +// and are added below; their concrete-side nodes can be a `struct` (Swift) +// or an `object` (Scala) so the loop also iterates those kinds. 
+const IFACE_OVERRIDE_LANGS = new Set([ + 'java', 'kotlin', 'csharp', 'typescript', 'javascript', 'swift', 'scala', +]); function interfaceOverrideEdges(queries: QueryBuilder): Edge[] { const edges: Edge[] = []; const seen = new Set(); @@ -346,7 +356,12 @@ function interfaceOverrideEdges(queries: QueryBuilder): Edge[] { .getOutgoingEdges(classId, ['contains']) .map((e) => queries.getNodeById(e.target)) .filter((n): n is Node => !!n && n.kind === 'method'); - for (const cls of queries.getNodesByKind('class')) { + // Concrete-side kinds vary by language: `class` covers Java / Kotlin / + // C# / TS / Swift-classes / Scala-classes; `struct` covers Swift value + // types that conform to protocols. Iterate both. + const concreteKinds = ['class', 'struct'] as const; + for (const kind of concreteKinds) { + for (const cls of queries.getNodesByKind(kind)) { const implMethods = methodsOf(cls.id).filter((n) => IFACE_OVERRIDE_LANGS.has(n.language)); if (implMethods.length === 0) continue; for (const sup of queries.getOutgoingEdges(cls.id, ['implements', 'extends'])) { @@ -383,6 +398,116 @@ function interfaceOverrideEdges(queries: QueryBuilder): Edge[] { } } } + } + return edges; +} + +/** + * Go gRPC stub → impl bridge. The protoc-gen-go-grpc codegen emits an + * `UnimplementedXxxServer` struct in `*_grpc.pb.go` carrying one method + * per service RPC; the real handler is a hand-written struct in another + * file (`x/bank/keeper/msg_server.go::msgServer.Send` in cosmos-sdk). + * Go's structural typing means no `implements` edge exists for our + * resolver to follow, so `trace("Send","SendCoins")` lands on the + * empty stub and reports "no path" (validated empirically — the cosmos + * Q1 r1 trace failure that drove this work). + * + * Bridge: for each `UnimplementedXxxServer` whose RPC-method names are + * a SUBSET of some other Go struct's method names, emit `calls` edges + * `stub.method → impl.method` (paired by name). 
Excludes the gRPC + * internal markers `mustEmbedUnimplementedXxxServer` and + * `testEmbeddedByValue`, and skips candidate impls that themselves + * live in a generated file (their `xxxClient` / sibling stubs would + * otherwise look like impls). + * + * Multiple candidates are allowed and capped at MAX_CALLBACKS_PER_CHANNEL: + * a service often has both a production impl and one or more test + * mocks; linking to all preserves trace utility without false-favoring. + * + * Provenance: `heuristic`, `synthesizedBy: 'go-grpc-stub-impl'`. The + * stub's source line is the wiring site shown in the trace trail. + */ +function goGrpcStubImplEdges(queries: QueryBuilder): Edge[] { + const edges: Edge[] = []; + const seen = new Set<string>(); + + const STUB_RE = /^Unimplemented.*Server$/; + // gRPC internal-helper methods that appear on every Unimplemented*Server; + // not part of the service contract, so exclude when computing the RPC-method + // signature used to match impls. + const isInternalMarker = (n: string) => n.startsWith('mustEmbed') || n === 'testEmbeddedByValue'; + + // Methods directly contained by each Go struct, name-only. Built once. + const methodNamesByStruct = new Map<string, Set<string>>(); + const methodNodesByStruct = new Map<string, Node[]>(); + const goStructs: Node[] = []; + for (const s of queries.getNodesByKind('struct')) { + if (s.language !== 'go') continue; + goStructs.push(s); + const ms = queries + .getOutgoingEdges(s.id, ['contains']) + .map((e) => queries.getNodeById(e.target)) + .filter((n): n is Node => !!n && n.kind === 'method'); + methodNodesByStruct.set(s.id, ms); + methodNamesByStruct.set(s.id, new Set(ms.map((m) => m.name))); + } + + for (const stub of goStructs) { + if (!STUB_RE.test(stub.name)) continue; + // The stub MUST live in a generated file; that's what tells us this is + // a protoc-emitted scaffold rather than someone naming a struct + // `UnimplementedXxxServer` by hand.
Without this gate we'd also bridge + // such hand-written structs and create misleading edges. + if (!isGeneratedFile(stub.filePath)) continue; + + const stubMethods = (methodNodesByStruct.get(stub.id) ?? []).filter( + (m) => !isInternalMarker(m.name), + ); + if (stubMethods.length === 0) continue; + const stubMethodNames = stubMethods.map((m) => m.name); + + for (const cand of goStructs) { + if (cand.id === stub.id) continue; + // Skip generated-file candidates — they're siblings (msgClient, + // UnsafeMsgServer, …) whose method sets coincidentally match. + if (isGeneratedFile(cand.filePath)) continue; + + const candNames = methodNamesByStruct.get(cand.id); + if (!candNames) continue; + // Subset: every RPC method must exist on the candidate by name. + // Signature-level match would tighten this further, but name-match + // alone already gives one-to-one pairing in real codebases because + // gRPC method-name sets are highly distinctive (Send + MultiSend + + // UpdateParams + SetSendEnabled is unique to bank's MsgServer). + if (!stubMethodNames.every((n) => candNames.has(n))) continue; + + const candMethods = methodNodesByStruct.get(cand.id) ?? 
[]; + let added = 0; + for (const sm of stubMethods) { + if (added >= MAX_CALLBACKS_PER_CHANNEL) break; + for (const cm of candMethods) { + if (added >= MAX_CALLBACKS_PER_CHANNEL) break; + if (cm.name !== sm.name) continue; + const key = `${sm.id}>${cm.id}`; + if (seen.has(key)) continue; + seen.add(key); + edges.push({ + source: sm.id, + target: cm.id, + kind: 'calls', + line: sm.startLine, + provenance: 'heuristic', + metadata: { + synthesizedBy: 'go-grpc-stub-impl', + via: cm.name, + registeredAt: `${cm.filePath}:${cm.startLine}`, + }, + }); + added++; + } + } + } + } return edges; } @@ -856,6 +981,7 @@ export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionCo const flutterEdges = flutterBuildEdges(queries, ctx); const cppEdges = cppOverrideEdges(queries); const ifaceEdges = interfaceOverrideEdges(queries); + const goGrpcEdges = goGrpcStubImplEdges(queries); const rnEventEdgesList = rnEventEdges(ctx); const fabricNativeEdges = fabricNativeImplEdges(ctx); const mybatisEdges = mybatisJavaXmlEdges(queries); @@ -871,6 +997,7 @@ export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionCo ...flutterEdges, ...cppEdges, ...ifaceEdges, + ...goGrpcEdges, ...rnEventEdgesList, ...fabricNativeEdges, ...mybatisEdges,
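Reviewer sketch (not part of the patch): the name-subset matching that `goGrpcStubImplEdges` performs can be illustrated in isolation. This is a minimal sketch under stated assumptions. `StructInfo`, `BridgeEdge`, and `bridgeStubsToImpls` are hypothetical names invented for the illustration, the `generated` flag stands in for `isGeneratedFile(filePath)`, and the per-channel cap and de-duplication set from the real function are omitted.

```typescript
// Hypothetical, simplified shapes; NOT the project's real Node/Edge types.
interface StructInfo {
  name: string;
  generated: boolean; // stand-in for isGeneratedFile(filePath)
  methods: string[];  // method names directly contained by the struct
}

interface BridgeEdge {
  from: string; // "Stub.Method"
  to: string;   // "Impl.Method"
}

const STUB_RE = /^Unimplemented.*Server$/;

// gRPC helper methods present on every stub; not part of the RPC contract.
const isInternalMarker = (n: string): boolean =>
  n.startsWith('mustEmbed') || n === 'testEmbeddedByValue';

function bridgeStubsToImpls(structs: StructInfo[]): BridgeEdge[] {
  const edges: BridgeEdge[] = [];
  for (const stub of structs) {
    // Only generated structs named like a protoc stub qualify.
    if (!STUB_RE.test(stub.name) || !stub.generated) continue;
    const rpcNames = stub.methods.filter((m) => !isInternalMarker(m));
    if (rpcNames.length === 0) continue;

    for (const cand of structs) {
      // Skip the stub itself and any generated sibling (clients, other stubs).
      if (cand === stub || cand.generated) continue;
      const candNames = new Set(cand.methods);
      // Subset test: every RPC method must exist on the candidate by name.
      if (!rpcNames.every((n) => candNames.has(n))) continue;
      for (const n of rpcNames) {
        edges.push({ from: `${stub.name}.${n}`, to: `${cand.name}.${n}` });
      }
    }
  }
  return edges;
}
```

With a stub carrying `Send` and `MultiSend`, a hand-written struct that has both methods (plus extras) is bridged, while a generated client or a struct with only `Send` is not. Matching by name only is an over-approximation, which is why the patch labels the resulting edges `heuristic`.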