feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning (#494)
* feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining
Multi-pronged fix to make codegraph competitive on Go multi-module repos
(cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question
agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the
baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd
deep cross-module flows, while winning cleanly on the single-module and
non-protobuf-heavy repos.
Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes
without it). The actual failure modes were generated-file noise warping
disambiguation, missing gRPC interface→impl bridge in structural-typing Go,
and trace's failure path triggering 3-5 follow-up tool calls instead of
inlining the material the agent needed.
Changes:
- New `src/extraction/generated-detection.ts` — path-pattern classifier
for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`,
`mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`,
`.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in
`findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI),
`codegraph_explore` file ranking, and context formatter Entry Points /
Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3
instead of #9 on a `Send` search.
- New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` —
detects `UnimplementedXxxServer` structs in generated files, identifies
their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC
markers), and emits `calls` edges to the matching methods on any
non-generated struct whose method-name set is a superset. Closes Go's
structural-typing gap that the existing `interfaceOverrideEdges` (Java /
Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's
`UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go`
only, not to `msgClient` siblings or mock files.
- Trace-failure rewrite (`handleTrace`) — when no static path connects
endpoints, instead of telling the agent to call `codegraph_node` (a
3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars
per endpoint), their callers (≤6), and callees (≤8) in one response.
- Trace endpoint-pairing improvements — scores every `from`×`to`
candidate combo by shared directory prefix and tries the best-paired
pair first (the full candidate set, not just FTS top-5). A
less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`,
`vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the
canonical-module pair wins even when a side-experiment shares more of
its directory prefix. Find-path probe budget capped at 20 pairs.
- Test-file deprioritization in `codegraph_explore` `isLowValue` — adds
suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`,
`Test.java`, `Spec.kt`) alongside the existing directory-style patterns.
Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore
budget that should go to the hand-written flow source.
Tests:
- New `__tests__/generated-detection.test.ts` (4 unit tests) pins the
suffix patterns.
- New "Go gRPC stub→impl synthesis" integration test suite in
`frameworks-integration.test.ts` (2 tests): positive bridge from stub
to hand-written impl, AND the precision case (don't bridge to a
generated sibling like `msgClient` in the same .pb.go).
- Full suite: 1076/1076 pass.
Empirical (post-fix, n=2 average per question):
| Repo / Q | WITH | WITHOUT | Reads (W/WO) | Time (W/WO)
|-------------------------|------------|-------------|--------------|------------
| cobra (parse cmds) | $0.27 | $0.27 | 0 / 4 | 39s / 60s
| prometheus (scrape→TSDB)| $0.63 | $0.70 | 0 / 6 | 106s/143s
| cosmos-sdk Q1 (MsgSend) | $0.41 | $0.26 | 1 / 2 | 67s / 64s
| cosmos-sdk Q2 (Delegate)| $0.47 | $0.46 | 0 / 5 | 50s / 73s
| cosmos-sdk Q3 (gov tally)| $0.34 | $0.31 | 1.5 / 3 | 54s / 76s
| etcd Q1 (Put→raft) | $0.65 | $0.78 | 0 / 4 | 98s / 129s
| etcd Q2 (watch) | $0.36 | $0.50 | 0 / 4+ | 58s / 89s
Codegraph wins on reads + time on every question. Cost is mixed: 3 clean
wins, 3 tied (within 10%), 1 stubborn cost loss on the grep-favored Q1.
Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15%
on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in
`/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`.
Memory written at `project_go_multi_module_audit.md` for the methodology
+ before/after numbers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(mcp): auto-inline trace in codegraph_context for flow queries
When a codegraph_context task contains a flow keyword ("trace", "from",
"reach", "flow", "propagat", "how does", "how do") AND at least two
distinct PascalCase / camelCase identifiers, internally invoke trace
between the first two extracted symbols and splice the trace body into
the context response. Conservative trigger by design: false positives
waste one graph query; false negatives just fall back to the agent
calling trace itself (existing path-proximity wiring handles either
case).
Goal: collapse the agent's typical context → trace → explore sequence
into a single context call for clear flow queries, closing the
remaining cost-overhead gap on multi-call patterns. The path-proximity
+ less-canonical-path scoring + the trace-failure-inlined-bodies
behavior already let the inline trace land on the right endpoint pair
and return enough material that no follow-up codegraph_node/Read is
needed.
Doesn't fire on:
- cobra's "How does cobra parse commands and flags?" (no PascalCase
symbols) — verified in regression run, no behavior change ($0.260
WITH vs $0.257 WITHOUT, basically tied)
- queries where the agent doesn't call codegraph_context at all
(cosmos Q1 in the audit went search → trace → node → trace → node)
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(mcp): trace failure inlines TO file siblings to displace node fan-out
The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's
*real* next hop is `k.Keeper.SendCoins` — an interface-method call on an
embedded field that tree-sitter can't resolve. The static getCallees list
for msgServer.Send is all utility/error functions (StringToBytes, Wrapf,
…). The actual flow (SendCoins → subUnlockedCoins → addCoins →
setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also
where the TO endpoint (setBalance) lives.
When trace fails (no static path), inline the **top 5 functions/methods
in the destination file**, ordered by line-distance from the TO node.
This catches the flow that interface-method calls obscure — the
canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java
dependency-injection / Rails service-object dispatch / etc. where
interface dispatch hides the real call.
Conservative: only fires on trace FAILURE (no static path); the success
path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings.
Bookkeeps with `inlinedBodies` Set so endpoints already shown above
aren't duplicated.
Result: cosmos-Q1 — historically the most stubborn cost loss (-2.2× to
-39% across the audit) — flipped to a clean WIN: $0.257 WITH vs $0.449
WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph
calls vs 12. Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1
all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and
fell within that on this run.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: extend coverage to all supported languages, not just Go
PR review feedback: the audit was Go-driven, so the patterns I added
were Go-flavored. Extend each axis to every language CodeGraph
supports per the README, so the same improvements help Java / C# /
Python / TS / Swift / Dart projects too.
**generated-detection.ts** — Added patterns for:
- TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s`
(ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura).
- Python: `_pb2.pyi` (mypy stubs from protobuf).
- C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp).
- Java: `OuterClass.java` (protoc-gen-java), `Grpc.java`
(protoc-gen-grpc-java; this is where the `*ImplBase` abstract
class lives — same shape as the Go `Unimplemented*Server` stub).
- Swift: `.pb.swift` (protoc-gen-swift).
- Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`.
- Rust: `.generated.rs`.
**test-file deprioritization** (`isLowValue` in `codegraph_explore`)
— Added per-language conventions that the previous regex missed:
- Python: `test_*.py` (pytest discovery) and `*_test.py`.
- Ruby: `*_test.rb` (minitest) — `*_spec.rb` already covered.
- C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`.
- Swift: `*Tests.swift` (XCTest).
- Dart: `*_test.dart`.
**IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s
`interfaceOverrideEdges` — extended from `java, kotlin` to
`java, kotlin, csharp, typescript, javascript, swift, scala`. Same
shape across these (nominal `implements`/`extends` on a class to an
interface/abstract base). Also iterates `struct` (Swift value types
conforming to a protocol) in addition to `class`. The existing
matchesSymbol-style logic and `getOutgoingEdges(..., ['implements',
'extends'])` work unchanged.
**CLAUDE.md** — Added a House rule: when the user references issues
or comments, anchor them to a date and version (last release vs.
last main commit vs. current branch tip) BEFORE concluding a fix is
incomplete. Issue #388 comments from May 25-27 were responding to
the released v0.9.5 / merged-PR-469 state — not to this branch's
in-flight work. The new rule walks through the disambiguation:
`grep -m1 '^## \[' CHANGELOG.md` for release version, `git log
--first-parent main -1` for main tip.
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(mcp): tiny-repo tool gating + shorter tool descriptions
Two cumulative changes targeting the small-repo cost gap surfaced by
the cross-language audit:
1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools).
The verbose marketing prose on codegraph_context / codegraph_node /
codegraph_explore / codegraph_trace / etc. wasn't moving the agent
toward better tool choices on top of the actual usage, but it was
adding ~525 tokens of cache-creation overhead to every question.
The trimmed descriptions keep the operational hints (e.g. "Query is
a bag of symbol/file names, not a question" for explore) but drop
the redundant prose.
2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a
project with < 150 indexed files, the MCP server only exposes the
5 core tools (search, context, node, explore, trace) instead of all
10 — the omitted callers/callees/impact/status/files tools' use
cases on a sub-150-file repo reduce to one grep anyway. The MCP
tool-defs overhead is the #1 source of cost loss on tiny repos
(~$0.10-0.15 fixed cache-creation per question); cutting 5 tools
drops that by ~50%.
Effect on ky (~25 files, the worst pre-fix offender):
- Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1)
- After: $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**)
Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but
the gating doesn't regress them — same call-count, same reads.
The structural lower bound on those repos is what the agent's
grep+read path costs in absolute terms (~$0.20-0.30).
Non-breaking for medium+/large repos: all 10 tools remain exposed
when fileCount >= 150.
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(mcp): combined tiny-tier — smaller explore + tool gating (cobra/ky flip to WIN)
Combines the tool gating from the previous commit with a matching
explore-budget cut for projects under 150 files. The two together close
the cost gap that neither closes alone:
- Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra
- Explore-budget cut alone helped slim slightly but regressed cobra
- COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean
`getExploreOutputBudget(fileCount < 150)` returns:
maxOutputChars: 13000 (was 18000)
defaultMaxFiles: 4 (was 5)
gapThreshold: 7 (was 8)
maxSymbolsInFileHeader: 5 (was 6)
maxEdgesPerRelationshipKind: 4 (was 6)
includeRelationships: true (kept ON — cheap structural signal)
maxCharsPerFile: 3800 (unchanged — monotonic invariant w/ next tier)
This survives the cobra-regression-with-trim that the earlier
budget-only attempt suffered: with only 5 tools to choose from, the
agent doesn't fall back to extra codegraph_node calls when explore
returns less — there's no node call available.
Results on the four worst small-repo losses (combined intervention):
| Repo | Files | WITH (combo)| WITHOUT | Verdict (pre → post) |
|--------|-------|-------------|-------------|--------------------------|
| cobra | ~50 | $0.25 | $0.31 | loss → **WIN** (-19%) |
| ky | ~25 | $0.39 | $0.39 | -42% → tied |
| slim | ~80 | $0.31 | $0.24 | LOSS 31% → still LOSS |
| sinatra| ~60 | $0.30 | $0.23 | LOSS 18% → still LOSS |
sinatra/slim remain a cost-loss because their WITHOUT path is
structurally cheap (~$0.20 — fewer than 4 cheap grep+read calls).
Codegraph can't beat that absolute floor with any meaningful response.
Both still WIN on time + reads + tool-call count.
Tests: tier boundary cases updated to cover the new <150 / 150-499 /
500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated
to include the new 149↔150 boundary. All 1076 tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(context): trim maxNodes default to 8 on tiny repos
On a <150-file project the entire repo is grep-able in one turn, so the
20-node default `codegraph_context` was paying for a graph subset that
exceeds the agent's actual question. Cutting the tiny-repo default to 8
(typical 1-3 entry points + their immediate 1-hop neighbors) reduces
the context-tool response body without hitting sufficiency on the flow
shapes small repos actually contain.
Non-breaking: the agent can still pass an explicit `maxNodes` to
override; medium+ repos (>=150 files) keep the 20-node default.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(mcp): pin the empirical 5-tool gating floor for tiny repos
n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search +
context + node + explore + trace) on the tiny-repo tier. The smaller
3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead
but the agent fell back to extra Reads to cover what codegraph_node and
codegraph_explore would have answered — net cost regression on all three
test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented
inline so future tuners don't re-try this dead-end.
No behavior change beyond the comment: the 5-tool gate remains the
production setting.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(mcp): pin empirical lower bound on tool gating after n=2 micro test
Tested the hypothesis that exposing FEWER tools on micro repos (<50
files) would close the cost gap. Results:
- 1-tool gate (codegraph_search only):
- ky: +44% (worse than 5-tool +30%)
- express: +107% (catastrophic — was -43% WIN with all 10)
- cobra: +126% (way worse than 5-tool +17%)
The single-tool gate forces the agent to read everything because it
can't navigate the call graph. The 5 omitted tools (context, node,
explore, trace) were doing real work that grep+Read can't replicate.
Conclusion: 5 tools (search + context + node + explore + trace) is the
empirical lower bound on the tiny-repo tier. Cutting below regresses
EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead
on tiny repos is unavoidable without sacrificing the value codegraph
provides at that scale (which would also make WITH = WITHOUT, defeating
the install).
Comment documents the dead-ends so future tuners don't relitigate.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(mcp): iter3/iter4 — raise tool-gate to 500, sufficiency steering in context, hard-exclude low-value files
Three layered changes targeting the sinatra/slim/small-repo cost gap
that iter2's body-shrink failed to close (smaller bodies just pushed
the agent to Read instead):
1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`).
Sinatra (~159 files) and slim (~200 files) have the same structural
problem as cobra (
* feat(context): iter7 — core-directory boost to surface dominant-file siblings in search ranking
On projects with a single file holding the dense majority of internal
call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file
edges), text search was favoring small focused extension files over the
core file. A small focused file like `multi_route.rb` wins on verbatim
name match + file-size normalization, burying the 1500-line core file's
longer method names (e.g. `route!` vs `route`).
Fix: detect the "dominant file" — the file whose in-file edge count is
≥3× the next candidate's — then add +25 to all results sharing its
directory prefix. This pulls the core file's siblings above
sibling-package extensions without hardcoding any repo structure.
`getDominantFile()` excludes test/spec files and generated files
(e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and
would otherwise hijack the boost toward generated protobuf stubs).
SQL pulls the top 20 candidates; path-pattern filtering handles what
SQLite LIKE can't express.
* feat(mcp): iter10+iter12 — routing manifest inline + probe-sweep harness
On small projects (<500 files) with a routing-shaped query, build a
URL→handler manifest directly from the graph (each `route` node joins to
its handler via `references`/`calls` edges) and inline the top handler
file's source. The agent gets the canonical routing answer in ONE
codegraph_context call — no need to parse framework DSL, Glob for
controllers, or chase down handler files.
The lever is "make the backend smarter so the agent doesn't have to":
- Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job
in the WITHOUT arm. Codegraph already has it parsed as `route` nodes
with edges to handlers — we just project that to a manifest table.
- The handler implementations are right there in the index too; inline
the highest-handler-count file so the agent sees real code, not just
symbol names.
Results on the realworld template repos that were losing badly:
rails-rw +89% LOSS → -15% WIN (agent often answers with 0-1 tool calls)
laravel-rw +29% LOSS → +12% (tight gap)
gin-rw +30% LOSS → +23% (still loss but smaller)
flask-mb +64% LOSS → +25% (smaller gap)
The residual losses are mostly the agent's defensive read behavior on
super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a
19-row manifest + service file inlined). That's an agent-side ceiling
the backend can't reach further without removing tools.
Also lands `scripts/agent-eval/probe-sweep.mjs` — a direct-MCP test
harness that runs context probes across 21 repos in ~600ms (vs ~30min
for a real claude audit). Enables rapid iteration on backend changes:
edit tools.ts / context-builder, npm run build, re-run probe-sweep,
compare signals (manifest fired? handler file inlined? response size?)
before paying for a claude run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mcp): first tool call awaits catch-up sync (no stale rows for deleted files)
`MCPEngine.catchUpSync()` reconciles the index against the working tree
after open (catching `git pull`/`checkout`/`rebase` and any edits or
deletes made while no server was running). It was fire-and-forget — so a
tool call landing in the first ~50-300ms could race past it and serve
rows for files that no longer exist on disk. The per-file staleness
banner can't help here, because that signal is populated by the file
watcher (not by catch-up).
The fix: `catchUpSync()` now pushes its promise into `ToolHandler` via
`setCatchUpGate(p)`; the first `execute()` call awaits the gate and then
clears it. Subsequent calls pay nothing. Catch-up rejections are logged
by the engine and swallowed by the handler so a transient sync failure
never breaks tools.
Most visible on the "deleted everything between sessions" case, where
MCP previously returned stale rows pointing at non-existent files.
Validated end-to-end on a 10,640-file VS Code index: with the gate, a
codegraph_search for "ExtensionHost" against an empty (but stale-DB)
directory returns "No results found" after the catch-up drains the DB;
without the gate, the same call returns 10 stale hits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(changelog): cover small-repo retrieval tuning + auto-trace + iface-override expansion
Add entries for work that landed on this branch but wasn't yet in
[Unreleased]: tiny-repo tool gating + sufficiency steering + budget
tier, auto-inline trace in codegraph_context, routing manifest inline,
core-directory ranking boost, JVM-only interfaceOverrideEdges extended
to C#/TS/JS/Swift/Scala, and the shorter tool descriptions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> C
Colby Mchenry committed
71935e37c24bddeb18886006faa4262af06ac89e
Parent: 02935d7
Committed by GitHub <noreply@github.com>
on 5/28/2026, 5:38:03 PM