SIGN IN SIGN UP

feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning (#494)

* feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining

Multi-pronged fix to make codegraph competitive on Go multi-module repos
(cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question
agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the
baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd
deep cross-module flows, while winning cleanly on the single-module and
non-protobuf-heavy repos.

Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes
without it). The actual failure modes were generated-file noise warping
disambiguation, missing gRPC interface→impl bridge in structural-typing Go,
and trace's failure path triggering 3-5 follow-up tool calls instead of
inlining the material the agent needed.

Changes:

- New `src/extraction/generated-detection.ts` — path-pattern classifier
  for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`,
  `mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`,
  `.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in
  `findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI),
  `codegraph_explore` file ranking, and context formatter Entry Points /
  Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3
  instead of #9 on a `Send` search.

- New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` —
  detects `UnimplementedXxxServer` structs in generated files, identifies
  their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC
  markers), and emits `calls` edges to the matching methods on any
  non-generated struct whose method-name set is a superset. Closes Go's
  structural-typing gap that the existing `interfaceOverrideEdges` (Java /
  Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's
  `UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go`
  only, not to `msgClient` siblings or mock files.

- Trace-failure rewrite (`handleTrace`) — when no static path connects
  endpoints, instead of telling the agent to call `codegraph_node` (a
  3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars
  per endpoint), their callers (≤6), and callees (≤8) in one response.

- Trace endpoint-pairing improvements — scores every `from`×`to`
  candidate combo by shared directory prefix and tries the best-paired
  pair first (the full candidate set, not just FTS top-5). A
  less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`,
  `vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the
  canonical-module pair wins even when a side-experiment shares more of
  its directory prefix. Find-path probe budget capped at 20 pairs.

- Test-file deprioritization in `codegraph_explore` `isLowValue` — adds
  suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`,
  `Test.java`, `Spec.kt`) alongside the existing directory-style patterns.
  Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore
  budget that should go to the hand-written flow source.

Tests:

- New `__tests__/generated-detection.test.ts` (4 unit tests) pins the
  suffix patterns.
- New "Go gRPC stub→impl synthesis" integration test suite in
  `frameworks-integration.test.ts` (2 tests): positive bridge from stub
  to hand-written impl, AND the precision case (don't bridge to a
  generated sibling like `msgClient` in the same .pb.go).
- Full suite: 1076/1076 pass.

Empirical (post-fix, n=2 average per question):

| Repo / Q                | WITH       | WITHOUT     | Reads (W/WO) | Time (W/WO)
|-------------------------|------------|-------------|--------------|------------
| cobra (parse cmds)      | $0.27      | $0.27       | 0 / 4        | 39s / 60s
| prometheus (scrape→TSDB)| $0.63      | $0.70       | 0 / 6        | 106s/143s
| cosmos-sdk Q1 (MsgSend) | $0.41      | $0.26       | 1 / 2        | 67s / 64s
| cosmos-sdk Q2 (Delegate)| $0.47      | $0.46       | 0 / 5        | 50s / 73s
| cosmos-sdk Q3 (gov tally)| $0.34     | $0.31       | 1.5 / 3      | 54s / 76s
| etcd Q1 (Put→raft)      | $0.65      | $0.78       | 0 / 4        | 98s / 129s
| etcd Q2 (watch)         | $0.36      | $0.50       | 0 / 4+       | 58s / 89s

Codegraph wins on reads + time on every question. Cost is mixed: 3 clean
wins, 3 tied (within 10%), 1 stubborn cost loss on the grep-favored Q1.
Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15%
on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in
`/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`.

Memory written at `project_go_multi_module_audit.md` for the methodology
+ before/after numbers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): auto-inline trace in codegraph_context for flow queries

When a codegraph_context task contains a flow keyword ("trace", "from",
"reach", "flow", "propagat", "how does", "how do") AND at least two
distinct PascalCase / camelCase identifiers, internally invoke trace
between the first two extracted symbols and splice the trace body into
the context response. Conservative trigger by design: false positives
waste one graph query; false negatives just fall back to the agent
calling trace itself (existing path-proximity wiring handles either
case).

Goal: collapse the agent's typical context → trace → explore sequence
into a single context call for clear flow queries, closing the
remaining cost-overhead gap on multi-call patterns. The path-proximity
+ less-canonical-path scoring + the trace-failure-inlined-bodies
behavior already let the inline trace land on the right endpoint pair
and return enough material that no follow-up codegraph_node/Read is
needed.

Doesn't fire on:
- cobra's "How does cobra parse commands and flags?" (no PascalCase
  symbols) — verified in regression run, no behavior change ($0.260
  WITH vs $0.257 WITHOUT, basically tied)
- queries where the agent doesn't call codegraph_context at all
  (cosmos Q1 in the audit went search → trace → node → trace → node)

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): trace failure inlines TO file siblings to displace node fan-out

The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's
*real* next hop is `k.Keeper.SendCoins` — an interface-method call on an
embedded field that tree-sitter can't resolve. The static getCallees list
for msgServer.Send is all utility/error functions (StringToBytes, Wrapf,
…). The actual flow (SendCoins → subUnlockedCoins → addCoins →
setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also
where the TO endpoint (setBalance) lives.

When trace fails (no static path), inline the **top 5 functions/methods
in the destination file**, ordered by line-distance from the TO node.
This catches the flow that interface-method calls obscure — the
canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java
dependency-injection / Rails service-object dispatch / etc. where
interface dispatch hides the real call.

Conservative: only fires on trace FAILURE (no static path); the success
path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings.
Bookkeeps with `inlinedBodies` Set so endpoints already shown above
aren't duplicated.

Result: cosmos-Q1 — historically the most stubborn cost loss (-2.2× to
-39% across the audit) — flipped to a clean WIN: $0.257 WITH vs $0.449
WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph
calls vs 12. Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1
all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and
fell within that on this run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extend coverage to all supported languages, not just Go

PR review feedback: the audit was Go-driven, so the patterns I added
were Go-flavored. Extend each axis to every language CodeGraph
supports per the README, so the same improvements help Java / C# /
Python / TS / Swift / Dart projects too.

**generated-detection.ts** — Added patterns for:
- TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s`
  (ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura).
- Python: `_pb2.pyi` (mypy stubs from protobuf).
- C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp).
- Java: `OuterClass.java` (protoc-gen-java), `Grpc.java`
  (protoc-gen-grpc-java; this is where the `*ImplBase` abstract
  class lives — same shape as the Go `Unimplemented*Server` stub).
- Swift: `.pb.swift` (protoc-gen-swift).
- Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`.
- Rust: `.generated.rs`.

**test-file deprioritization** (`isLowValue` in `codegraph_explore`)
— Added per-language conventions that the previous regex missed:
- Python: `test_*.py` (pytest discovery) and `*_test.py`.
- Ruby: `*_test.rb` (minitest) — `*_spec.rb` already covered.
- C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`.
- Swift: `*Tests.swift` (XCTest).
- Dart: `*_test.dart`.

**IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s
`interfaceOverrideEdges` — extended from `java, kotlin` to
`java, kotlin, csharp, typescript, javascript, swift, scala`. Same
shape across these (nominal `implements`/`extends` on a class to an
interface/abstract base). Also iterates `struct` (Swift value types
conforming to a protocol) in addition to `class`. The existing
matchesSymbol-style logic and `getOutgoingEdges(..., ['implements',
'extends'])` work unchanged.

**CLAUDE.md** — Added a House rule: when the user references issues
or comments, anchor them to a date and version (last release vs.
last main commit vs. current branch tip) BEFORE concluding a fix is
incomplete. Issue #388 comments from May 25-27 were responding to
the released v0.9.5 / merged-PR-469 state — not to this branch's
in-flight work. The new rule walks through the disambiguation:
`grep -m1 '^## \[' CHANGELOG.md` for release version, `git log
--first-parent main -1` for main tip.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): tiny-repo tool gating + shorter tool descriptions

Two cumulative changes targeting the small-repo cost gap surfaced by
the cross-language audit:

1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools).
   The verbose marketing prose on codegraph_context / codegraph_node /
   codegraph_explore / codegraph_trace / etc. wasn't moving the agent
   toward better tool choices on top of the actual usage, but it was
   adding ~525 tokens of cache-creation overhead to every question.
   The trimmed descriptions keep the operational hints (e.g. "Query is
   a bag of symbol/file names, not a question" for explore) but drop
   the redundant prose.

2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a
   project with < 150 indexed files, the MCP server only exposes the
   5 core tools (search, context, node, explore, trace) instead of all
   10 — the omitted callers/callees/impact/status/files tools' use
   cases on a sub-150-file repo reduce to one grep anyway. The MCP
   tool-defs overhead is the #1 source of cost loss on tiny repos
   (~$0.10-0.15 fixed cache-creation per question); cutting 5 tools
   drops that by ~50%.

   Effect on ky (~25 files, the worst pre-fix offender):
     - Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1)
     - After:  $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**)

   Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but
   the gating doesn't regress them — same call-count, same reads.
   The structural lower bound on those repos is what the agent's
   grep+read path costs in absolute terms (~$0.20-0.30).

   Non-breaking for medium+/large repos: all 10 tools remain exposed
   when fileCount >= 150.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): combined tiny-tier — smaller explore + tool gating (cobra/ky flip to WIN)

Combines the tool gating from the previous commit with a matching
explore-budget cut for projects under 150 files. The two together close
the cost gap that neither closes alone:

- Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra
- Explore-budget cut alone helped slim slightly but regressed cobra
- COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean

`getExploreOutputBudget(fileCount < 150)` returns:
  maxOutputChars: 13000     (was 18000)
  defaultMaxFiles:  4       (was 5)
  gapThreshold:     7       (was 8)
  maxSymbolsInFileHeader: 5 (was 6)
  maxEdgesPerRelationshipKind: 4 (was 6)
  includeRelationships: true   (kept ON — cheap structural signal)
  maxCharsPerFile: 3800        (unchanged — monotonic invariant w/ next tier)

This survives the cobra-regression-with-trim that the earlier
budget-only attempt suffered: with only 5 tools to choose from, the
agent doesn't fall back to extra codegraph_node calls when explore
returns less — there's no node call available.

Results on the four worst small-repo losses (combined intervention):

| Repo   | Files | WITH (combo)| WITHOUT     | Verdict (pre → post)     |
|--------|-------|-------------|-------------|--------------------------|
| cobra  | ~50   | $0.25       | $0.31       | loss → **WIN** (-19%)    |
| ky     | ~25   | $0.39       | $0.39       | -42% → tied              |
| slim   | ~80   | $0.31       | $0.24       | LOSS 31% → still LOSS    |
| sinatra| ~60   | $0.30       | $0.23       | LOSS 18% → still LOSS    |

sinatra/slim remain a cost-loss because their WITHOUT path is
structurally cheap (~$0.20 — fewer than 4 cheap grep+read calls).
Codegraph can't beat that absolute floor with any meaningful response.
Both still WIN on time + reads + tool-call count.

Tests: tier boundary cases updated to cover the new <150 / 150-499 /
500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated
to include the new 149↔150 boundary. All 1076 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(context): trim maxNodes default to 8 on tiny repos

On a <150-file project the entire repo is grep-able in one turn, so the
20-node default `codegraph_context` was paying for a graph subset that
exceeds the agent's actual question. Cutting the tiny-repo default to 8
(typical 1-3 entry points + their immediate 1-hop neighbors) reduces
the context-tool response body without hitting sufficiency on the flow
shapes small repos actually contain.

Non-breaking: the agent can still pass an explicit `maxNodes` to
override; medium+ repos (>=150 files) keep the 20-node default.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(mcp): pin the empirical 5-tool gating floor for tiny repos

n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search +
context + node + explore + trace) on the tiny-repo tier. The smaller
3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead
but the agent fell back to extra Reads to cover what codegraph_node and
codegraph_explore would have answered — net cost regression on all three
test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented
inline so future tuners don't re-try this dead-end.

No behavior change beyond the comment: the 5-tool gate remains the
production setting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(mcp): pin empirical lower bound on tool gating after n=2 micro test

Tested the hypothesis that exposing FEWER tools on micro repos (<50
files) would close the cost gap. Results:

- 1-tool gate (codegraph_search only):
  - ky:    +44% (worse than 5-tool +30%)
  - express: +107% (catastrophic — was -43% WIN with all 10)
  - cobra: +126% (way worse than 5-tool +17%)

The single-tool gate forces the agent to read everything because it
can't navigate the call graph. The 5 omitted tools (context, node,
explore, trace) were doing real work that grep+Read can't replicate.

Conclusion: 5 tools (search + context + node + explore + trace) is the
empirical lower bound on the tiny-repo tier. Cutting below regresses
EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead
on tiny repos is unavoidable without sacrificing the value codegraph
provides at that scale (which would also make WITH = WITHOUT, defeating
the install).

Comment documents the dead-ends so future tuners don't relitigate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): iter3/iter4 — raise tool-gate to 500, sufficiency steering in context, hard-exclude low-value files

Three layered changes targeting the sinatra/slim/small-repo cost gap
that iter2's body-shrink failed to close (smaller bodies just pushed
the agent to Read instead):

1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`).
   Sinatra (~159 files) and slim (~200 files) have the same structural
   problem as cobra (

* feat(context): iter7 — core-directory boost to surface dominant-file siblings in search ranking

On projects with a single file holding the dense majority of internal
call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file
edges), text search was favoring small focused extension files over the
core file. A small focused file like `multi_route.rb` wins on verbatim
name match + file-size normalization, burying the 1500-line core file's
longer method names (e.g. `route!` vs `route`).

Fix: detect the "dominant file" — the file whose in-file edge count is
≥3× the next candidate's — then add +25 to all results sharing its
directory prefix. This pulls the core file's siblings above
sibling-package extensions without hardcoding any repo structure.

`getDominantFile()` excludes test/spec files and generated files
(e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and
would otherwise hijack the boost toward generated protobuf stubs).
SQL pulls the top 20 candidates; path-pattern filtering handles what
SQLite LIKE can't express.

* feat(mcp): iter10+iter12 — routing manifest inline + probe-sweep harness

On small projects (<500 files) with a routing-shaped query, build a
URL→handler manifest directly from the graph (each `route` node joins to
its handler via `references`/`calls` edges) and inline the top handler
file's source. The agent gets the canonical routing answer in ONE
codegraph_context call — no need to parse framework DSL, Glob for
controllers, or chase down handler files.

The lever is "make the backend smarter so the agent doesn't have to":
- Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job
  in the WITHOUT arm. Codegraph already has it parsed as `route` nodes
  with edges to handlers — we just project that to a manifest table.
- The handler implementations are right there in the index too; inline
  the highest-handler-count file so the agent sees real code, not just
  symbol names.

Results on the realworld template repos that were losing badly:
  rails-rw  +89% LOSS → -15% WIN  (agent often answers with 0-1 tool calls)
  laravel-rw  +29% LOSS → +12% (tight gap)
  gin-rw    +30% LOSS → +23% (still loss but smaller)
  flask-mb  +64% LOSS → +25% (smaller gap)

The residual losses are mostly the agent's defensive read behavior on
super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a
19-row manifest + service file inlined). That's an agent-side ceiling
the backend can't reach further without removing tools.

Also lands `scripts/agent-eval/probe-sweep.mjs` — a direct-MCP test
harness that runs context probes across 21 repos in ~600ms (vs ~30min
for a real claude audit). Enables rapid iteration on backend changes:
edit tools.ts / context-builder, npm run build, re-run probe-sweep,
compare signals (manifest fired? handler file inlined? response size?)
before paying for a claude run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mcp): first tool call awaits catch-up sync (no stale rows for deleted files)

`MCPEngine.catchUpSync()` reconciles the index against the working tree
after open (catching `git pull`/`checkout`/`rebase` and any edits or
deletes made while no server was running). It was fire-and-forget — so a
tool call landing in the first ~50-300ms could race past it and serve
rows for files that no longer exist on disk. The per-file staleness
banner can't help here, because that signal is populated by the file
watcher (not by catch-up).

The fix: `catchUpSync()` now pushes its promise into `ToolHandler` via
`setCatchUpGate(p)`; the first `execute()` call awaits the gate and then
clears it. Subsequent calls pay nothing. Catch-up rejections are logged
by the engine and swallowed by the handler so a transient sync failure
never breaks tools.

Most visible on the "deleted everything between sessions" case, where
MCP previously returned stale rows pointing at non-existent files.
Validated end-to-end on a 10,640-file VS Code index: with the gate, a
codegraph_search for "ExtensionHost" against an empty (but stale-DB)
directory returns "No results found" after the catch-up drains the DB;
without the gate, the same call returns 10 stale hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): cover small-repo retrieval tuning + auto-trace + iface-override expansion

Add entries for work that landed on this branch but wasn't yet in
[Unreleased]: tiny-repo tool gating + sufficiency steering + budget
tier, auto-inline trace in codegraph_context, routing manifest inline,
core-directory ranking boost, JVM-only interfaceOverrideEdges extended
to C#/TS/JS/Swift/Scala, and the shorter tool descriptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C
Colby Mchenry committed
71935e37c24bddeb18886006faa4262af06ac89e
Parent: 02935d7
Committed by GitHub <noreply@github.com> on 5/28/2026, 5:38:03 PM