feat: migrate eval system to vitest-evals (#441)
## Summary

Adds [vitest-evals](https://github.com/getsentry/vitest-evals) integration for structured, standardized eval runs.

## Architecture

```
beforeAll (per scenario):
  1. Start isolated gateway (temp DB)
  2. Replay session turns through gateway (scripted interceptor)
  3. Run /lore:curate (distillation + curation)
  4. Backfill embeddings
  → Gateway is warmed up with Lore state

Each it() test:
  1. Send QA question through warmed-up gateway (LTM + recall tool)
  2. Get response
  3. FactualityJudge scores against reference answer

afterAll:
  Tear down gateway
```

## New Files

- **`vitest.evals.config.ts`** — Separate Vitest config for evals (600s test timeout, 30min hook timeout, single-fork, vitest-evals reporter)
- **`packages/core/eval/lore-harness.ts`** — `createHarness()` wrapper around the Lore gateway with `replayAndWarmup()` for session replay
- **`packages/core/eval/cm1.eval.ts`** — CM-1 scenario (400K inflated) as vitest-evals tests
- **`packages/core/eval/mega-session.eval.ts`** — 2.3M mega-session as vitest-evals tests
- **`package.json`** — `bun run evals` script

## What This Gets Us

- **Vitest test runner** — standard `vitest run` output, filtering, parallelism
- **FactualityJudge** — built-in LLM-as-judge (replaces custom judge.ts)
- **GitHub Actions reporter** — summary + annotations via `vitest-evals/reporter`
- **Standard interface** — each question is an `it()` test case

## What Stays the Same

- Scenario definitions (`scenarios/*.ts`) — unchanged
- Gateway lifecycle — reused from harness.ts
- Scripted interceptor — reused
- Existing `run.ts` / `harness.ts` — coexist (can be removed later)

## Tests

- 1752 pass, 0 fail (existing tests unaffected — .eval.ts only in eval config)
- Typecheck clean
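
For reviewers who want a concrete picture of the settings listed for `vitest.evals.config.ts`, here is a minimal sketch of what the options might look like. It is not the committed file. The option names (`testTimeout`, `hookTimeout`, `pool`, `poolOptions.forks.singleFork`, `reporters`, `include`) are Vitest config keys as they exist in Vitest 1.x to 3.x. The `include` glob and the exact way "single-fork" is expressed are assumptions, since the commit message only gives the timeouts, the single-fork mode, and the reporter name.

```typescript
// Sketch of the eval-only Vitest options described in this commit.
// Anything marked "assumption" is not stated in the commit message.
const SECOND_MS = 1_000;
const MINUTE_MS = 60 * SECOND_MS;

const evalTestOptions = {
  // Assumption: eval files are selected by their .eval.ts suffix, which
  // keeps them out of the regular unit-test run.
  include: ["packages/core/eval/**/*.eval.ts"],

  // 600s per it() case: each case sends one QA question and waits for the judge.
  testTimeout: 600 * SECOND_MS,

  // 30min for beforeAll/afterAll: session replay, curation, and embedding
  // backfill all happen inside the hook.
  hookTimeout: 30 * MINUTE_MS,

  // Assumption: "single-fork" maps to the forks pool with singleFork enabled,
  // so all eval files run in one child process.
  pool: "forks" as const,
  poolOptions: { forks: { singleFork: true } },

  // Reporter module named in the commit message.
  reporters: ["vitest-evals/reporter"],
};

console.log(evalTestOptions.testTimeout, evalTestOptions.hookTimeout);
```

In an actual config file this object would presumably be passed as `defineConfig({ test: evalTestOptions })` using `defineConfig` from `vitest/config`. The long hook timeout relative to the test timeout reflects that the expensive warm-up runs once per scenario, while each question only pays for a single gateway round trip and a judge call.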
Burak Yigit Kaya committed
24b3d3ea70293d51f97f1fc97e3854d4cb212873
Parent: 6d650e5
Committed by GitHub <noreply@github.com>
on 5/21/2026, 9:22:53 AM