
feat: migrate eval system to vitest-evals (#441)

## Summary

Adds [vitest-evals](https://github.com/getsentry/vitest-evals)
integration for structured, standardized eval runs.

## Architecture

```
beforeAll (per scenario):
  1. Start isolated gateway (temp DB)
  2. Replay session turns through gateway (scripted interceptor)
  3. Run /lore:curate (distillation + curation)
  4. Backfill embeddings
  → Gateway is warmed up with Lore state

Each it() test:
  1. Send QA question through warmed-up gateway (LTM + recall tool)
  2. Get response
  3. FactualityJudge scores against reference answer

afterAll:
  Tear down gateway
```
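The lifecycle above can be sketched in plain TypeScript to make the ordering concrete. This is a minimal illustration, not the harness from this commit: `Gateway`, `createFakeGateway`, `containsJudge`, and `runScenario` are hypothetical names, the gateway is an in-memory stand-in, and the judge is a substring check standing in for an LLM-based factuality judge.

```typescript
// Hypothetical types for illustration; none of these names come from the Lore codebase.
interface Gateway {
  replay(turns: string[]): Promise<void>;
  curate(): Promise<void>;
  backfillEmbeddings(): Promise<void>;
  ask(question: string): Promise<string>;
  stop(): Promise<void>;
}

interface QA {
  question: string;
  reference: string;
}

type Judge = (answer: string, reference: string) => Promise<number>;

// In-memory stand-in so the sketch runs without a real gateway, DB, or LLM.
function createFakeGateway(log: string[]): Gateway {
  const memory: string[] = [];
  return {
    async replay(turns) {
      memory.push(...turns);
      log.push("replay");
    },
    async curate() {
      log.push("curate");
    },
    async backfillEmbeddings() {
      log.push("embed");
    },
    async ask(question) {
      log.push("ask");
      // Naive recall: return the first replayed turn that shares a word with the question.
      const words = question
        .toLowerCase()
        .split(/\W+/)
        .filter((w) => w.length > 3);
      const hit = memory.find((turn) =>
        words.some((w) => turn.toLowerCase().includes(w)),
      );
      return hit ?? "unknown";
    },
    async stop() {
      log.push("stop");
    },
  };
}

// Substring check standing in for an LLM-as-judge scorer (1 = match, 0 = no match).
const containsJudge: Judge = async (answer, reference) =>
  answer.toLowerCase().includes(reference.toLowerCase()) ? 1 : 0;

async function runScenario(
  gateway: Gateway,
  turns: string[],
  questions: QA[],
  judge: Judge,
): Promise<number[]> {
  // Corresponds to beforeAll: warm the gateway once per scenario.
  await gateway.replay(turns);
  await gateway.curate();
  await gateway.backfillEmbeddings();

  const scores: number[] = [];
  try {
    // Corresponds to each it(): one question, one response, one score.
    for (const qa of questions) {
      const answer = await gateway.ask(qa.question);
      scores.push(await judge(answer, qa.reference));
    }
  } finally {
    // Corresponds to afterAll: tear down even if a question throws.
    await gateway.stop();
  }
  return scores;
}
```

In the real setup the warm-up is the expensive part, which is presumably why it runs once per scenario in `beforeAll` rather than once per question.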

## New Files

- **`vitest.evals.config.ts`** — Separate Vitest config for evals (600s
test timeout, 30min hook timeout, single-fork, vitest-evals reporter)
- **`packages/core/eval/lore-harness.ts`** — `createHarness()` wrapper
around the Lore gateway with `replayAndWarmup()` for session replay
- **`packages/core/eval/cm1.eval.ts`** — CM-1 scenario (400K inflated)
as vitest-evals tests
- **`packages/core/eval/mega-session.eval.ts`** — 2.3M mega-session as
vitest-evals tests
- **`package.json`** — `bun run evals` script
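
For orientation, a config with the settings listed above might look roughly like the following. This is a sketch under assumptions, not the contents of `vitest.evals.config.ts`: the `include` glob is a guess based on the file paths above, and `poolOptions.forks.singleFork` is the option name used by Vitest 1.x to 3.x, so other Vitest versions may need a different setting.

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed glob: only *.eval.ts files are matched, so regular tests are untouched.
    include: ["packages/core/eval/**/*.eval.ts"],
    // 600 s per test, expressed in milliseconds.
    testTimeout: 600_000,
    // 30 min for beforeAll/afterAll hooks (session replay and curation are slow).
    hookTimeout: 30 * 60 * 1000,
    // Run everything in a single forked process (option names assume Vitest 1.x to 3.x).
    pool: "forks",
    poolOptions: { forks: { singleFork: true } },
    // Reporter path as named in this commit message.
    reporters: ["vitest-evals/reporter"],
  },
});
```

A script such as `vitest run --config vitest.evals.config.ts` would then select this config; whether `bun run evals` maps to exactly that command is not stated here.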

## What This Gets Us

- **Vitest test runner** — standard `vitest run` output, filtering,
parallelism
- **FactualityJudge** — built-in LLM-as-judge (replaces custom judge.ts)
- **GitHub Actions reporter** — summary + annotations via
`vitest-evals/reporter`
- **Standard interface** — each question is an `it()` test case

## What Stays the Same

- Scenario definitions (`scenarios/*.ts`) — unchanged
- Gateway lifecycle — reused from harness.ts
- Scripted interceptor — reused
- Existing `run.ts` / `harness.ts` — coexist (can be removed later)

## Tests
- 1752 pass, 0 fail (existing tests unaffected: `.eval.ts` files are matched
only by the eval config)
- Typecheck clean
Burak Yigit Kaya committed
24b3d3ea70293d51f97f1fc97e3854d4cb212873
Parent: 6d650e5
Committed by GitHub <noreply@github.com> on 5/21/2026, 9:22:53 AM