
bench: add SWE-lite 5-way validation results + functional validators

Adds the deep-SWE benchmark harness, raw patches/logs/traces, and the
sandbox functional-validation suite for the 5-way comparison (codedb,
graphify, codegraph, leanctx, baseline) across 5 big-repo feature tasks.

Functional validation re-ranks the variants by correctness, inverting the
cost-only ranking:
  codedb 35/41 (85.4%), leanctx 23/29 (79.3%), graphify 22/31 (71.0%),
  codegraph 15/25 (60.0%), baseline 12/23 (52.2%).
codedb has the highest functional pass rate and zero broken patches; two of
the five largest fastapi-HEAD patches (baseline, leanctx) are broken.
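
The pass rates above follow directly from the passed/total counts. A minimal
Python sketch that recomputes and ranks them; the COUNTS layout and the
rank_by_pass_rate name are illustrative assumptions, not code from the harness:

```python
# Passed/total functional checks per variant, copied from the figures above.
COUNTS = {
    "codedb": (35, 41),
    "leanctx": (23, 29),
    "graphify": (22, 31),
    "codegraph": (15, 25),
    "baseline": (12, 23),
}


def rank_by_pass_rate(counts):
    """Return (variant, percent) pairs sorted by pass rate, highest first."""
    rates = {
        name: 100.0 * passed / total
        for name, (passed, total) in counts.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, pct in rank_by_pass_rate(COUNTS):
        print(f"{name:10s} {pct:.1f}%")
```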

- validators/gen_summary.py regenerates summary.json from raw results
  (excludes *_err diagnostic checks); summary.json + VALIDATION.md now agree
- run_validation.sh reads SANDBOX_API_KEY from env (no hardcoded credential)
- .gitignore: ignore logs/
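
A rough sketch of the kind of aggregation gen_summary.py performs when it
excludes *_err diagnostic checks. The input shape (variant -> check name ->
bool), the summarize name, and the sample data are all assumptions made for
illustration; the actual script and raw-result format may differ:

```python
import json


def summarize(raw_results):
    """Aggregate pass counts per variant, skipping *_err diagnostic checks.

    raw_results is assumed to map variant -> {check_name: passed_bool}.
    """
    summary = {}
    for variant, checks in raw_results.items():
        # Checks whose name ends in "_err" are diagnostics, not scored results.
        scored = {
            name: ok for name, ok in checks.items()
            if not name.endswith("_err")
        }
        passed = sum(1 for ok in scored.values() if ok)
        total = len(scored)
        summary[variant] = {
            "passed": passed,
            "total": total,
            "pass_rate": round(100.0 * passed / total, 1) if total else None,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data, not real benchmark output.
    raw = {"demo": {"route_ok": True, "schema_ok": False, "import_err": False}}
    print(json.dumps(summarize(raw), indent=2, sort_keys=True))
```

Deriving summary.json from the raw results with one function like this keeps
the scored totals in one place, which is what lets summary.json and
VALIDATION.md agree.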

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
justrach committed
e89e110a695ec64de2d3083b31644011457e55eb
Parent: 2e8b668