bench: add SWE-lite 5-way validation results + functional validators
Adds the deep-SWE benchmark harness, raw patches/logs/traces, and the sandbox functional-validation suite for the 5-way comparison (codedb, graphify, codegraph, leanctx, baseline) across 5 big-repo feature tasks.

Functional validation re-ranks the variants by correctness, inverting the cost-only ranking: codedb 35/41 (85.4%), leanctx 23/29 (79.3%), graphify 22/31 (71.0%), codegraph 15/25 (60.0%), baseline 12/23 (52.2%). codedb has the highest functional pass rate and zero broken patches; two of the five largest fastapi-HEAD patches (baseline, leanctx) are broken.

- validators/gen_summary.py regenerates summary.json from raw results (excludes *_err diagnostic checks); summary.json + VALIDATION.md now agree
- run_validation.sh reads SANDBOX_API_KEY from env (no hardcoded credential)
- .gitignore: ignore logs/

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
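For readers who want to check the pass rates above, here is a minimal sketch of how a summary could be derived from raw per-check results while excluding *_err diagnostic checks. It is not the actual validators/gen_summary.py: the input shape (variant name mapped to check name mapped to pass/fail) and the names `summarize` and `rank` are assumptions made for illustration.

```python
from typing import Dict, List


def summarize(raw: Dict[str, Dict[str, bool]]) -> Dict[str, dict]:
    """Compute passed/total/pass_rate per variant.

    Assumed input shape (hypothetical): {variant: {check_name: passed}}.
    Checks whose name ends in "_err" are treated as diagnostics and skipped.
    """
    summary = {}
    for variant, checks in raw.items():
        scored = {name: ok for name, ok in checks.items()
                  if not name.endswith("_err")}
        passed = sum(1 for ok in scored.values() if ok)
        total = len(scored)
        # Percentage rounded to one decimal place, as in the commit message.
        rate = round(100.0 * passed / total, 1) if total else 0.0
        summary[variant] = {"passed": passed, "total": total, "pass_rate": rate}
    return summary


def rank(summary: Dict[str, dict]) -> List[str]:
    """Order variants by functional pass rate, highest first."""
    return sorted(summary, key=lambda v: summary[v]["pass_rate"], reverse=True)
```

Under these assumptions, a variant with 35 passing checks out of 41 scored checks comes out at 85.4, which matches the figure reported for codedb.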
justrach committed
e89e110a695ec64de2d3083b31644011457e55eb
Parent: 2e8b668