Sync prompts and defaults with evals repo
LOCOMO:
- Fix reference_date to use last (latest) session instead of first
- Default answerer/judge model to gpt-5

LongMemEval:
- Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), conflicting numbers, personalization scan, BIAS CHECK in judge, chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK
- Default answerer/judge model to gpt-5

BEAM:
- Fix question types to match actual dataset (6 were hallucinated)
- Sync prompts from evals: chronological memory sorting, ISO dates
- Sort memories by created_at before answer generation
- Default answerer/judge model to gpt-5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
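For illustration only, the two data-handling fixes named in this message (taking reference_date from the latest session, and sorting memories by created_at before answer generation) might look roughly like the sketch below. This is not the committed code: the function names, the dict layout, and the assumption that dates are timezone-naive ISO 8601 strings are all hypothetical.

```python
from datetime import datetime


def latest_session_date(session_dates):
    """Pick the reference date from the latest session rather than the first.

    Assumes (hypothetically) that each entry is a timezone-naive ISO 8601 string.
    """
    return max(datetime.fromisoformat(d) for d in session_dates)


def sort_memories_chronologically(memories):
    """Return memories ordered oldest-first by their created_at timestamp.

    Assumes (hypothetically) that each memory is a dict with an ISO 8601
    `created_at` field; the real record shape may differ.
    """
    return sorted(memories, key=lambda m: datetime.fromisoformat(m["created_at"]))
```

Parsing the timestamps before comparing avoids relying on string ordering, which only matches chronological order when every timestamp uses the same format.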
Soumil Rathi committed
bd063eea04de4f8a19927beea155afa094a01905
Parent: 37cccf8