Sync prompts and defaults with evals repo

LOCOMO:
- Fix reference_date to use last (latest) session instead of first
- Default answerer/judge model to gpt-5
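
A minimal sketch of the reference_date fix, assuming each session carries an ISO 8601 timestamp string. The function name pick_reference_date and the sample timestamps are hypothetical and are not taken from the repo:

```python
from datetime import datetime

def pick_reference_date(session_dates):
    """Return the latest session timestamp, or None if there are no sessions.

    Assumption: session_dates is a list of ISO 8601 strings, in any order.
    """
    parsed = [datetime.fromisoformat(d) for d in session_dates]
    # Use the most recent session, not the first one in the list.
    return max(parsed) if parsed else None

# Hypothetical sample data, deliberately out of chronological order.
sessions = ["2023-05-08T13:56:00", "2023-07-02T09:10:00", "2023-06-15T18:30:00"]
reference_date = pick_reference_date(sessions)
```

Taking the maximum rather than the last list element keeps the result correct even when sessions are not stored in chronological order.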

LongMemEval:
- Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions),
  conflicting numbers, personalization scan, BIAS CHECK in judge,
  chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK
- Default answerer/judge model to gpt-5

BEAM:
- Fix question types to match the actual dataset (6 of the previous types did not exist in it)
- Sync prompts from evals: chronological memory sorting, ISO dates
- Sort memories by created_at before answer generation
- Default answerer/judge model to gpt-5
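
A rough sketch of the chronological sorting step, assuming each memory is a dict with a created_at ISO 8601 string and a memory text field. The field name "memory" and both helper names are hypothetical; only created_at comes from the commit message:

```python
from datetime import datetime

def sort_memories(memories):
    """Return memories ordered oldest to newest by their created_at timestamp."""
    return sorted(memories, key=lambda m: datetime.fromisoformat(m["created_at"]))

def format_memories(memories):
    """Render sorted memories as one line each, prefixed with an ISO date."""
    lines = []
    for m in sort_memories(memories):
        day = datetime.fromisoformat(m["created_at"]).date().isoformat()
        lines.append(f"[{day}] {m['memory']}")
    return "\n".join(lines)

# Hypothetical sample data, deliberately out of order.
memories = [
    {"created_at": "2024-03-10T08:00:00", "memory": "Moved to Berlin"},
    {"created_at": "2024-01-05T12:30:00", "memory": "Started a new job"},
]
context = format_memories(memories)
```

Parsing the timestamps before sorting avoids relying on string ordering, which only works when every timestamp uses exactly the same format.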

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Soumil Rathi committed 2mo ago

bd063eea04de4f8a19927beea155afa094a01905

Parent: 37cccf8