fix: correct Qwen3.5 causal masking, add lazy logits, compiled verify, KOD benchmarks

Previous Qwen3.5 benchmarks were INVALID: a bidirectional attention mask bug caused artificially high acceptance rates and degenerate outputs. Fixed by passing cache[fa_idx] (KVCache) to create_attention_mask instead of cache[0] (ArraysCache).

Also adds:
- Lazy logit computation (no speedup: MLX eval overhead)
- Accept-all-block path (no speedup: MLX lazy eval handles it)
- Compiled full-attention verify (~5% on Qwen3.5)
- New benchmark files spec-2.json, spec-with-kod-2.json with correct results
- Server --no-think flag for Qwen3.5 enable_thinking=False
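
A minimal sketch of the masking fix, in plain Python rather than MLX. The KVCache and ArraysCache classes here are hypothetical stand-ins, not the library's own classes, and first_full_attention_index and causal_mask are illustrative helpers that do not appear in this commit. It assumes the full-attention cache entry carries the token offset needed to build a causal mask, while the entry at index 0 does not.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Hypothetical stand-in for a full-attention cache entry."""
    offset: int = 0  # number of tokens already cached


@dataclass
class ArraysCache:
    """Hypothetical stand-in for a non-attention state cache (no offset)."""


def first_full_attention_index(cache):
    """Return the index of the first full-attention (KVCache) entry."""
    for i, entry in enumerate(cache):
        if isinstance(entry, KVCache):
            return i
    raise ValueError("no full-attention cache entry found")


def causal_mask(n_new, offset):
    """mask[i][j] is True when new token i may attend to key position j.

    New token i sits at absolute position offset + i, so it may see
    every key at position <= offset + i and nothing after it.
    """
    total = offset + n_new
    return [[j <= offset + i for j in range(total)] for i in range(n_new)]


# Assumed layout for illustration only: index 0 is not a KVCache.
cache = [ArraysCache(), ArraysCache(), ArraysCache(), KVCache(offset=2)]

fa_idx = first_full_attention_index(cache)
mask = causal_mask(n_new=3, offset=cache[fa_idx].offset)
```

With two cached tokens and three new ones, the first new token sees positions 0 through 2 and the last sees all five. A mask built without the correct offset source can let draft tokens attend to later positions, which is consistent with the inflated acceptance rates described above.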
clandestine.eth committed
e220fb3594b408996d43745406c0b555e0cf625a
Parent: 8284f4d