fix(grounding): per-chunk multi-pass NLI scoring, attach pipeline.ground telemetry
HallmarkServing used to concatenate every retrieved chunk into one context and score each sentence against that one blob. That's fine for Advanced RAG's 5-10 chunks but breaks on Pipeline runs where decompose and rerank easily accumulate 20+ chunks: HHEM's ModernBERT has a fixed input window, the context truncates, and sentences whose evidence lived in the tail get flagged as hallucinated even when the chunks were retrieved correctly (3.9% faithful on "Who are the Cybermen" despite 79-92% word overlap). Switched to scoring each (sentence, chunk) pair individually and taking the max per sentence. Semantics: "did any chunk support this sentence?" The full sentence × chunk grid goes through Hallmark.score_batch, which would OOM Metal at 160 pairs, so the batch is split into sub-batches of 8 pairs each and the NLI forward pass runs per sub-batch. Small enough to fit on an Apple Silicon GPU, still batched enough that throughput is fine (a typical Pipeline run: ~15 forward passes of 8 pairs, ~2s total). Also added [:arcana, :pipeline, :ground, :stop] to the telemetry logger. Grounding was running silently because the logger attach list omitted it, which made it look like grounding never ran when debugging a 3.9% score.
G
George Guimarães committed
29bd21f5029bc8317eec0068928359516c68e9f4
Parent: eb999a0