fix: use Jaccard similarity in dedup to prevent same-domain false UPDATE
ContentSimilarity (bidirectional max) was too sensitive for formulaic scientific records: a Raub butterfly entry sharing the species name and standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing the UPDATE threshold and replacing the original.

Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but have many distinct tokens (different facts). Same-domain, different-location pairs now score ~0.28, below the 0.5 UPDATE threshold, so they are ADDed instead of replacing the original. Genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE.

ContentSimilarity is unchanged: bidirectional max remains correct for recall and keyword search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
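
The difference between the two measures can be illustrated with a minimal Python sketch. It is not the code from this commit: the tokenizer, the function names `bidirectional_max`, `jaccard` and `decide`, and the sample strings are all hypothetical, and "bidirectional max" is assumed here to mean the larger of the two overlap ratios |A∩B|/|A| and |A∩B|/|B|. Only the 0.5 threshold and the Jaccard formula come from the message above.

```python
import re

UPDATE_THRESHOLD = 0.5  # from the commit message; ">=" is an assumption


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bidirectional_max(a: str, b: str) -> float:
    """Assumed definition: max of the overlap ratio in each direction."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    shared = len(ta & tb)
    return max(shared / len(ta), shared / len(tb))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)


def decide(score: float) -> str:
    """Map a similarity score to a dedup action (hypothetical helper)."""
    return "UPDATE" if score >= UPDATE_THRESHOLD else "ADD"


# Made-up records: same domain vocabulary, different location and facts.
short_rec = "butterfly recorded at raub"
long_rec = "butterfly recorded at kinabalu near the summit trail in march"

# Made-up records: a genuine one-word change.
old_fact = "the project uses SQLite for storage"
new_fact = "the project uses PostgreSQL for storage"

if __name__ == "__main__":
    for x, y in [(short_rec, long_rec), (old_fact, new_fact)]:
        bm, jc = bidirectional_max(x, y), jaccard(x, y)
        print(f"bidirectional_max={bm:.2f} ({decide(bm)})  "
              f"jaccard={jc:.2f} ({decide(jc)})")
```

On these toy strings the short record is mostly contained in the long one, so the overlap ratio taken from the short side is high even though the long record carries many tokens the short one lacks. Jaccard divides by the union instead, so those extra tokens pull the score down, while the one-word edit stays above the threshold under both measures. The exact scores depend on the real tokenizer and will not match the figures quoted in the message.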
chancsc committed
8dc68a97f112b3d8be07f84ece34902a0836d464
Parent: d70697a
Committed by Claude <noreply@anthropic.com>
on 5/17/2026, 9:58:00 AM