fix: use Jaccard similarity in dedup to prevent same-domain false UPDATE

ContentSimilarity (bidirectional max) was too sensitive for formulaic
scientific records: a Raub butterfly entry sharing the species name and
standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing
the UPDATE threshold and replacing the original.

Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but
have many distinct tokens (different facts). Same-domain different-location
pairs now score ~0.28, below the 0.5 UPDATE threshold → ADD. Genuine
one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE.
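
A minimal sketch of the difference, not the code from this commit. It
assumes "bidirectional max" means max(|A∩B|/|A|, |A∩B|/|B|), that tokens
are lowercased word runs, and that a score >= 0.5 means UPDATE. The
function names and the sample records are hypothetical, so the scores
differ from the ~0.28 and ~0.6 quoted above.

```python
import re

UPDATE_THRESHOLD = 0.5  # assumed: score >= threshold -> UPDATE, else ADD


def tokens(text):
    """Lowercased set of word tokens (assumed tokenisation)."""
    return set(re.findall(r"\w+", text.lower()))


def bidirectional_max(a, b):
    """Overlap relative to each text, taking the larger ratio (assumed definition)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    shared = len(ta & tb)
    return max(shared / len(ta), shared / len(tb))


def jaccard(a, b):
    """|A∩B| / |A∪B|: distinct tokens on either side lower the score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)


def dedup_action(a, b):
    """Hypothetical decision rule using Jaccard."""
    return "UPDATE" if jaccard(a, b) >= UPDATE_THRESHOLD else "ADD"


# Made-up records: same species, different location and facts.
short_rec = "Rajah Brooke's birdwing seen at Raub"
long_rec = ("Rajah Brooke's birdwing recorded on Mount Kinabalu, "
            "Sabah at 1500 m elevation")

# Made-up one-word change.
old_fact = "The project uses SQLite for storage"
new_fact = "The project uses PostgreSQL for storage"

print(bidirectional_max(short_rec, long_rec) >= UPDATE_THRESHOLD)  # True
print(dedup_action(short_rec, long_rec))                           # ADD
print(dedup_action(old_fact, new_fact))                            # UPDATE
```

In this sketch the short record shares 5 of its 7 tokens with the long
one, so the bidirectional max crosses 0.5, while Jaccard divides the same
5 shared tokens by the 15-token union and stays below it.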

ContentSimilarity is unchanged — bidirectional max remains correct for
recall and keyword search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chancsc committed
8dc68a97f112b3d8be07f84ece34902a0836d464
Parent: d70697a
Committed by Claude <noreply@anthropic.com> on 5/17/2026, 9:58:00 AM