Parallelize post-passes, fix mode filtering + semantic edge quality

- Parallelize pass_similarity and pass_semantic_edges via worker pool with
  thread-local edge buffers; sequential final merge since gbuf is not
  thread-safe. Adds cbm_lsh_query_into() as a thread-safe variant with
  caller-provided candidate buffer.

- Add activatable profiling subsystem (CBM_PROFILE=1 env or --profile flag)
  for step-level timing of extract, resolve, corpus build, vector phases,
  and sqlite dump. Zero overhead when disabled.

- Fix cbm_index_mode_t enum mismatch between pipeline.h (FULL=0, MODERATE=1,
  FAST=2) and discover.h (FULL=0, FAST=1). mode=fast silently no-op'd
  fast-discovery filtering because discover.c compared against the wrong
  value. Linux kernel fast mode went 1:40 -> 3:11 as a result; now back to
  1:40. Broaden the filter guard to mode != CBM_MODE_FULL so MODERATE and
  FAST both get aggressive discovery.

- Clamp cbm_sem_combined_score output to [0, 1]. The proximity multiplier
  returns up to 1.10 as a same-file boost, which could push the final
  score above 1.0.

- Short-circuit semantic scoring when MinHash jaccard >= 0.95. Near-exact
  clones are already emitted as SIMILAR_TO edges; returning 0 here
  avoids flooding SEMANTICALLY_RELATED with cross-service copy-paste
  boilerplate and frees the edge budget for genuine vocabulary-bridged
  relations.

- Validate search_graph semantic_query as an array of strings and return
  a clear error for a single-string input. Update the tool description
  to spell out the requirement explicitly with an example.

- JSON-escape user-controlled strings (callee names, call arguments,
  URL paths, import local_name) in call/argument properties. Introduces
  cbm_json_escape() in foundation/str_util.

- Skip SQLite pending_byte_page (file offset 0x40000000) during raw page
  writes in sqlite_writer to avoid corrupting databases that cross the
  1 GiB boundary.

- Migrate pretrained vector blob from UniXcoder (51K tokens) to
  nomic-embed-code (40856 tokens x 768d int8). Includes the extraction
  script under scripts/extract_nomic_vectors.py.
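
For illustration only, the score clamp and the near-clone short-circuit
described above could be sketched as follows. This is not the code from this
commit: the sketch_* names, the parameter list, and the assumption that the
final score is simply cosine times the proximity multiplier are all
hypothetical. Only the 0.95 threshold, the 1.10 boost, and the [0, 1] range
come from the message.

```c
/* Hypothetical sketch; not the actual cbm_sem_combined_score. */

/* Threshold quoted in the commit message. */
#define SKETCH_NEAR_CLONE_JACCARD 0.95

/* Clamp a value into the closed interval [0, 1]. */
static double sketch_clamp01(double x) {
    if (x < 0.0) return 0.0;
    if (x > 1.0) return 1.0;
    return x;
}

/*
 * cosine:    raw embedding similarity (assumed input)
 * proximity: multiplier, up to 1.10 for a same-file pair
 * jaccard:   MinHash jaccard estimate for the same pair
 */
static double sketch_combined_score(double cosine, double proximity,
                                    double jaccard) {
    /* Near-clones are already covered by SIMILAR_TO edges, so they
     * contribute nothing to SEMANTICALLY_RELATED. */
    if (jaccard >= SKETCH_NEAR_CLONE_JACCARD) {
        return 0.0;
    }
    /* The boost can exceed 1.0, so clamp the product. */
    return sketch_clamp01(cosine * proximity);
}
```

Under these assumptions, a same-file pair with cosine 0.98 and the 1.10 boost
scores 1.0 instead of 1.078, and any pair with jaccard at or above 0.95 scores
0 regardless of its cosine.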
Martin Vogel committed
8a06d78ac7f7b6c8d254c58967285bb65a5a37c7
Parent: bf70078