Parallelize post-passes, fix mode filtering + semantic edge quality
- Parallelize pass_similarity and pass_semantic_edges via worker pool
  with thread-local edge buffers; sequential final merge since gbuf is
  not thread-safe. Adds cbm_lsh_query_into() as a thread-safe variant
  with caller-provided candidate buffer.
- Add activatable profiling subsystem (CBM_PROFILE=1 env or --profile
  flag) for step-level timing of extract, resolve, corpus build, vector
  phases, and sqlite dump. Zero overhead when disabled.
- Fix cbm_index_mode_t enum mismatch between pipeline.h (FULL=0,
  MODERATE=1, FAST=2) and discover.h (FULL=0, FAST=1). mode=fast
  silently no-op'd fast-discovery filtering because discover.c compared
  against the wrong value. Linux kernel fast mode went 1:40 -> 3:11 as
  a result; now back to 1:40. Broaden the filter guard to
  mode != CBM_MODE_FULL so MODERATE and FAST both get aggressive
  discovery.
- Clamp cbm_sem_combined_score output to [0, 1]. The proximity
  multiplier returns up to 1.10 as a same-file boost which could push
  the final cosine score above 1.0.
- Short-circuit semantic scoring when MinHash jaccard >= 0.95. Exact
  near-clones are already emitted as SIMILAR_TO edges; returning 0 here
  avoids flooding SEMANTICALLY_RELATED with cross-service copy-paste
  boilerplate and frees the edge budget for genuine vocabulary-bridged
  relations.
- Validate search_graph semantic_query as an array of strings and
  return a clear error for a single-string input. Update the tool
  description to spell out the requirement explicitly with an example.
- JSON-escape user-controlled strings (callee names, call arguments,
  URL paths, import local_name) in call/argument properties. Introduces
  cbm_json_escape() in foundation/str_util.
- Skip SQLite pending_byte_page (file offset 0x40000000) during raw
  page writes in sqlite_writer to avoid corrupting databases that cross
  the 1 GiB boundary.
- Migrate pretrained vector blob from UniXcoder (51K tokens) to
  nomic-embed-code (40856 tokens x 768d int8). Includes the extraction
  script under scripts/extract_nomic_vectors.py.
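
The enum mismatch above can be illustrated with a small sketch. All `sketch_` names are hypothetical, and the shape of the old guard (an equality test against discover.h's FAST=1) is an assumption based on the description, not a copy of discover.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as listed for pipeline.h in the commit message. */
typedef enum {
    SKETCH_MODE_FULL = 0,
    SKETCH_MODE_MODERATE = 1,
    SKETCH_MODE_FAST = 2
} sketch_index_mode_t;

/* Assumed shape of the old guard: it tested against discover.h's FAST=1,
 * so a caller passing pipeline.h's FAST=2 never enabled the filter. */
static bool sketch_old_guard(int mode)
{
    return mode == 1;
}

/* Broadened guard: anything other than FULL gets aggressive discovery. */
static bool sketch_new_guard(sketch_index_mode_t mode)
{
    assert(mode >= SKETCH_MODE_FULL && mode <= SKETCH_MODE_FAST);
    return mode != SKETCH_MODE_FULL;
}
```

With these definitions, `sketch_old_guard(SKETCH_MODE_FAST)` is false while `sketch_new_guard(SKETCH_MODE_FAST)` and `sketch_new_guard(SKETCH_MODE_MODERATE)` are both true.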
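
A minimal sketch of the score clamp and the near-clone short-circuit, assuming the combined score is the cosine similarity times the proximity multiplier. The real signature of cbm_sem_combined_score is not shown in this commit message, so the function and parameter names below are hypothetical.

```c
#include <assert.h>

/* Threshold taken from the 0.95 figure in the commit message. */
#define SKETCH_NEAR_CLONE_JACCARD 0.95f

/* Hypothetical stand-in for cbm_sem_combined_score. */
static float sketch_combined_score(float cosine, float proximity_mult,
                                   float minhash_jaccard)
{
    assert(minhash_jaccard >= 0.0f && minhash_jaccard <= 1.0f);

    /* Near-clones are left to SIMILAR_TO edges, so score them as 0 here. */
    if (minhash_jaccard >= SKETCH_NEAR_CLONE_JACCARD) {
        return 0.0f;
    }

    /* A same-file boost of up to 1.10 can push the product above 1.0. */
    float score = cosine * proximity_mult;

    /* Clamp to [0, 1]. */
    if (score < 0.0f) score = 0.0f;
    if (score > 1.0f) score = 1.0f;
    return score;
}
```

For example, a cosine of 0.98 with a 1.10 boost would be 1.078 unclamped; the sketch returns 1.0 instead.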
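
The JSON escaping of user-controlled strings might look roughly like the following. This is an illustrative sketch only: the actual cbm_json_escape() signature in foundation/str_util is not given here, so `sketch_json_escape` and its buffer-based interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Escapes src into dst as the body of a JSON string (no surrounding quotes).
 * Returns the number of bytes written excluding the NUL terminator,
 * or (size_t)-1 if dst_cap is too small. */
static size_t sketch_json_escape(const char *src, char *dst, size_t dst_cap)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);

    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        char tmp[7]; /* large enough for "\u001f" plus NUL */
        const char *rep = NULL;

        switch (*p) {
        case '"':  rep = "\\\""; break;
        case '\\': rep = "\\\\"; break;
        case '\n': rep = "\\n";  break;
        case '\r': rep = "\\r";  break;
        case '\t': rep = "\\t";  break;
        default:
            if (*p < 0x20) {
                /* Remaining control characters use the \u00XX form. */
                snprintf(tmp, sizeof tmp, "\\u%04x", (unsigned)*p);
            } else {
                tmp[0] = (char)*p;
                tmp[1] = '\0';
            }
            rep = tmp;
        }

        size_t len = strlen(rep);
        if (n + len + 1 > dst_cap) {
            return (size_t)-1; /* no room for this piece plus the NUL */
        }
        memcpy(dst + n, rep, len);
        n += len;
    }

    if (n + 1 > dst_cap) {
        return (size_t)-1; /* covers dst_cap == 0 with an empty src */
    }
    dst[n] = '\0';
    return n;
}
```

Bytes at or above 0x20 other than the quote and backslash pass through unchanged, so UTF-8 sequences in callee names or URL paths are preserved as-is.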
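
For the pending-byte fix, SQLite reserves the page containing file offset 0x40000000 and never stores content there. A sketch of how a raw page writer could step over it follows; the `sketch_` helpers are hypothetical and do not reflect how sqlite_writer is actually structured.

```c
#include <assert.h>
#include <stdint.h>

/* File offset of SQLite's pending byte (1 GiB). */
#define SKETCH_PENDING_BYTE 0x40000000u

/* 1-based page number of the page that contains the pending byte. */
static uint32_t sketch_pending_byte_page(uint32_t page_size)
{
    /* SQLite page sizes are powers of two between 512 and 65536. */
    assert(page_size >= 512u && page_size <= 65536u);
    assert((page_size & (page_size - 1u)) == 0u);
    return SKETCH_PENDING_BYTE / page_size + 1u;
}

/* Next page number a writer may use after pgno, skipping the reserved page. */
static uint32_t sketch_next_page(uint32_t pgno, uint32_t page_size)
{
    uint32_t next = pgno + 1u;
    if (next == sketch_pending_byte_page(page_size)) {
        next++;
    }
    return next;
}
```

With 4096-byte pages the reserved page is number 262145, so a writer that has just emitted page 262144 would continue at 262146.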
Martin Vogel committed
8a06d78ac7f7b6c8d254c58967285bb65a5a37c7
Parent: bf70078