Use a shared work queue instead of static partitioning in matcher
Replace static chunk partitioning (sliceChunks) with a shared atomic
counter that workers pull from. This gives natural load balancing:
workers that finish chunks quickly grab more work instead of idling.

With this change, NumCPU workers suffice (no need for 8x
oversubscription), reducing goroutine overhead while improving
throughput by 5-22%.

Performance now scales linearly with the number of threads:

    === query: 'linux' ===
      [all]  baseline:  17.12ms  current:  14.28ms  (1.20x)  matches: 179966 (12.79%)
      [1T]   baseline: 136.49ms  current: 137.25ms  (0.99x)  matches: 179966 (12.79%)
      [2T]   baseline:  75.74ms  current:  68.75ms  (1.10x)  matches: 179966 (12.79%)
      [4T]   baseline:  41.16ms  current:  34.97ms  (1.18x)  matches: 179966 (12.79%)
      [8T]   baseline:  32.82ms  current:  17.79ms  (1.84x)  matches: 179966 (12.79%)
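For illustration only, the shared-counter scheme described above could look roughly like the sketch below. It is not the code from this commit: `countMatches`, the `[][]string` chunk type, and the substring test are hypothetical stand-ins for the real matcher, and `atomic.Int64` assumes Go 1.19 or later. Each worker claims the next unclaimed chunk index from a single atomic counter until the chunks run out.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
	"sync/atomic"
)

// countMatches is a hypothetical stand-in for the matcher: it counts the
// items containing query, with workers pulling chunks from a shared counter.
func countMatches(chunks [][]string, query string, workers int) int {
	if workers < 1 {
		workers = 1
	}
	var next atomic.Int64  // index of the next unclaimed chunk
	var total atomic.Int64 // running match count across all workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Add returns the incremented value, so subtract 1
				// to get the index this worker has just claimed.
				idx := int(next.Add(1) - 1)
				if idx >= len(chunks) {
					return // no work left
				}
				n := 0
				for _, item := range chunks[idx] {
					if strings.Contains(item, query) {
						n++
					}
				}
				total.Add(int64(n))
			}
		}()
	}
	wg.Wait()
	return int(total.Load())
}

func main() {
	chunks := [][]string{
		{"linux-kernel", "macos"},
		{"windows"},
		{"archlinux", "linux-mint", "bsd"},
	}
	// One worker per CPU; a fast worker simply claims more chunks.
	fmt.Println(countMatches(chunks, "linux", runtime.NumCPU()))
}
```

Because a worker that finishes early immediately claims another index, no chunk-to-worker assignment has to be computed up front, which is why one worker per CPU can be enough in a scheme like this.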
Junegunn Choi committed
92bfe68c741913cb663e19386269bc28119e0961
Parent: 92dc40e