fix: retry with token-level truncation on ONNX OOM in embedding worker (#457)
## Summary
Fixes ONNX Runtime OOM errors (e.g. `284432024`) that occur when
embedding dense-token content like code, base64, CJK text, or JSON with
short keys, where the chars-per-token ratio is much lower than the ~4
assumed for English prose.
## Problem
The existing character-level pre-truncation (`LOCAL_MAX_CHARS = 16384`,
which is ~4096 tokens at the assumed ~4 chars/token) can produce
6000-8000+ tokens for dense content, well above the ~4096-token safe
threshold for ONNX inference. The `FeatureExtractionPipeline` in
transformers.js hardcodes `{ truncation: true }` without forwarding
`max_length`, so its built-in truncation only caps input at the model
max (8192 tokens), which still OOMs.
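To make the arithmetic concrete, here is a minimal sketch of how the same character budget maps to very different token counts. The helper `estimateTokens` and the dense-content ratios (2.5 and 2 chars/token) are illustrative assumptions, not values measured in the worker:

```typescript
// Rough token estimate for a character budget at a given chars-per-token ratio.
// `estimateTokens` is a hypothetical helper used only for this illustration.
function estimateTokens(chars: number, charsPerToken: number): number {
  return Math.ceil(chars / charsPerToken);
}

const LOCAL_MAX_CHARS = 16384;

// English prose at the assumed ~4 chars/token lands on the safe threshold.
const proseTokens = estimateTokens(LOCAL_MAX_CHARS, 4); // 4096

// Assumed ratios for dense content (code, base64, CJK); real ratios vary by input.
const denseTokens = estimateTokens(LOCAL_MAX_CHARS, 2.5);
const veryDenseTokens = estimateTokens(LOCAL_MAX_CHARS, 2); // 8192, the model max

console.log({ proseTokens, denseTokens, veryDenseTokens });
```

At roughly 2 to 2.7 chars/token, the 16384-character cap corresponds to the 6000-8000+ token range described above.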
## Fix
Instead of aggressively pre-truncating every request, the worker now:
1. **Tries inference at full length first**, so normal English text
passes through untouched
2. **On OOM, catches the error** using the existing `isOomError()`
detector
3. **Retries with token-level truncation** using the pipeline's actual
tokenizer (`encode → slice → decode`), lowering the limit on each retry:
full length → 4096 → 2048 → 1024 tokens
This preserves maximum semantic content for normal texts while
adaptively handling dense-token edge cases.
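The control flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the worker's actual code: `embedWithOomRetry`, the `Infer` and `Truncate` types, and the injected `isOom` predicate are hypothetical stand-ins for `runInference()`, `truncateTexts()`, and `isOomError()`, whose real signatures are not shown here.

```typescript
// Hypothetical signatures; the real helpers in the worker may differ.
type Infer = (texts: string[]) => Promise<number[][]>;
type Truncate = (texts: string[], maxTokens: number) => string[];

// Token limits tried after the initial full-length attempt.
const RETRY_TOKEN_LIMITS = [4096, 2048, 1024];

async function embedWithOomRetry(
  texts: string[],
  infer: Infer,
  truncate: Truncate,
  isOom: (err: unknown) => boolean,
): Promise<number[][]> {
  let lastError: unknown;
  try {
    // Attempt 1: full-length input, so ordinary prose is never truncated.
    return await infer(texts);
  } catch (err) {
    if (!isOom(err)) throw err; // only OOM errors trigger the retry path
    lastError = err;
  }
  for (const limit of RETRY_TOKEN_LIMITS) {
    console.warn(`OOM during embedding; retrying with at most ${limit} tokens`);
    try {
      return await infer(truncate(texts, limit));
    } catch (err) {
      if (!isOom(err)) throw err;
      lastError = err;
    }
  }
  // All four attempts hit OOM: surface the last error to the caller.
  throw lastError;
}
```

Passing the helpers in as parameters keeps the sketch self-contained; in the worker they would presumably be called directly.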
## Changes
- Extracts `runInference()` helper from `processEmbed()` for
retry-ability
- Adds `truncateTexts()` that uses the real tokenizer for exact
content-token counting (`add_special_tokens: false` to exclude
`[CLS]`/`[SEP]`)
- Stashes a reference to `pipe.tokenizer` during pipeline init
- Adds OOM retry loop (1 initial attempt + 3 truncated retries) with
halved token limits
- Logs `console.warn` on each retry for observability
- Updates `LOCAL_MAX_CHARS` comment to reference the new worker-level
defense
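A minimal sketch of the `encode → slice → decode` truncation described above. The `TokenizerLike` interface, the function name `truncateTextsSketch`, and the toy whitespace tokenizer are assumptions made so the example is self-contained; the real transformers.js tokenizer has a richer API, and its exact call shape may differ between versions.

```typescript
// Minimal tokenizer surface assumed by this sketch (hypothetical interface).
interface TokenizerLike {
  encode(text: string, options: { add_special_tokens: boolean }): number[];
  decode(ids: number[], options: { skip_special_tokens: boolean }): string;
}

function truncateTextsSketch(
  texts: string[],
  tokenizer: TokenizerLike,
  maxTokens: number,
): string[] {
  return texts.map((text) => {
    // Count content tokens only, excluding [CLS]/[SEP].
    const ids = tokenizer.encode(text, { add_special_tokens: false });
    if (ids.length <= maxTokens) return text; // short enough: keep the original string
    return tokenizer.decode(ids.slice(0, maxTokens), { skip_special_tokens: true });
  });
}

// Toy whitespace tokenizer, for demonstration only: one token per word.
const vocab: string[] = [];
const toyTokenizer: TokenizerLike = {
  encode: (text) =>
    text.split(/\s+/).filter(Boolean).map((word) => {
      let id = vocab.indexOf(word);
      if (id < 0) {
        vocab.push(word);
        id = vocab.length - 1;
      }
      return id;
    }),
  decode: (ids) => ids.map((id) => vocab[id]).join(" "),
};

console.log(truncateTextsSketch(["a b c d e", "x y"], toyTokenizer, 3));
```

With a real subword tokenizer, decoding a sliced ID list does not always round-trip exactly, so re-encoding the truncated text can yield a slightly different token count.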
Burak Yigit Kaya committed
9cd94ec67cf3d7f097b4678e24ca8c513a8a50ec
Parent: 59447b8
Committed by GitHub <noreply@github.com>
on 5/22/2026, 7:09:37 PM