
fix: remove repetition_penalty from vLLM services (fixes #2950 CUDA assert)

- repetition_penalty is incompatible with enable_prompt_embeds=True in vLLM
- Removed from both serve_realtime_ws.py and serve_vllm.py
- Added truncate_repetition() post-processing to serve_vllm.py as an alternative
- serve_realtime_ws.py already has detect_and_fix_hallucination() for this
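
A minimal sketch of what post-processing in the spirit of truncate_repetition() might look like. This is not the code from this commit: the name truncate_repetition_sketch, its parameters, and the thresholds are hypothetical, chosen only to show how a repeated tail can be collapsed after decoding instead of being penalized during sampling.

```python
def truncate_repetition_sketch(text: str, max_unit_len: int = 20, min_repeats: int = 4) -> str:
    """Collapse a unit that repeats back-to-back at the end of `text`.

    Hypothetical illustration only; the thresholds are assumptions,
    not values taken from serve_vllm.py.
    """
    # Try every candidate unit length, shortest first.
    longest = min(max_unit_len, len(text) // min_repeats)
    for unit_len in range(1, longest + 1):
        unit = text[-unit_len:]
        count = 1
        pos = len(text) - unit_len  # start index of the last occurrence
        # Walk backwards while the preceding slice equals the unit.
        while pos - unit_len >= 0 and text[pos - unit_len:pos] == unit:
            count += 1
            pos -= unit_len
        if count >= min_repeats:
            # Keep everything before the run plus a single copy of the unit.
            return text[:pos + unit_len]
    return text
```

Because this works on the decoded string, it needs no prompt token IDs and so sidesteps the incompatibility with enable_prompt_embeds=True. The trade-off is that it only trims a repeated tail; it does not steer generation away from repetition.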

Root cause: vLLM's repetition penalty implementation performs a scatter over
the prompt token IDs, but in prompt_embeds mode there are no prompt token IDs,
so the scatter indexes out of bounds and triggers the CUDA assert.
游雁 committed
b8d736ef8496ee42fac450c6df74ffde64d7b643
Parent: 679f40c