fix: remove repetition_penalty from vLLM services (fixes #2950 CUDA assert)

- repetition_penalty is incompatible with enable_prompt_embeds=True in vLLM
- Removed from both serve_realtime_ws.py and serve_vllm.py
- Added truncate_repetition() post-processing to serve_vllm.py as an alternative
- serve_realtime_ws.py already has detect_and_fix_hallucination() for this
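
For illustration only, a minimal sketch of what a post-processing step like truncate_repetition() could look like. The commit does not show the real implementation, so the function name `truncate_repetition_sketch`, its parameters, and the regex approach are all assumptions: it finds the first unit that repeats back to back at least `min_repeats` times and cuts the text after the first copy.

```python
import re


def truncate_repetition_sketch(text: str, min_unit: int = 2,
                               max_unit: int = 50, min_repeats: int = 4) -> str:
    """Hypothetical stand-in for truncate_repetition(); not the code from the commit.

    Finds the first substring of min_unit..max_unit characters that occurs
    at least min_repeats times in a row and truncates the text right after
    its first occurrence. Text without such a run is returned unchanged.
    """
    # (unit) followed by at least (min_repeats - 1) more copies of itself.
    pattern = re.compile(
        r"(.{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1),
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return text
    # Keep everything before the run plus a single copy of the unit.
    return text[: match.start() + len(match.group(1))]
```

Because this runs on the decoded text, it does not depend on token IDs and is therefore unaffected by enable_prompt_embeds=True.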

Root cause: vLLM's repetition penalty implementation performs a scatter
indexed by prompt token IDs, but prompt_embeds mode supplies no token IDs,
so the index goes out of bounds and triggers the CUDA assert.
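
A toy CPU analogue of this failure mode, in plain Python rather than vLLM's actual kernel. The function name and the explicit bounds check are assumptions for illustration. The penalty rule is the commonly used one: divide positive logits by the penalty and multiply negative ones. The point is that the penalty is applied by indexing logits with token IDs, so an ID outside the vocabulary range cannot be handled.

```python
def apply_repetition_penalty_sketch(logits, seen_token_ids, penalty):
    """Toy repetition penalty over a list of logits; not vLLM's code.

    Each distinct seen token ID is used as an index into the logits.
    An ID outside [0, vocab_size) raises IndexError here; on a GPU the
    equivalent out-of-bounds scatter surfaces as a device-side assert.
    """
    out = list(logits)
    vocab_size = len(out)
    for tok in set(seen_token_ids):
        if not 0 <= tok < vocab_size:
            raise IndexError(
                f"token id {tok} outside vocabulary of size {vocab_size}"
            )
        # Push the logit of an already-seen token toward "less likely".
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

With valid IDs the penalty is applied as expected; with an ID that does not correspond to a real vocabulary entry the call fails, which is why removing repetition_penalty avoids the crash.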

游雁 committed 11d ago

b8d736ef8496ee42fac450c6df74ffde64d7b643

Parent: 679f40c