🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.
[serving] Fix continuous batching JSON response serialization (#45057)
* Fix continuous batching JSON response serialization

  Change model_dump_json() to model_dump() to avoid double JSON encoding. When using continuous batching with stream=false, the response was being double-encoded as a string instead of returning a proper JSON object.

* add example script eval-job

* fix script

* Add test for continuous batching non-streaming JSON response

  Test verifies that non-streaming responses with continuous batching return proper JSON objects rather than double-encoded JSON strings. This is a regression test for the fix where model_dump_json() was changed to model_dump() in the continuous batching response handler.

* fix ci

* Update eval script to use official transformers repo main branch

  Changed dependency from personal fork to official huggingface/transformers@main for production use of the evaluation script.

* add kernels and flash attn 2

* Add continuous batching configuration CLI arguments to serve command

  - Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
  - Flags allow users to customize KV cache and performance settings for continuous batching
  - Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
  - Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
  - All arguments use auto-inference defaults when not specified (backward compatible)

* Add thread lock for manager creation to avoid double manager

* change transformers dep

---------

Co-authored-by: remi-or <[email protected]>
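The core serialization bug can be sketched in isolation. The snippet below mimics the failure mode with a stdlib dataclass standing in for the Pydantic response model (the `ChatCompletion` class and field names here are illustrative, not the actual transformers response types): passing an already-serialized JSON string (what `model_dump_json()` returns) to a framework that encodes the body again yields a quoted string, while passing a plain dict (what `model_dump()` returns) yields a proper JSON object after the single encoding step.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ChatCompletion:
    """Stand-in for a Pydantic response model (illustrative only)."""
    id: str
    content: str


resp = ChatCompletion(id="req-1", content="hello")

# Buggy path: serialize to a JSON string first (like model_dump_json()),
# then the HTTP layer JSON-encodes the body a second time.
inner = json.dumps(asdict(resp))
double_encoded = json.dumps(inner)
# The client now decodes to a string containing JSON, not an object.
print(type(json.loads(double_encoded)))  # <class 'str'>

# Fixed path: hand the HTTP layer a plain dict (like model_dump())
# so exactly one JSON encoding happens.
single_encoded = json.dumps(asdict(resp))
print(type(json.loads(single_encoded)))  # <class 'dict'>
```

This is why the fix is a one-word change in the response handler: the outer layer already performs the JSON encoding, so the handler must supply structured data, not a pre-encoded string.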
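The "thread lock for manager creation" item addresses a classic lazy-initialization race: two concurrent requests can each see no manager and build one apiece. A minimal sketch of the usual remedy, double-checked locking, is below; the class and method names are hypothetical and only illustrate the pattern, not the actual serve-command implementation.

```python
import threading


class ServeCommand:
    """Hypothetical sketch: lazily create one continuous-batching manager."""

    def __init__(self):
        self._manager = None
        self._manager_lock = threading.Lock()

    def _create_manager(self):
        # Stand-in for constructing the real continuous-batching manager.
        return object()

    def get_manager(self):
        # Double-checked locking: cheap lock-free fast path once created,
        # lock held only around the first construction so concurrent
        # callers cannot each create their own manager.
        if self._manager is None:
            with self._manager_lock:
                if self._manager is None:
                    self._manager = self._create_manager()
        return self._manager
```

Without the inner re-check, a thread that waited on the lock while another thread created the manager would overwrite it with a second instance, which is exactly the "double manager" the commit guards against.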
Nathan Habib committed
a91232af09f59e2e1c96561901c92e01e238c355
Parent: 3a4a662
Committed by GitHub <[email protected]>
on 3/31/2026, 12:56:00 PM