
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, for both inference and training.


[serving] Fix continuous batching JSON response serialization (#45057)

* Fix continuous batching JSON response serialization

Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.
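The double-encoding bug described above can be reproduced with the standard library alone. In this sketch, `payload` stands in for the dict that Pydantic's `model_dump()` returns, while `model_dump_json()` corresponds to serializing it to a string first; serializing that string a second time is what produced the escaped-string response body:

```python
import json

# Stand-in for the response model's data (what model_dump() returns).
payload = {"id": "cmpl-1", "choices": [{"text": "hello"}]}

# What model_dump_json() effectively does: serialize to a JSON string.
as_json_string = json.dumps(payload)

# Bug: serializing that string again yields a quoted, escaped string,
# so clients received "\"{...}\"" instead of a JSON object.
double_encoded = json.dumps(as_json_string)

# Fix: pass the plain dict (model_dump()) and serialize exactly once.
correct = json.dumps(payload)

# One decode of the buggy body is still a string, not an object.
assert isinstance(json.loads(double_encoded), str)
# One decode of the fixed body is the expected object.
assert json.loads(correct) == payload
```

The fix in the handler is simply to hand the framework a dict (`model_dump()`) and let it serialize once, rather than pre-serializing with `model_dump_json()`.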

* add example script eval-job

* fix script

* Add test for continuous batching non-streaming JSON response

Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.

* fix ci

* Update eval script to use official transformers repo main branch

Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.

* add kernels and flash attn 2

* Add continuous batching configuration CLI arguments to serve command

- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)
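A minimal sketch of how such flags could be declared with `argparse`; the flag names mirror the commit, but the actual wiring in the transformers `serve` CLI may differ. `None` defaults stand in for "auto-infer when not specified", which keeps omitted flags backward compatible:

```python
import argparse

# Hypothetical sketch of the continuous-batching (CB) serve flags.
parser = argparse.ArgumentParser(prog="transformers serve")
parser.add_argument("--cb-block-size", type=int, default=None,
                    help="KV-cache block size (None = auto-infer)")
parser.add_argument("--cb-num-blocks", type=int, default=None,
                    help="Number of KV-cache blocks (None = auto-infer)")
parser.add_argument("--cb-max-batch-tokens", type=int, default=None,
                    help="Max tokens per batch (None = auto-infer)")
parser.add_argument("--cb-max-memory-percent", type=float, default=None,
                    help="Fraction of device memory to use (None = auto-infer)")
parser.add_argument("--cb-use-cuda-graph", action="store_true",
                    help="Enable CUDA graph capture for decoding")

# Only the flags the user passes are overridden; the rest stay None.
args = parser.parse_args(["--cb-block-size", "128"])
print(args.cb_block_size, args.cb_num_blocks)  # 128 None
```

Dashes in flag names become underscores on the parsed namespace, so `--cb-block-size` is read back as `args.cb_block_size`.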

* Add thread lock around manager creation to avoid creating two managers
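The race this guards against is the classic lazy-initialization one: two threads both observe that the manager does not exist yet and each create one. A minimal double-checked-locking sketch (the manager object and function name are hypothetical, not the actual transformers internals):

```python
import threading

_manager = None
_manager_lock = threading.Lock()

def get_manager():
    """Return the shared batch manager, creating it at most once.

    Without the lock, two threads could both see _manager is None
    and each construct a manager (the "double manager" bug).
    """
    global _manager
    if _manager is None:              # fast path: no lock once created
        with _manager_lock:
            if _manager is None:      # re-check under the lock
                _manager = object()   # stand-in for the real manager
    return _manager

assert get_manager() is get_manager()  # all callers share one instance
```

The second `is None` check inside the lock is essential: a thread that lost the race to acquire the lock must not create a second instance.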

* change transformers dep

---------

Co-authored-by: remi-or <[email protected]>
Nathan Habib committed
a91232af09f59e2e1c96561901c92e01e238c355
Parent: 3a4a662
Committed by GitHub <[email protected]> on 3/31/2026, 12:56:00 PM