Serving Multiple Users at Once: Continuous Batching
A code-first walk-through (MachineLearningMastery, intermediate; assumes prefill/decode and KV caching) of how an inference server keeps a GPU busy when many users hit it at once — the serving layer of llm-inference. The concept page is continuous-batching.
What it covers
- The static-batching problem. Requests are grouped into fixed batches (waves) and padded to the longest sequence, so “short requests in a wave idle until the wave’s longest is done.” The GPU spends cycles on padding instead of real tokens.
- Continuous (in-flight) batching. Two mechanisms work together:
  - Dynamic scheduling: “the moment a sequence finishes it frees its slot, and the next queued prompt is admitted on the SAME step.”
  - Ragged batching: “all in-flight tokens are concatenated into a single unpadded row” with a block-diagonal attention mask that prevents cross-sequence attention.
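The scheduling difference can be illustrated with a toy step counter. This is a minimal sketch, not the tutorial's code: it assumes every request needs a known number of decode steps, ignores prefill cost, and counts one forward pass per step. The function names and the example lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps when requests run in fixed waves (hypothetical helper).

    Each wave lasts as long as its longest request, so shorter
    requests in the wave occupy a slot while doing no useful work.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Steps when a freed slot is refilled immediately (hypothetical helper).

    Assumes every length is a positive integer.
    """
    queue = deque(lengths)  # requests waiting for a slot
    slots = []              # remaining decode steps per in-flight request
    steps = 0
    while queue or slots:
        # Admit queued requests into any free slots before the next step,
        # so no slot sits empty while work is waiting.
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1
        # Every in-flight request produces one token; finished ones leave.
        slots = [remaining - 1 for remaining in slots if remaining - 1 > 0]
    return steps


# Made-up workload: a mix of short and long requests, two slots.
lengths = [2, 10, 3, 9, 2, 8, 1, 7]
print(static_batching_steps(lengths, batch_size=2))      # 34
print(continuous_batching_steps(lengths, batch_size=2))  # 22
```

The step counts only show where the idle time goes. They are not a prediction of wall-clock speedup, which also depends on prefill cost and per-step overhead.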
Concrete example
Working Python code built on Hugging Face transformers implements both approaches. The demo reports 9.54 s (continuous) vs. 61.80 s (static) on the same hardware, about 6.5× faster.
Takeaway
Careful attention-mask design lets multiple sequences pack into a single forward pass, so the GPU spends its cycles on real tokens rather than padding. This is the production counterpart to the single-stream kv-cache optimization. (Caveat: tutorial benchmark on one setup, not a rigorous evaluation.)
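To make the mask idea concrete, here is a minimal pure-Python sketch of a block-diagonal mask for sequences packed into one row. It is an illustration under assumptions, not the tutorial's implementation: it covers only the no-KV-cache case, represents the mask as nested lists of booleans rather than a tensor, and adds the usual causal (lower-triangular) constraint inside each block. The names `block_diagonal_causal_mask` and `render` are hypothetical.

```python
def block_diagonal_causal_mask(seq_lens):
    """Build a T x T boolean mask for packed sequences, T = sum(seq_lens).

    mask[i][j] is True when packed token i may attend to packed token j:
    both tokens belong to the same sequence, and j is not after i.
    """
    # seq_ids[k] records which sequence the k-th packed token came from.
    seq_ids = [s for s, n in enumerate(seq_lens) for _ in range(n)]
    total = len(seq_ids)
    return [
        [seq_ids[i] == seq_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


def render(mask):
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


# Two sequences of length 2 and 3 packed into one unpadded row of 5 tokens.
print(render(block_diagonal_causal_mask([2, 3])))
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

Each block on the diagonal is one sequence's own causal triangle; everything off the blocks is blocked, which is what prevents cross-sequence attention.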