Tech Article, updated Mon Jun 01 2026

Serving Multiple Users at Once: Continuous Batching

A code-first walk-through (MachineLearningMastery, intermediate level; assumes familiarity with prefill/decode and KV caching) of how an inference server keeps a GPU busy when many users hit it at once. It covers the serving layer of llm-inference. The concept page is continuous-batching.

What it covers

The static-batching problem. Requests are grouped into fixed batches and padded to the longest sequence, so short requests sit idle until the longest one in the batch finishes: “short requests in a wave idle until the wave’s longest is done.” The GPU spends those cycles on padding instead of real tokens.
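To make the idle time concrete, here is a minimal sketch that counts decode steps and wasted slot-steps under static batching. It is not the article's code: the function name `static_batch_cost`, the request lengths, and the batch size are all hypothetical, and it models only the decode phase, with one generated token per sequence per step.

```python
def static_batch_cost(decode_lengths, batch_size):
    """Return (total_steps, wasted_slot_steps) when requests run in fixed waves.

    decode_lengths: tokens each request needs to generate (hypothetical input).
    Each wave runs until its longest request finishes; shorter requests
    occupy a slot but produce nothing useful for the remaining steps.
    """
    total_steps = 0
    wasted = 0
    for start in range(0, len(decode_lengths), batch_size):
        wave = decode_lengths[start:start + batch_size]
        longest = max(wave)
        total_steps += longest                      # wave lasts as long as its longest member
        wasted += sum(longest - n for n in wave)    # idle slot-steps spent on padding
    return total_steps, wasted


lengths = [3, 10, 2, 9, 4, 1]  # illustrative values, not from the article
steps, wasted = static_batch_cost(lengths, batch_size=2)
print(steps, wasted)  # 23 17
```

In this toy run, 17 of the 46 slot-steps (23 steps × 2 slots) do no useful work, which is the waste the article attributes to padding.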
Continuous (in-flight) batching. Two mechanisms work together:
1. Dynamic scheduling — “the moment a sequence finishes it frees its slot, and the next queued prompt is admitted on the SAME step.”
2. Ragged batching — “all in-flight tokens are concatenated into a single unpadded row” with a block-diagonal attention mask that prevents cross-sequence attention.
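The two mechanisms above can be sketched in plain Python. This is an illustration under stated assumptions, not the article's Hugging Face implementation: `continuous_batch_steps` and `block_diagonal_causal_mask` are hypothetical names, the scheduler models only decode steps (one token per in-flight sequence per step), and the mask additionally assumes causal attention within each sequence, as in a decoder-only model.

```python
from collections import deque


def continuous_batch_steps(decode_lengths, slots):
    """Count decode steps when a freed slot is refilled before the next step."""
    queue = deque(decode_lengths)   # requests waiting for a slot
    active = []                     # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        # Dynamic scheduling: fill every free slot from the queue right away.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One forward pass: each in-flight sequence emits one token.
        active = [n - 1 for n in active]
        # Finished sequences release their slots immediately.
        active = [n for n in active if n > 0]
        steps += 1
    return steps


def block_diagonal_causal_mask(seq_lens):
    """Boolean mask for sequences concatenated into one unpadded row.

    mask[i][j] is True when token i may attend to token j: both tokens
    belong to the same sequence (block-diagonal) and j is not after i
    (causal, an assumption of this sketch).
    """
    seq_id = [s for s, n in enumerate(seq_lens) for _ in range(n)]
    total = len(seq_id)
    return [[seq_id[i] == seq_id[j] and j <= i for j in range(total)]
            for i in range(total)]


print(continuous_batch_steps([3, 10, 2, 9, 4, 1], slots=2))  # 15
for row in block_diagonal_causal_mask([2, 3]):
    print("".join("x" if allowed else "." for allowed in row))
```

For the mask, two sequences of lengths 2 and 3 give a 5×5 grid with two lower-triangular blocks on the diagonal and no entries between them, so no token can attend across sequences. The example lengths are illustrative and do not come from the article.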

Concrete example

The article provides working Python code on top of Hugging Face transformers that implements both approaches. The demo reports 9.54 s (continuous) vs. 61.80 s (static) on the same hardware, roughly 6.5× faster.

Takeaway

Careful attention-mask design lets multiple sequences pack into a single forward pass, so the GPU spends its cycles on real tokens rather than padding. This is the production counterpart to the single-stream kv-cache optimization. (Caveat: tutorial benchmark on one setup, not a rigorous evaluation.)
