Continuous Batching
Continuous batching (a.k.a. in-flight batching) is a serving-layer optimization for llm-inference that keeps the GPU busy when many users issue requests of differing lengths (source: continuous-batching-serving).
The problem it fixes
Static batching groups requests into a fixed batch padded to the longest sequence; short requests idle until the longest in the batch finishes, wasting GPU cycles on padding tokens.
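To make the padding cost concrete, here is a minimal sketch that computes the share of token slots holding padding when a static batch is padded to its longest sequence. The helper name `padding_waste` and the request lengths are hypothetical, chosen only for illustration.

```python
def padding_waste(lengths):
    """Fraction of token slots in a padded static batch that hold padding."""
    padded_slots = len(lengths) * max(lengths)  # every row is padded to the longest
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_slots

# Hypothetical batch: three short requests and one long one.
lengths = [12, 20, 35, 400]
waste = padding_waste(lengths)
print(f"{waste:.1%} of the batch is padding")  # prints "70.8% of the batch is padding"
```

The more the lengths in a batch differ, the larger this fraction becomes, which is why mixed-length traffic suffers most under static batching.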
The mechanism
Two parts working together:
- Dynamic scheduling: the instant a sequence finishes, it frees its slot and the next queued prompt is admitted on the same step, so the batch is continuously refilled.
- Ragged batching: all in-flight tokens are concatenated into a single unpadded row, with a block-diagonal attention mask ensuring that tokens attend only within their own sequence.
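The two parts can be sketched in plain Python. This is a toy model under stated assumptions, not the scheduler of any particular serving framework: every function name is hypothetical, each request is reduced to the number of decode steps it still needs, freed slots are refilled before the next decode step, and the mask additionally assumes causal (decoder-style) attention inside each block.

```python
from collections import deque

def simulate_static(decode_steps, slots):
    """Decode steps used when each fixed batch runs until its longest member finishes."""
    return sum(max(decode_steps[i:i + slots])
               for i in range(0, len(decode_steps), slots))

def simulate_continuous(decode_steps, slots):
    """Decode steps used when freed slots are refilled from the queue."""
    queue = deque(decode_steps)
    active = []  # remaining decode steps for each in-flight sequence
    steps = 0
    while queue or active:
        # Admit queued requests into any free slots before running the step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Run one decode step for every in-flight sequence; drop finished ones.
        active = [r - 1 for r in active if r > 1]
    return steps

def block_diagonal_causal_mask(lengths):
    """mask[i][j] is True when flat token i may attend to flat token j.

    All sequences are laid out back to back in one unpadded row. A token sees
    only positions of its own sequence (block-diagonal) that are not later
    than itself (causal, an assumption added here).
    """
    # seq_id[k] is the index of the sequence that flat position k belongs to.
    seq_id = [s for s, n in enumerate(lengths) for _ in range(n)]
    total = len(seq_id)
    return [[seq_id[i] == seq_id[j] and j <= i for j in range(total)]
            for i in range(total)]

# Hypothetical workload: decode steps needed per request, two slots available.
workload = [2, 8, 3, 3, 2, 6]
static_steps = simulate_static(workload, slots=2)
continuous_steps = simulate_continuous(workload, slots=2)
print(static_steps, continuous_steps)  # prints "17 14"

# Two sequences of 2 and 3 tokens packed into one row of 5 tokens.
mask = block_diagonal_causal_mask([2, 3])
```

In the toy workload the continuous schedule needs fewer steps because the slot freed by the 2-step request is reused while the 8-step request is still running. In `mask`, row 2 (the first token of the second sequence) is `True` only at column 2, so it never sees the first sequence's tokens.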
Result
Eliminating padding and idle slots gave ≈6.5× higher throughput in the source’s demo (9.54 s vs. 61.80 s on the same hardware). Continuous batching is the multi-request counterpart to the single-stream kv-cache optimization, and it manages one KV cache per in-flight sequence. (Caveat: this is a single tutorial benchmark, not a rigorous evaluation.)
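The ≈6.5× figure follows directly from the two timings. The short check below assumes both runs processed the same workload, so the throughput ratio equals the ratio of wall-clock times; the variable names are illustrative.

```python
baseline_s = 61.80    # slower run reported in the source's demo
continuous_s = 9.54   # faster run, with continuous batching
speedup = baseline_s / continuous_s
print(f"speedup ≈ {speedup:.2f}x")  # prints "speedup ≈ 6.48x"
```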