Defined Term mechanism updated Mon Jun 01 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Continuous Batching

Continuous batching (a.k.a. in-flight batching) is the serving-layer optimization in llm-inference that keeps a GPU busy when many users issue requests of differing lengths (continuous-batching-serving).

The problem it fixes

Static batching groups requests into a fixed batch padded to the longest sequence; short requests idle until the longest in the batch finishes, wasting GPU cycles on padding tokens.
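To make the waste concrete, here is a minimal sketch that counts how many token slots in a padded batch hold padding rather than real tokens. The function name `padding_waste` and the example lengths are hypothetical, chosen only for illustration; they do not come from the source's demo.

```python
def padding_waste(lengths):
    """Fraction of token slots in a padded batch that hold padding."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    longest = max(lengths)                # every row is padded to this length
    total_slots = longest * len(lengths)  # slots the GPU actually processes
    useful = sum(lengths)                 # slots that carry real tokens
    return (total_slots - useful) / total_slots


# Hypothetical batch: three short requests and one long one.
lengths = [12, 30, 45, 400]
print(f"{padding_waste(lengths):.1%} of slots are padding")  # 69.6% of slots are padding
```

A single long request dominates the batch shape, so the more the lengths vary, the larger the wasted fraction.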

The mechanism

Two parts working together:

Dynamic scheduling: the instant a sequence finishes, it frees its slot and the next queued prompt is admitted on the same step, so the batch is continuously refilled.
Ragged batching: all in-flight tokens are concatenated into a single unpadded row, with a block-diagonal attention mask ensuring that tokens attend only within their own sequence.
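The two parts above can be sketched as a toy simulation. This is an illustration under simplifying assumptions, not the source's implementation: each request is reduced to a count of decode steps, a freed slot is refilled before the next step, and the mask is a plain list of booleans. All function names are hypothetical. A real decoder would additionally apply causal masking inside each block, which is omitted here.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest member finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        # Admit queued requests into any free slots before stepping.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each in-flight sequence emits one token; finished ones leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps


def block_diagonal_mask(lengths):
    """mask[i][j] is True when tokens i and j belong to the same sequence."""
    # owner[k] is the index of the sequence that token k of the ragged row belongs to.
    owner = [seq for seq, n in enumerate(lengths) for _ in range(n)]
    return [[a == b for b in owner] for a in owner]


lengths = [2, 8, 3, 8]  # hypothetical decode lengths, two slots available
print(static_steps(lengths, 2), continuous_steps(lengths, 2))  # 16 13
```

In this toy case the short requests no longer wait for the long one sharing their batch, so the same work completes in 13 steps instead of 16. Calling `block_diagonal_mask([2, 1])` gives a 3×3 mask in which the first two tokens see each other but not the third.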

Result

Eliminating padding and idle slots yielded ≈6.5× higher throughput in the source’s demo (9.54 s vs. 61.80 s on the same hardware). Continuous batching is the multi-request counterpart to the single-stream kv-cache optimization, and it manages one KV cache per in-flight sequence. (Caveat: this is a single tutorial benchmark, not a rigorous evaluation.)
