LLM Inference
LLM inference is the run-time process of turning a prompt into generated text with an already-trained model — distinct from training. It is the umbrella concept for this wiki. Across the founding sources it decomposes into three layers:
- Execution phases: prefill (process the whole prompt in parallel), then decode (generate tokens one at a time, autoregressively); see prefill-decode-kv-cache.
- Token selection: at each decode step, convert logits to a probability distribution via softmax and pick a token via token-sampling; see logits-softmax-sampling-walkthrough.
- Serving: run many users' requests efficiently on shared hardware via continuous-batching; see continuous-batching-serving.
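The token-selection layer above can be sketched in a few lines. This is a minimal illustration, not the method of any particular library: the function names (softmax, sample_token, greedy_token) and the temperature parameter are assumptions introduced here, and real systems operate on tensors over a full vocabulary rather than Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution.

    `temperature` is an illustrative knob: values below 1 sharpen the
    distribution, values above 1 flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Pick the highest-logit token id (deterministic alternative to sampling)."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0]          # hypothetical logits for a 3-token vocabulary
    print(softmax(logits))             # probabilities that sum to 1
    print(greedy_token(logits))        # index of the largest logit
    print(sample_token(logits, temperature=0.8, rng=random.Random(0)))
```

In a decode loop, one of these selection functions would run once per step on the logits produced for the newest position, and the chosen token would be appended to the sequence before the next step.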
The cost structure that drives everything
At decode time, inference is limited mainly by memory traffic and attention cost rather than by arithmetic: decode is sequential, and each new token must attend to all prior tokens. The kv-cache is the central optimization that makes each decode step cheap: it stores the keys and values of past tokens so they are not recomputed, at the price of memory that grows with sequence length. Across concurrent requests, continuous-batching is the central optimization that keeps the GPU efficiently used. Sampling sits on top of this machinery as the final per-step choice.
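To make the saving from the kv-cache concrete, here is a toy single-head attention sketch that counts how many key/value projections are computed with and without a cache. Everything in it is an assumption for illustration: project_kv is a stand-in for learned projection matrices, tokens are tiny hand-written vectors, and the whole sequence is processed token by token (a real prefill would handle the prompt in parallel).

```python
import math


def project_kv(x, counter):
    """Toy stand-in for the learned key/value projections of one token.

    `counter` is a one-element list used to count how often this runs.
    """
    counter[0] += 1
    key = [2.0 * xi for xi in x]
    value = [xi + 1.0 for xi in x]
    return key, value


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # stabilize the softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w / z * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def decode_without_cache(tokens):
    """Recompute keys/values for every prior token at every step."""
    counter, outputs = [0], []
    for t in range(1, len(tokens) + 1):
        kv = [project_kv(x, counter) for x in tokens[:t]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, counter[0]


def decode_with_cache(tokens):
    """Project only the newest token and reuse cached keys/values."""
    counter, outputs = [0], []
    cached_keys, cached_values = [], []
    for x in tokens:
        k, v = project_kv(x, counter)
        cached_keys.append(k)
        cached_values.append(v)
        outputs.append(attend(x, cached_keys, cached_values))
    return outputs, counter[0]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    _, n_plain = decode_without_cache(tokens)
    _, n_cached = decode_with_cache(tokens)
    print(n_plain, n_cached)  # projection counts: quadratic vs. linear in length
```

Both functions produce the same attention outputs; the cached version does one projection per token instead of one per (step, prior token) pair. Note that attention itself still reads every cached key and value at each step, which is why decode stays sensitive to memory traffic even with the cache.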
Relation to the wiki
This page is the spine from which the three founding sources hang. See synthesis for how the sampling, execution, and serving layers fit into one pipeline, and for the open questions between them.