Defined Term domain updated Mon Jun 01 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

LLM Inference

LLM inference is the run-time process of turning a prompt into generated text with an already-trained model — distinct from training. It is the umbrella concept for this wiki. Across the founding sources it decomposes into three layers:

Execution phases: prefill (process the whole prompt in parallel), then decode (generate tokens one at a time, autoregressively). Source: prefill-decode-kv-cache.
Token selection: at each decode step, convert logits to a probability distribution via softmax and pick a token via token-sampling. Source: logits-softmax-sampling-walkthrough.
Serving: run many users' requests efficiently on shared hardware via continuous-batching. Source: continuous-batching-serving.
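
The first two layers can be sketched together as a toy generation loop. This is a minimal illustration, not any particular engine's implementation: `toy_logits` is a hypothetical stand-in for a real model forward pass, and the names `softmax`, `sample_token`, and `generate` are assumptions made for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token id by sampling from softmax(logits / temperature)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def toy_logits(context, vocab_size=5):
    """Hypothetical stand-in for a model: favors (last token + 1)."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 3.0
    return logits

def generate(prompt, max_new_tokens, temperature=1.0, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    # Prefill: the whole prompt is processed at once to get the first logits.
    logits = toy_logits(context)
    for _ in range(max_new_tokens):
        # Decode: one token per step, conditioned on everything so far.
        token = sample_token(logits, temperature, rng)
        context.append(token)
        logits = toy_logits(context)
    return context
```

Lower temperatures concentrate probability on the highest-logit token, so a call such as `generate([0], 3, temperature=0.01)` follows the toy model's favored token at each step, while higher temperatures spread the choice across the vocabulary.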

The cost structure that drives everything

At decode time, inference is bound by memory traffic and attention rather than by raw arithmetic: decode is sequential, and each new token must attend to all prior tokens, so per-step work grows with context length. The kv-cache is the central optimization that makes decode cheap per step: it avoids recomputing the keys and values of past tokens. continuous-batching is the central optimization that keeps the GPU efficient across concurrent requests. Sampling sits on top of this machinery as the final per-step choice.
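
To make the per-step saving concrete, the sketch below compares decode with and without a key/value cache on a toy scalar attention. It is an illustration under simplifying assumptions: `project` is a hypothetical key/value projection, embeddings are single floats, and the count of projection calls stands in for recomputation cost.

```python
import math

def project(x):
    """Hypothetical key/value projection for a scalar token embedding."""
    return (0.5 * x, 2.0 * x)  # (key, value)

def attend(query, keys, values):
    """Scalar softmax attention of one query over all given positions."""
    scores = [query * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def decode_without_cache(embeddings):
    outputs, projections = [], 0
    for t in range(1, len(embeddings) + 1):
        # Recompute keys/values for every token seen so far at every step.
        kv = [project(x) for x in embeddings[:t]]
        projections += t
        keys, values = zip(*kv)
        outputs.append(attend(embeddings[t - 1], keys, values))
    return outputs, projections

def decode_with_cache(embeddings):
    outputs, projections = [], 0
    keys, values = [], []  # the KV cache: grows by one entry per step
    for x in embeddings:
        k, v = project(x)  # only the new token is projected
        projections += 1
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs, projections
```

Both functions return the same attention outputs, but for n tokens the uncached version performs n(n+1)/2 projections and the cached version performs n. The cache does not remove the attention over all prior positions: each step still reads every cached key and value, which is why the cache's memory footprint and read traffic grow with context length.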

Relation to the wiki

This page is the spine that the three founding sources hang from. See synthesis for how the sampling, execution, and serving layers fit into one pipeline, and for the open questions between them.
