Token Sampling
Token sampling is the final per-step choice in the decode phase of llm-inference: given the model’s raw scores for every vocabulary token, decide which token to emit next (see logits-softmax-sampling-walkthrough).
The pipeline
- Logits — one raw real-valued score per vocabulary token.
- Softmax — normalizes logits into a probability distribution summing to 1.
- Temperature — scales logits before softmax (each logit is divided by the temperature T): T < 1 sharpens the distribution toward the top tokens (more deterministic), T > 1 flattens it (more random).
- Truncation strategy — restrict the candidate set before sampling:
  - Top-k: keep the k most probable tokens (k = 1 ⇒ greedy decoding).
  - Top-p (nucleus): keep the smallest set whose cumulative probability reaches p; the cutoff adapts to the model’s confidence rather than a fixed count.
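The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any particular library's implementation: the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the toy four-token logits are made up for the example, and real inference stacks typically do this on tensors rather than lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(candidates, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]             # toy 4-token vocabulary (assumed values)
probs = softmax(logits, temperature=0.7)
greedy = sample(top_k_filter(probs, k=1))  # k = 1 always yields the argmax
nucleus = top_p_filter(probs, p=0.9)       # candidate set adapts to the distribution
token = sample(nucleus)
```

With `k=1` the candidate set holds a single token, so the draw is deterministic. With top-p the number of surviving candidates depends on how peaked `probs` is, which is the adaptive behaviour described in the list.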
The trade-off
The knobs interpolate between consistency (low temperature, top-k = 1 / greedy) and creativity/diversity (higher temperature, higher top-p). There is no single “correct” setting — it is task-dependent.
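One way to see the trade-off is to watch how temperature reshapes a fixed set of logits. The sketch below is illustrative only: the logits are toy values, and Shannon entropy is used here as an assumed, convenient proxy for "diversity", not something the pipeline itself computes.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, -1.0]  # toy values (assumed)
results = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    results[t] = (max(probs), entropy(probs))
    print(f"T={t}: top-token prob={max(probs):.2f}, entropy={entropy(probs):.2f}")
```

As the temperature rises, the top token's probability falls and the entropy grows, which is the move from consistency toward diversity.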
Where it sits
Sampling is cheap and runs after the heavy attention compute of each decode step; it does not interact with the kv-cache or continuous-batching optimizations, which concern how the logits are produced efficiently, not which token is then chosen.
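The separation can be sketched as a toy decode loop. Everything here is hypothetical: `forward_step` is a stand-in that returns made-up logits in place of a real model (where attention, the KV cache, and batching would live), and `choose_token` uses greedy selection only to keep the example short.

```python
import random

VOCAB_SIZE = 5  # toy vocabulary size (assumed)

def forward_step(token_ids):
    """Hypothetical stand-in for the model's forward pass.

    In a real system this is the expensive part; here it returns toy logits
    seeded by the sequence length so the example is reproducible.
    """
    rng = random.Random(len(token_ids))
    return [rng.uniform(-2.0, 2.0) for _ in range(VOCAB_SIZE)]

def choose_token(logits):
    """Sampling step (greedy for brevity). It sees only the logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

def decode(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward_step(ids)        # how the logits are produced
        ids.append(choose_token(logits))  # which token is then chosen
    return ids

output = decode([1, 2, 3], max_new_tokens=4)
```

Because `choose_token` takes only the logits as input, it could be swapped for a temperature, top-k, or top-p sampler without touching `forward_step`.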