Defined Term mechanism updated 2026-06-01 (UTC)

Token Sampling

Token sampling is the final per-step choice in the decode phase of llm-inference: given the model’s raw scores for every vocabulary token, decide which token to emit next (see logits-softmax-sampling-walkthrough).

The pipeline

  1. Logits: one raw real-valued score per vocabulary token.
  2. Temperature: divides the logits by T before softmax; T < 1 sharpens the distribution toward the top tokens (more deterministic), T > 1 flattens it (more random).
  3. Softmax: normalizes the scaled logits into a probability distribution summing to 1.
  4. Truncation strategy: restrict the candidate set before sampling:
    • Top-k: keep the k most probable tokens (k = 1 ⇒ greedy decoding).
    • Top-p (nucleus): keep the smallest set whose cumulative probability reaches p; the cutoff adapts to the model’s confidence rather than a fixed count.
  5. Draw: renormalize the surviving probabilities and sample one token from them.
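
The pipeline above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular inference engine's implementation: the names softmax, candidate_ids, and sample_token and the toy four-token logits are assumptions made for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use top_k=1 for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_ids(probs, top_k=None, top_p=None):
    """Return the token ids that survive top-k and/or top-p truncation."""
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix whose mass reaches p
                break
        ranked = kept
    return ranked


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from the truncated distribution."""
    probs = softmax(logits, temperature)
    ids = candidate_ids(probs, top_k, top_p)
    # random.choices accepts unnormalized weights, so renormalization is implicit.
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical toy vocabulary of 4 tokens
print(sample_token(logits, top_k=1))                     # greedy: only the top token survives
print(sample_token(logits, temperature=0.7, top_p=0.8))  # nucleus sampling
```

When both top_k and top_p are given, this sketch applies top-k first and measures the nucleus mass against the original (not renormalized) probabilities; real implementations differ in that ordering, so treat it as one possible choice.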

The trade-off

The knobs interpolate between consistency (low temperature, small k or p; k = 1 is greedy) and creativity/diversity (higher temperature, larger k or p). There is no single “correct” setting; the right choice depends on the task.
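
One way to see the trade-off is to watch how temperature changes the spread of the distribution. The sketch below is illustrative only: it measures spread with Shannon entropy over the same kind of toy logits, and the helper names and values are assumptions for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical toy vocabulary of 4 tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]} entropy={entropy(probs):.3f}")
```

Lower temperatures concentrate probability on the top token and entropy falls toward 0; higher temperatures move the distribution toward uniform, where entropy approaches log of the vocabulary size.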

Where it sits

Sampling is cheap and runs after the heavy forward-pass compute (attention and the rest of the model) of each decode step. It does not interact with the kv-cache or continuous-batching optimizations, which concern how the logits are produced efficiently, not which token is then chosen.
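
The separation can be sketched as a decode loop in which the chooser sees only the logits. Everything here is hypothetical: toy_forward stands in for a real model's forward pass (where the kv-cache and batching would live), and greedy stands in for any sampling policy.

```python
def toy_forward(tokens, vocab_size=5):
    """Hypothetical stand-in for the model: returns one logit per vocabulary token."""
    # Toy rule: strongly favour the token id that follows the last one.
    favoured = (tokens[-1] + 1) % vocab_size
    return [3.0 if i == favoured else 0.0 for i in range(vocab_size)]


def greedy(logits):
    """Simplest chooser: the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def decode(prompt, steps, forward, choose):
    """Run `steps` decode steps; `choose` never touches model state."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = forward(tokens)        # expensive part: produces the logits
        tokens.append(choose(logits))   # cheap part: picks one token from them
    return tokens


print(decode([0], 3, toy_forward, greedy))
```

Because choose takes only a list of logits, swapping greedy for a temperature or top-p sampler changes nothing about how forward is computed.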