How LLMs Choose Their Words: Logits, Softmax and Sampling
A practical, code-first walk-through (MachineLearningMastery, intermediate level) of how an LLM turns its raw output into an actual next token — the final step of llm-inference, the token-sampling stage.
What it covers
- Logits → probabilities. The model emits a raw score (a logit) per vocabulary token; softmax normalizes these into a probability distribution that sums to 1.
- Temperature. A scaling knob applied before softmax: the logits are divided by the temperature, so temperature < 1 sharpens the distribution toward the top tokens (more deterministic) and temperature > 1 flattens it (more random).
- Top-k sampling. Keep only the k highest-probability tokens, renormalize, and sample. k = 1 degenerates to greedy decoding.
- Top-p (nucleus) sampling. Keep the smallest set of tokens whose cumulative probability reaches p; the cutoff adapts to the model's confidence rather than using a fixed count.
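The softmax and temperature steps above can be sketched in a few lines of plain Python (the tutorial itself uses PyTorch). This is a minimal sketch, assuming the common convention of dividing the logits by the temperature before softmax; the six logits are hypothetical, not taken from the tutorial.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability;
    # softmax is shift-invariant, so the result is unchanged.
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 6-token vocabulary.
logits = [3.0, 2.0, 1.5, 1.0, 0.5, -1.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Each row sums to 1; as the temperature drops, the most likely token's share grows, and as it rises, the probability mass spreads toward the tail.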
Concrete example
Uses the prompt “Today’s weather is so ___” over a toy 6-token vocabulary, with PyTorch code for each strategy and plots of how the probability distribution reshapes under each.
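A minimal sketch of the top-k and top-p rules on a toy 6-token vocabulary, written in plain Python rather than the tutorial's PyTorch. The six candidate words and their logits are invented for illustration and are not the tutorial's values; `top_k_filter`, `top_p_filter` and `_renormalize` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Plain softmax (temperature 1), shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs, keep):
    """Zero every token not in `keep` and rescale the survivors to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Top-k: keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Top-p: keep the smallest most-probable set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

# Invented candidates and logits for "Today's weather is so ___".
vocab = ["nice", "hot", "cold", "bad", "humid", "purple"]
logits = [3.0, 2.0, 1.5, 1.0, 0.5, -1.0]
probs = softmax(logits)

for name, dist in [
    ("full", probs),
    ("top-k, k=2", top_k_filter(probs, 2)),
    ("top-p, p=0.9", top_p_filter(probs, 0.9)),
]:
    print(name, {w: round(q, 3) for w, q in zip(vocab, dist)})
```

With these made-up logits, p = 0.9 keeps four of the six words, because the cumulative mass first passes 0.9 at the fourth word. A more peaked distribution would let the same p keep fewer words, whereas k = 2 always keeps exactly two.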
Takeaway
“Sampling strategies shape the model’s next-token distribution.” The knobs trade consistency (low temperature, top-k = 1) against creativity/diversity (higher temperature, higher top-p). See token-sampling for the concept page. (Caveat: tutorial-level toy example, not a benchmark.)
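The consistency-versus-diversity trade-off can be made visible by chaining the steps into one sampler and counting how many distinct tokens come out over repeated draws. This is a minimal sketch in plain Python, assuming made-up logits and applying top-k before top-p (an ordering chosen here, not taken from the tutorial); `sample_next_token` is a hypothetical helper name.

```python
import math
import random

def sample_next_token(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Hypothetical helper: temperature -> softmax -> top-k -> top-p -> draw one index."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most probable first; top-k truncates this list.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]
    if top_p is not None:
        # Measure cumulative mass relative to what survived top-k.
        kept_mass = sum(probs[i] for i in keep)
        nucleus, cumulative = [], 0.0
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i] / kept_mass
            if cumulative >= top_p:
                break
        keep = nucleus
    # random.choices accepts unnormalized weights, so no explicit renormalization is needed.
    return rng.choices(keep, weights=[probs[i] for i in keep], k=1)[0]

logits = [3.0, 2.0, 1.5, 1.0, 0.5, -1.0]  # hypothetical 6-token logits
rng = random.Random(0)  # fixed seed so runs are repeatable

greedy = {sample_next_token(logits, rng, top_k=1) for _ in range(200)}
diverse = {sample_next_token(logits, rng, temperature=1.5) for _ in range(200)}
print("distinct tokens with top_k=1:", len(greedy))
print("distinct tokens with temperature=1.5:", len(diverse))
```

With top_k=1 every draw returns index 0, the arg-max, which is the greedy-decoding case; at temperature 1.5 the same number of draws spreads over several different tokens.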