Stochastic gradient descent (SGD) & Adam
SGD is a variant of gradient-descent that “replaces the actual gradient … by an estimate … calculated from a randomly selected subset of the data.” It is the workhorse of deep-learning training: each step uses a noisy gradient and the convergence rate is lower, but the payoff is “faster iterations.” Its adaptive descendants, Adam above all, are among the most widely used optimizers in machine learning. Source: Wikipedia.
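As a minimal sketch of the subset-based gradient estimate described above, the following fits a least-squares linear model with minibatch SGD in NumPy. The function name `minibatch_sgd`, the hyperparameters (`lr`, `batch_size`, `epochs`) and the synthetic noise-free data are assumptions chosen for illustration, not something taken from the source.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=16, epochs=200, seed=0):
    """Minimise mean squared error of X @ w against y with minibatch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the MSE on this batch only: an estimate of the
            # full-data gradient, computed from a random subset.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Hypothetical toy problem: noise-free targets from a known weight vector.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w_hat = minibatch_sgd(X, y)
```

Each update touches only `batch_size` rows rather than all 256, which is where the cheaper iterations come from. Because the toy targets contain no noise, every minibatch gradient vanishes at `true_w`, so the iterate settles there; with noisy data it would instead hover around the minimiser unless the learning rate is decayed.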
The lineage
- Momentum — accumulates past gradients (“a particle traveling through parameter space … incurs acceleration from the gradient”), damping oscillation.
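- AdaGrad (2011) → RMSProp (2012) — per-parameter adaptive learning rates. AdaGrad scales each step by the root of the accumulated sum of all past squared gradients, so its effective rate can only shrink; RMSProp replaces that sum with a running average of recent squared gradients.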
- Adam (2014) — combines momentum + per-parameter adaptation (moving averages of gradients and squared gradients). “Approaches to optimization in 2023 are dominated by Adam-derived optimizers” — TensorFlow/PyTorch “largely only include Adam-derived optimizers.”
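The update rules in this lineage can be sketched side by side. This is one common formulation of each, not a reference implementation: the function names, the `state` dictionary, the hyperparameter names and defaults (`lr`, `beta`, `rho`, `beta1`, `beta2`, `eps`), and the ill-conditioned test quadratic are all assumptions made for illustration.

```python
import numpy as np

def momentum_step(w, g, state, lr=0.01, beta=0.9):
    # Velocity accumulates past gradients (heavy-ball form).
    state["v"] = beta * state.get("v", 0.0) + g
    return w - lr * state["v"]

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    # Sum of ALL past squared gradients: the effective rate only shrinks.
    state["G"] = state.get("G", 0.0) + g**2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    # Exponential moving average of recent squared gradients.
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g**2
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and squared gradient (v),
    # each bias-corrected for its zero initialisation.
    t = state.get("t", 0) + 1
    state["t"] = t
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g**2
    m_hat = state["m"] / (1 - beta1**t)
    v_hat = state["v"] / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Hypothetical test problem: a quadratic with curvatures 1 and 100.
scales = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(scales * w**2)

def grad(w):
    return scales * w

def run(step_fn, steps=500, **hyper):
    w, state = np.array([1.0, 1.0]), {}
    for _ in range(steps):
        w = step_fn(w, grad(w), state, **hyper)
    return loss(w)

initial_loss = loss(np.array([1.0, 1.0]))
losses = {
    "momentum": run(momentum_step, lr=0.01),
    "adagrad": run(adagrad_step, lr=0.1),
    "rmsprop": run(rmsprop_step, lr=0.01),
    "adam": run(adam_step, lr=0.01),
}
```

On the very first step, bias correction makes the Adam update approximately `lr * sign(g)` for every parameter regardless of gradient scale, which is the per-parameter adaptation the list describes. The learning rates above are picked by hand for this toy problem and should not be read as recommended defaults.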
Why it matters here
This is the optimizer family the rest of the AI stack actually runs on: it trains the very models
covered in ../llm-providers-wiki and ../llm-inference-wiki elsewhere in the hub. It also reframes
exploration-vs-exploitation: SGD’s minibatch gradient noise acts as cheap, implicit exploration that
can help the iterate escape sharp minima, whereas population-based methods explore by maintaining
many candidate solutions at once. SGD is still a local method and still needs gradients
(cf. the black-box metaheuristics and model-based bayesian-optimization).
Related
gradient-descent · convex-optimization · bayesian-optimization · metaheuristic-optimization · exploration-vs-exploitation