Defined Term · mechanism · updated Tue Jun 09 2026 (UTC)

Stochastic gradient descent (SGD) & Adam

SGD is a variant of gradient-descent that “replaces the actual gradient … by an estimate … calculated from a randomly selected subset of the data.” Each step uses a cheap but noisy gradient estimate, trading exactness for “faster iterations.” It is the workhorse of deep-learning training, and its adaptive descendants (Adam and relatives) are the default optimizers in mainstream deep-learning libraries. Source: Wikipedia.
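To make the "estimate from a randomly selected subset" concrete, here is a minimal sketch of mini-batch SGD on a least-squares problem. The function name `minibatch_sgd`, the synthetic data, and the hyperparameters (`lr`, `batch_size`, `epochs`) are illustrative assumptions, not taken from the source.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Fit least-squares weights with mini-batch SGD (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the mean squared error on this batch only:
            # an estimate of the full-data gradient, not the exact value.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Synthetic, noise-free regression problem (assumed purely for illustration).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w_hat = minibatch_sgd(X, y)
print(w_hat)
```

Each update touches only 16 of the 256 rows, so it costs a fraction of a full-gradient step; the price is that any single update points only approximately downhill.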

The lineage

Momentum: accumulates past gradients (“a particle traveling through parameter space … incurs acceleration from the gradient”), damping oscillation.
AdaGrad (2011) → RMSProp (2012): per-parameter adaptive learning rates. AdaGrad divides each step by the root of the accumulated sum of all past squared gradients, so its effective rate only shrinks; RMSProp replaces that sum with an exponentially decaying running average of recent squared gradients.
Adam (2014): combines momentum with per-parameter adaptation (moving averages of gradients and squared gradients). “Approaches to optimization in 2023 are dominated by Adam-derived optimizers”; TensorFlow and PyTorch “largely only include Adam-derived optimizers,” alongside predecessors such as RMSProp and classic SGD.
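The update rules in this lineage can be sketched side by side. This is a minimal NumPy sketch under stated assumptions: the function names (`momentum_step`, `adagrad_step`, `rmsprop_step`, `adam_step`), the `state` dictionary, and the default hyperparameters are hypothetical choices for illustration, and the momentum form shown is one common convention among several.

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    # Velocity is a decaying accumulation of past gradients.
    v = beta * state.get("v", 0.0) + grad
    state["v"] = v
    return w - lr * v

def adagrad_step(w, grad, state, lr=0.1, eps=1e-8):
    # Sum of ALL past squared gradients: the effective rate only shrinks.
    g2 = state.get("g2", 0.0) + grad**2
    state["g2"] = g2
    return w - lr * grad / (np.sqrt(g2) + eps)

def rmsprop_step(w, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    # Exponentially decaying average of RECENT squared gradients.
    s = rho * state.get("s", 0.0) + (1 - rho) * grad**2
    state["s"] = s
    return w - lr * grad / (np.sqrt(s) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and squared gradient (v),
    # each bias-corrected because both start at zero.
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * grad**2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Ill-conditioned quadratic (assumed for illustration): curvature 1 vs 100.
scales = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(scales * w**2)

def grad_fn(w):
    return scales * w

w0 = np.array([1.0, 1.0])
first = adam_step(w0, grad_fn(w0), {}, lr=0.05)

w, state = w0.copy(), {}
for _ in range(500):
    w = adam_step(w, grad_fn(w), state, lr=0.05)
final_loss = loss(w)
print(first, final_loss)
```

On the first Adam step the bias-corrected ratio reduces to roughly the sign of the gradient, so both coordinates move by about `lr` even though their gradients differ by a factor of 100. That is the per-parameter adaptation the list describes.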

Why it matters here

This is the optimizer family the rest of the AI stack actually runs on: the bridge to ../llm-providers-wiki and ../llm-inference-wiki, since it trains the very models covered elsewhere in the hub. It also reframes exploration-vs-exploitation: SGD’s gradient noise acts as cheap, implicit exploration that can help escape sharp minima, a different answer than maintaining a population of candidates. It is still a local method and still needs gradients (cf. the black-box metaheuristics and model-based bayesian-optimization).
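The "gradient noise" mentioned above can be seen directly by comparing mini-batch gradients with the full-data gradient at one fixed point. This is a small sketch on assumed synthetic data; the helper `mse_grad` and all sizes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 256, 3, 16
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # the point at which gradients are evaluated

def mse_grad(Xb, yb, w):
    """Gradient of the mean squared error over the rows given."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_grad(X, y, w)

# Split the shuffled data into equal-size batches and take one gradient each.
order = rng.permutation(n)
batch_grads = np.array([
    mse_grad(X[order[i:i + batch_size]], y[order[i:i + batch_size]], w)
    for i in range(0, n, batch_size)
])

mean_of_batches = batch_grads.mean(axis=0)  # matches the full gradient
noise_std = batch_grads.std(axis=0)         # spread of individual batches
print(full_grad, mean_of_batches, noise_std)
```

Averaged over a full pass the batch gradients recover the exact gradient, but each individual batch deviates from it. That per-step scatter is the implicit perturbation SGD gets without any explicit exploration mechanism.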

gradient-descent · convex-optimization · bayesian-optimization · metaheuristic-optimization · exploration-vs-exploitation
