Optimizers
Survey Papers
- Overviews
- Book Chapters
- Blog posts
- Blog post about Adam, AdamW, and AMSGrad: 2018 - AdamW and Super-convergence is now the fastest way to train neural nets
First-Order Optimizers
Modern Deep Learning Optimizers
- Adagrad: Duchi et al 2011
- Adagrad-Norm: Ward et al 2018 - AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Proofs of convergence for Adagrad-Norm.
- Adam: Kingma & Ba 2015 - Adam: A Method for Stochastic Optimization (gives an O(1/sqrt(T)) theoretical convergence rate, but there is a flaw in the proof; see below)
- Theoretical analysis of Adam and its relation to other optimizers: Zou et al 2019 - A Sufficient Condition for Convergences of Adam and RMSProp
- Chen et al 2019 - On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization. Proves an O(log(T)/sqrt(T)) convergence rate on non-convex problems for a large class of Adam-type optimizers. Used, for example, in the Apollo optimizer proof.
- Défossez et al 2020 - A Simple Convergence Proof of Adam and Adagrad. Shows that Adam converges with appropriate choices of hyperparameters.
- Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Discusses the effect of momentum in Adam, and why Adam doesn't need gradient clipping.
- Adaptive gradient optimizers (Adam, Adagrad, etc.) can be bad: Wilson et al 2017 - The Marginal Value of Adaptive Gradient Methods in Machine Learning
- RMSProp (slide 29)
- AdamW: Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization. Decouples weight decay from Adam's gradient-based update, since adding an L2 penalty to the loss is not equivalent to weight decay for adaptive methods (see the AdamW sketch after this list). Applied to the Transformer in Yao 2020.
- RAdam: Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond. Rectified Adam (RAdam), a variant of Adam, has a better estimate of the adaptive learning rate's variance early in training, so it doesn't need a warmup period, and it shows a consistent improvement over Adam for a wide range of tasks. Compares favorably to Adam with heuristic warmup (i.e. it's better).
- AdaBelief: Zhuang et al 2021. Don't use. From the Apollo GitHub: “Thus, we suspect that the improvements of AdaBelief over Adam or RAdam mainly come from the fine-tuning of epsilon.”
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. Actually a quasi-Newton method. The proof shows an O(1/sqrt(T)) convergence rate (same as Adam; compare to the Adam paper).
- Generalized SignSGD: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Doesn't assume Lipschitz gradients, an assumption that is violated in many deep learning models.
- Muon: Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks. In its update, Muon implicitly uses the spectral norm of the matrices in the network, rather than the “max-of-max” norm of Adam (see the Muon sketch after this list). From Liu 2025: “Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam.”
- Background on norms: Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology
- Applied to larger scale LLM training: Liu et al 2025 - Muon is Scalable for LLM Training
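As a concrete illustration of the decoupling in AdamW, here is a minimal NumPy sketch of a single Adam step with the AdamW-style weight decay applied directly to the weights. The function name, hyperparameter defaults, and structure are illustrative, not taken from any reference implementation.

<code python>
# Minimal sketch: one Adam step, with AdamW's decoupled weight decay.
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One update of w given gradient g; with weight_decay=0 this is plain Adam."""
    m = beta1 * m + (1 - beta1) * g           # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g       # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # Adam with an L2 penalty would instead add weight_decay * w to g above,
    # which then gets rescaled by the adaptive 1/sqrt(v_hat) factor; AdamW
    # applies the decay directly to the weights, decoupled from that rescaling.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
</code>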
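And a rough sketch of the Muon-style update for one weight matrix: accumulate momentum, project the momentum matrix onto the nearest semi-orthogonal matrix (replace its singular values with ones), and step in that direction. The real optimizer approximates this with a Newton-Schulz iteration and has extra scaling and momentum details omitted here; the SVD, function name, and constants below are purely for illustration.

<code python>
# Illustrative sketch of a Muon-style update for a single 2-D weight matrix.
import numpy as np

def muon_step(W, g, M, lr=0.02, beta=0.95):
    """W: weight matrix, g: its gradient, M: momentum buffer (same shape)."""
    M = beta * M + g                              # momentum accumulation
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    O = U @ Vt                                    # nearest semi-orthogonal matrix to M
    return W - lr * O, M                          # step direction has unit spectral norm
</code>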
Provably Linearly-Convergent Optimizers
Adam, and related methods that use exponential moving averages such as RMSProp, Adadelta, and Nadam, can provably fail to converge, even on 1-dimensional convex problems (see Reddi et al 2018 - On the Convergence of Adam and Beyond; follow-up here: Défossez et al 2020). The methods below attempt to improve this situation.
- Stochastic L-BFGS (SLBFGS): Moritz et al 2016 - A Linearly-Convergent Stochastic L-BFGS Algorithm. A competing method: Zhao et al 2017 (search for “stochastic lbfgs”).
- SAGA: Defazio et al 2014
- AMSGrad: Reddi et al 2018 - On the Convergence of Adam and Beyond. Points out a flaw in the convergence proof for Adam and fixes it by keeping a running maximum of the second-moment estimate (see the sketch after this list).
- JacSketch: Gower et al 2018. Talk
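For reference, a minimal NumPy sketch of the AMSGrad modification mentioned above (function name and defaults are made up for illustration): the only change relative to Adam is that the denominator uses a running maximum of the second-moment estimate, so the effective per-coordinate step size can never grow, which is what restores the convergence guarantee Reddi et al showed Adam lacks.

<code python>
# Sketch of the AMSGrad change to Adam's second-moment handling (illustrative).
import numpy as np

def amsgrad_step(w, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)              # the AMSGrad fix: never let v shrink
    w = w - lr * m / (np.sqrt(v_max) + eps)   # Adam would use v here instead
    return w, m, v, v_max
</code>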
Variance Reduction Techniques
Distributed Optimizers
See also Distributed Training
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Large-batch training distributed across a single TPU Pod; reduces BERT training time from 3 days to 76 minutes.
Learned Optimizers
See also Meta-Learning
- Chen et al 2017 - Learning to Learn without Gradient Descent by Gradient Descent. Learns a black-box (gradient-free) optimizer. Can be applied to hyperparameter tuning.
Other Optimizers
Older Optimizers
For a history of SGD, etc., see ML History - Optimization.
- Quasi-gradient method: Nurminskii 1973
- Stochastic quasi-gradient method: see Ermoliev & Gaivoronski 1984. The most general version of SGD (that Jeff knows about); works with constrained problems and biased gradient estimates. Reduces to SGD under normal circumstances.
- Quickprop: Fahlman's paper describes a number of variants of SGD.
- Rprop: Uses only the sign of the gradient (not its magnitude) when taking steps, which actually works! See the sketch after this list. Paper: Riedmiller & Braun 1992 - Rprop - A Fast Adaptive Learning Algorithm. See also Generalized SignSGD in this paper: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD
- SGD with momentum
- Nesterov's accelerated gradient method
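A minimal sketch of an Rprop-style full-batch update (one common variant; the function name and constants are illustrative): only the sign of the gradient is used, and each parameter keeps its own step size, which grows while the gradient sign is stable and shrinks when it flips.

<code python>
# Illustrative Rprop-style step with per-parameter adaptive step sizes.
import numpy as np

def rprop_step(w, g, g_prev, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    sign_change = np.sign(g) * np.sign(g_prev)
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - step * np.sign(g)                 # the gradient's magnitude is ignored
    return w, step, g
</code>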
Second-Order Optimizers
Second-order optimizers enjoy much faster local convergence than first-order optimizers: Newton's method converges quadratically near the optimum, and quasi-Newton methods converge superlinearly. See here and here.
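A toy 1-D illustration of that quadratic rate (an assumed example, not from the linked pages): each Newton step roughly squares the distance to the optimum, so the number of correct digits about doubles per iteration.

<code python>
# Newton's method on f(x) = x - log(x), whose minimum is at x = 1.
def newton_minimize(grad, hess, x0, steps=6):
    x = x0
    for _ in range(steps):
        x -= grad(x) / hess(x)                       # Newton step: x - f'(x)/f''(x)
        print(f"x = {x:.15f}   error = {abs(x - 1.0):.2e}")
    return x

newton_minimize(grad=lambda x: 1 - 1 / x, hess=lambda x: 1 / x ** 2, x0=0.5)
# Errors shrink like 0.25, 0.0625, 0.0039, 1.5e-05, ... (roughly squared each step).
</code>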
- L-BFGS: Highly popular for training convex ML models such as logistic regression (see the comparison in Malouf 2002). A usage sketch follows this list.
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. A diagonal quasi-Newton method.
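As a usage sketch for the L-BFGS entry above: fitting a small ridge-regularized logistic regression with SciPy's L-BFGS-B implementation. The toy data and regularization strength are made up for illustration.

<code python>
# Fit logistic regression with L-BFGS-B (a typical convex use case for L-BFGS).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # toy features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # toy labels
lam = 1.0                                                 # ridge penalty keeps the problem well-posed

def nll_and_grad(w):
    """Regularized negative log-likelihood and its gradient."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + 0.5 * lam * w @ w
    grad = X.T @ (p - y) + lam * w
    return nll, grad

res = minimize(nll_and_grad, x0=np.zeros(3), jac=True, method="L-BFGS-B")
print(res.x)                                              # fitted weights
</code>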
Gradient-Free Optimizers
Also known as black-box optimizers. See also Hyperparameter Tuning.
- MERT and hypergraph MERT: Gradient-free, but not black-box. Used for training statistical machine translation systems. Uses the structure of the linear model to attempt to directly optimize BLEU (in principle it can be used for any metric).
- Li & Lin 2007 - Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent. Directly optimizes 0/1 loss. Looks very similar to MERT but for 0/1 loss; need to check the difference.
- Chen et al 2017 - Learning to Learn without Gradient Descent by Gradient Descent. Learns a black-box (gradient-free) optimizer. Can be applied to hyperparameter tuning.
Related Pages