Optimizers
Survey Papers
- Overviews
- Book Chapters
- Blog posts
- Blog post about Adam, AdamW, and AMSGrad: 2018 - AdamW and Super-convergence is now the fastest way to train neural nets
First-Order Optimizers
Modern Deep Learning Optimizers
- Adagrad: Duchi et al 2011
- Adagrad-Norm: Ward et al 2018 - AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Proofs of convergence for Adagrad-Norm.
- Adam: Kingma & Ba 2015 - Adam: A Method for Stochastic Optimization (gives an O(1/sqrt(T)) theoretical convergence rate, but there is a flaw in the proof; see below)
- Theoretical analysis of Adam and its relation to other optimizers: Zou et al 2019 - A Sufficient Condition for Convergences of Adam and RMSProp
- Chen et al 2019 - On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization. Proves an O(log(T)/sqrt(T)) convergence rate on non-convex problems for a large class of Adam-type optimizers. Used, for example, in the Apollo optimizer proof.
- Défossez et al 2020 - A Simple Convergence Proof of Adam and Adagrad. Shows that Adam converges with appropriate choices of hyperparameters.
- Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Discusses the effect of momentum in Adam, and why Adam doesn't need gradient clipping.
- Adaptive gradient optimizers (Adam, Adagrad, etc.) can be bad: Wilson et al 2017 - The Marginal Value of Adaptive Gradient Methods in Machine Learning
- RMSProp (slide 29)
- AdamW: Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization. Decouples weight decay from Adam's gradient-based update, since adding an L2 penalty to the loss is not equivalent to weight decay for adaptive methods (see the AdamW sketch after this list). Applied to the Transformer in Yao 2020.
- RAdam: Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond. Rectified Adam (RAdam), a variant of Adam, has a better estimate of the adaptive learning rate's variance early in training, so it doesn't need a warmup period, and it shows a consistent improvement over Adam for a wide range of tasks. Compares favorably to Adam with heuristic warmup (i.e. it's better).
- AdaBelief: Zhuang et al 2021. Don't use. From the Apollo GitHub: “Thus, we suspect that the improvements of AdaBelief over Adam or RAdam mainly come from the fine-tuning of epsilon.”
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. Actually a quasi-Newton method. The proof shows an O(1/sqrt(T)) convergence rate (same as Adam; compare to the Adam paper).
- Generalized SignSGD: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Doesn't assume Lipschitz gradients, an assumption that is violated in many deep learning models.
- Muon: Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks. In its update, Muon implicitly uses the spectral norm of the matrices in the network, rather than the “max-of-max” norm of Adam (see the Muon sketch after this list). From Liu 2025: “Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam.”
- Background on norms: Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology
- Applied to larger scale LLM training: Liu et al 2025 - Muon is Scalable for LLM Training
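As a concrete illustration of the decoupling in AdamW, here is a minimal NumPy sketch of a single Adam step with the AdamW-style weight decay applied directly to the weights. The function name, hyperparameter defaults, and structure are illustrative, not taken from any reference implementation.

<code python>
# Minimal sketch: one Adam step, with AdamW's decoupled weight decay.
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One update of w given gradient g; with weight_decay=0 this is plain Adam."""
    m = beta1 * m + (1 - beta1) * g           # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g       # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # Adam with an L2 penalty would instead add weight_decay * w to g above,
    # which then gets rescaled by the adaptive 1/sqrt(v_hat) factor; AdamW
    # applies the decay directly to the weights, decoupled from that rescaling.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
</code>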
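And a rough sketch of the Muon-style update for one weight matrix: accumulate momentum, project the momentum matrix onto the nearest semi-orthogonal matrix (replace its singular values with ones), and step in that direction. The real optimizer approximates this with a Newton-Schulz iteration and has extra scaling and momentum details omitted here; the SVD, function name, and constants below are purely for illustration.

<code python>
# Illustrative sketch of a Muon-style update for a single 2-D weight matrix.
import numpy as np

def muon_step(W, g, M, lr=0.02, beta=0.95):
    """W: weight matrix, g: its gradient, M: momentum buffer (same shape)."""
    M = beta * M + g                              # momentum accumulation
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    O = U @ Vt                                    # nearest semi-orthogonal matrix to M
    return W - lr * O, M                          # step direction has unit spectral norm
</code>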
Provably Linearly-Convergent Optimizers
Adam, and related methods that use exponential moving averages such as RMSProp, Adadelta, and Nadam, can provably fail to converge, even on 1-dimensional convex problems (see Reddi et al 2018 - On the Convergence of Adam and Beyond; follow-up here: Défossez et al 2020). The methods below attempt to improve this situation.
- Stochastic L-BFGS (SLBFGS): Moritz et al 2016 - A Linearly-Convergent Stochastic L-BFGS Algorithm. A competing method: Zhao et al 2017 (search for “stochastic lbfgs”).
- SAGA: Defazio et al 2014
- AMSGrad: Reddi et al 2018 - On the Convergence of Adam and Beyond. Points out a flaw in the convergence proof for Adam and fixes it by keeping a running maximum of the second-moment estimate (see the sketch after this list).
- JacSketch: Gower et al 2018. Talk
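For reference, a minimal NumPy sketch of the AMSGrad modification mentioned above (function name and defaults are made up for illustration): the only change relative to Adam is that the denominator uses a running maximum of the second-moment estimate, so the effective per-coordinate step size can never grow, which is what restores the convergence guarantee Reddi et al showed Adam lacks.

<code python>
# Sketch of the AMSGrad change to Adam's second-moment handling (illustrative).
import numpy as np

def amsgrad_step(w, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)              # the AMSGrad fix: never let v shrink
    w = w - lr * m / (np.sqrt(v_max) + eps)   # Adam would use v here instead
    return w, m, v, v_max
</code>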
Variance Reduction Techniques
Distributed Optimizers
See also Distributed Training
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Large-batch training distributed across a single TPU Pod; reduces BERT training time from 3 days to 76 minutes.
Learned Optimizers
See also Meta-Learning
- Chen et al 2017 - Learning to Learn without Gradient Descent by Gradient Descent. Learns a black-box (gradient-free) optimizer. Can be applied to hyperparameter tuning.
Other Optimizers
Older Optimizers
For a history of SGD, etc., see ML History - Optimization.
- Quasi-gradient method: Nurminskii 1973
- Stochastic quasi-gradient method: see Ermoliev & Gaivoronski 1984. The most general version of SGD (that Jeff knows about); works with constrained problems and biased gradient estimates. Reduces to SGD under normal circumstances.
- Quickprop: Fahlman's paper describes a number of variants of SGD.
- Rprop: Uses only the sign of the gradient (not its magnitude) when taking steps, which actually works! See the sketch after this list. Paper: Riedmiller & Braun 1992 - Rprop - A Fast Adaptive Learning Algorithm. See also Generalized SignSGD in this paper: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD
- SGD with momentum
- Nesterov's accelerated gradient method
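A minimal sketch of an Rprop-style full-batch update (one common variant; the function name and constants are illustrative): only the sign of the gradient is used, and each parameter keeps its own step size, which grows while the gradient sign is stable and shrinks when it flips.

<code python>
# Illustrative Rprop-style step with per-parameter adaptive step sizes.
import numpy as np

def rprop_step(w, g, g_prev, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    sign_change = np.sign(g) * np.sign(g_prev)
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - step * np.sign(g)                 # the gradient's magnitude is ignored
    return w, step, g
</code>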
Second-Order Optimizers
Second-order optimizers enjoy much faster local convergence than first-order optimizers: Newton's method converges quadratically near the optimum, and quasi-Newton methods converge superlinearly. See here and here.
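A toy 1-D illustration of that quadratic rate (an assumed example, not from the linked pages): each Newton step roughly squares the distance to the optimum, so the number of correct digits about doubles per iteration.

<code python>
# Newton's method on f(x) = x - log(x), whose minimum is at x = 1.
def newton_minimize(grad, hess, x0, steps=6):
    x = x0
    for _ in range(steps):
        x -= grad(x) / hess(x)                       # Newton step: x - f'(x)/f''(x)
        print(f"x = {x:.15f}   error = {abs(x - 1.0):.2e}")
    return x

newton_minimize(grad=lambda x: 1 - 1 / x, hess=lambda x: 1 / x ** 2, x0=0.5)
# Errors shrink like 0.25, 0.0625, 0.0039, 1.5e-05, ... (roughly squared each step).
</code>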
- L-BFGS: Highly popular for training convex ML models such as logistic regression (see the comparison in Malouf 2002). A usage sketch follows this list.
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. A diagonal quasi-Newton method.
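As a usage sketch for the L-BFGS entry above: fitting a small ridge-regularized logistic regression with SciPy's L-BFGS-B implementation. The toy data and regularization strength are made up for illustration.

<code python>
# Fit logistic regression with L-BFGS-B (a typical convex use case for L-BFGS).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # toy features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # toy labels
lam = 1.0                                                 # ridge penalty keeps the problem well-posed

def nll_and_grad(w):
    """Regularized negative log-likelihood and its gradient."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + 0.5 * lam * w @ w
    grad = X.T @ (p - y) + lam * w
    return nll, grad

res = minimize(nll_and_grad, x0=np.zeros(3), jac=True, method="L-BFGS-B")
print(res.x)                                              # fitted weights
</code>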
Gradient-Free Optimizers
Also known as black-box optimizers. See also Hyperparameter Tuning.
- MERT and hypergraph MERT: Gradient-free, but not black-box. Used for training statistical machine translation systems. Uses the structure of the linear model to attempt to directly optimize BLEU (in principle it can be used for any metric).
- Li & Lin 2007 - Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent. Directly optimizes 0/1 loss. Looks very similar to MERT but for 0/1 loss; need to check the difference.
- Chen et al 2017 - Learning to Learn without Gradient Descent by Gradient Descent. Learns a black-box (gradient-free) optimizer. Can be applied to hyperparameter tuning.
Related Pages