Optimizers
Survey Papers
- Blog post: Optimization For Training Deep Models Part I
- Blog post about Adam, AdamW, and AMSGrad: 2018 - AdamW and Super-convergence is now the fastest way to train neural nets
First-Order Optimizers
Modern Deep Learning Optimizers
- Adagrad: Duchi et al 2011
- Adagrad-Norm: Ward et al 2018 - AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Proofs of convergence for Adagrad-Norm.
- Adam: Kingma & Ba 2014 (gives an O(1/sqrt(T)) theoretical convergence rate, but there is a flaw in the proof; see below)
- Theoretical Analysis of Adam, and its relation to other optimizers: Zou et al 2019 - A Sufficient Condition for Convergences of Adam and RMSProp
- Chen et al 2019 - On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization. Proves an O(log(T)/sqrt(T)) convergence rate on non-convex problems for a large class of Adam-type optimizers. Used, for example, in the Apollo optimizer proof.
- Ward et al 2020 - A Simple Convergence Proof of Adam and Adagrad. Shows that Adam converges with appropriate choices of hyperparameters.
- Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Discusses the effect of momentum on Adam, and why Adam doesn't need gradient clipping.
- Adaptive gradient optimizers can be bad (Adam, Adagrad, etc.): Wilson et al 2020 - The Marginal Value of Adaptive Gradient Methods in Machine Learning
- RMSProp: introduced in Hinton's Coursera lecture slides (slide 29)
- AdamW: Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization. Fixes weight decay in Adam by decoupling it from the adaptive update, since folding it into the gradient as L2 regularization does not behave as intended (sketched after this list). Applied to the Transformer in Yao 2020.
- RAdam: Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond. Rectified Adam (RAdam), a variant of Adam with a better estimate of the adaptive learning rate's variance early in training. It doesn't need a warmup period, and compares favorably to Adam with heuristic warmup across a wide range of tasks.
- AdaBelief: Zhuang et al 2021. Don't use. From the Apollo github: “Thus, we suspect that the improvements of AdaBelief over Adam or RAdam mainly come from the fine-tuning of epsilon.”
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. Actually a quasi-Newton method. Proof shows an O(1/sqrt(T)) convergence rate (same as Adam; compare to the Adam paper).
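To make the Adam vs. AdamW distinction above concrete, here is a minimal NumPy sketch of a single parameter update (function and hyperparameter names are mine, not from the papers). The point is that an L2 term folded into the gradient gets rescaled by Adam's adaptive denominator, whereas AdamW applies the decay directly to the weights.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step; 'L2 regularization' is folded into the gradient,
    so it gets divided by sqrt(v_hat) like everything else."""
    g = g + weight_decay * w                    # L2 term enters the gradient
    m = beta1 * m + (1 - beta1) * g             # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2          # second-moment EMA
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step; weight decay is decoupled from the adaptive update."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```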
Provably Linearly-Convergent Optimizers
Adam and related methods based on exponential moving averages (RMSProp, Adadelta, Nadam) can provably fail to converge, even on 1-dimensional convex problems (see Reddi et al 2018 - On the Convergence of Adam and Beyond; follow-up here: Ward et al 2020). The methods below attempt to improve this situation.
- Stochastic L-BFGS (SLBFGS): Moritz et al 2016 - A Linearly-Convergent Stochastic L-BFGS Algorithm. A competing method: Zhao et al 2017 (search for “stochastic lbfgs”)
- SAGA Defazio et al 2014
- AMSGrad: Reddi et al 2018 - On the Convergence of Adam and Beyond. Points out a flaw in the proof for Adam; the one-line fix is sketched after this list.
- JacSketch: Gower et al 2018 (talk)
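The AMSGrad fix mentioned above amounts to a one-line change to Adam: keep a running maximum of the second-moment estimate and normalize by that, so the effective per-coordinate step size can never grow. A minimal sketch (names are mine; the original paper omits bias correction, which is kept here only for symmetry with the Adam sketch above):

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_max, t, lr=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)        # the AMSGrad change: non-decreasing denominator
    m_hat = m / (1 - beta1**t)          # bias correction (not in the original paper)
    w = w - lr * m_hat / (np.sqrt(v_max) + eps)
    return w, m, v, v_max
```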
Variance Reduction Techniques
Distributed Optimizers
See also Distributed Training
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Distributed training on a single TPU Pod; reduces BERT training time from 3 days to 76 minutes. A rough per-layer sketch of the update is below.
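A rough per-layer sketch of the LAMB update, under my reading of the paper: form an Adam-style direction, add decoupled weight decay, then scale each layer's step by the ratio of the weight norm to the update norm (the "trust ratio"). Names and the handling of zero norms are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def lamb_layer_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=1e-2):
    """One LAMB step for a single layer's weight tensor (sketch)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    # Layer-wise trust ratio: large layers take proportionally larger steps.
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust_ratio * update
    return w, m, v
```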
Learned Optimizers
See also Meta-Learning
- Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent Learns a black-box optimizer (gradient-free optimizer). Can be applied to hyperparameter tuning.
Other Optimizers
Older Optimizers
For a history of SGD, etc see ML History - Optimization.
- Quasi-gradient method: Nurminskii 1973
- Stochastic quasi-gradient method, see Ermoliev & Gaivoronski 1984. The most general version of SGD (that Jeff knows about); works with constrained problems and biased gradient estimates. Reduces to SGD under normal circumstances.
- Quickprop: Fahlman's paper has a bunch of variants of SGD
- Rprop: Only uses the sign of the gradient when taking steps in SGD, which actually works! Paper: Riedmiller & Braun 1992 - Rprop - A Fast Adaptive Learning Algorithm. See also Generalized SignSGD in this paper: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD
- SGD with momentum (update rule sketched after this list)
- Nesterov's accelerated gradient method (also sketched below)
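Since the last two entries have no reference attached, here is a minimal sketch of both update rules: heavy-ball momentum, and the common "look-ahead gradient" form of Nesterov momentum (variable names are mine).

```python
def sgd_momentum_step(w, grad_fn, velocity, lr=0.01, momentum=0.9):
    """Heavy-ball momentum: accumulate a velocity, then step along it."""
    velocity = momentum * velocity - lr * grad_fn(w)
    return w + velocity, velocity

def nesterov_step(w, grad_fn, velocity, lr=0.01, momentum=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    lookahead = w + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return w + velocity, velocity
```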
Second-Order Optimizers
Second-order optimizers enjoy much faster local convergence rates than first-order optimizers: quadratic for Newton's method, superlinear for quasi-Newton methods.
- L-BFGS: Highly popular for training convex ML models such as logistic regression (see the comparison in Malouf 2002). A usage example follows this list.
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. A diagonal quasi-Newton method.
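As a usage example for L-BFGS on the kind of convex model mentioned above (L2-regularized logistic regression), SciPy's minimize with method='L-BFGS-B' is a standard entry point; the data below is random and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                   # toy features
y = (X @ rng.normal(size=5) + 0.1 * rng.normal(size=200) > 0).astype(float)

def loss_and_grad(w, lam=1e-2):
    """L2-regularized logistic loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) \
           + 0.5 * lam * w @ w
    grad = X.T @ (p - y) / len(y) + lam * w
    return loss, grad

# jac=True tells SciPy that loss_and_grad returns (loss, gradient).
res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
print(res.x, res.fun)
```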
Gradient-Free Optimizers
Also known as black-box optimizers. See also Hyperparameter Tuning.
- MERT and hypergraph MERT: Gradient-free, but not black-box. Used for training statistical machine translation systems. Uses the structure of the linear model to attempt to directly optimize BLEU (can in principle be used for any metric).
- Li & Lin 2007 - Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent. Directly optimizes 0/1 loss. Looks very similar to MERT, but for 0/1 loss; need to check the difference. A simplified sketch of the random-coordinate-descent idea follows this list.
- Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent Learns a black-box optimizer (gradient-free optimizer). Can be applied to hyperparameter tuning.
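A much-simplified sketch of the random-coordinate-descent idea for 0/1 loss, not the exact Li & Lin procedure: pick a random coordinate of the weight vector, try a small grid of candidate values for it, and keep whichever gives the fewest misclassifications (the paper performs an exact line search rather than using a grid).

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Fraction of examples the linear classifier sign(X @ w) gets wrong; y in {-1, +1}."""
    return np.mean(np.sign(X @ w) != y)

def random_coordinate_descent(X, y, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best = zero_one_loss(w, X, y)
    for _ in range(n_iters):
        j = rng.integers(X.shape[1])                        # random coordinate
        for cand in w[j] + np.linspace(-1.0, 1.0, 21):      # crude line search over a grid
            trial = w.copy()
            trial[j] = cand
            loss = zero_one_loss(trial, X, y)
            if loss < best:
                best, w = loss, trial
    return w, best
```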