Optimizers
Survey Papers
- Blog post: Optimization For Training Deep Models Part I
- Blog post about Adam, AdamW, and AMSGrad: 2018 - AdamW and Super-convergence is now the fastest way to train neural nets
First-Order Optimizers
Modern Deep Learning Optimizers
- Adagrad: Duchi et al 2011
- Adagrad-Norm: Ward et al 2018 - AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Proofs of convergence for Adagrad-Norm.
- Adam: Kingma & Ba 2014 (gives an O(1/sqrt(T)) theoretical convergence rate, but there is a flaw in the proof; see below)
- Theoretical Analysis of Adam, and its relation to other optimizers: Zou et al 2019 - A Sufficient Condition for Convergences of Adam and RMSProp
- Chen et al 2019 - On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization. Proves an O(log(T)/sqrt(T)) convergence rate on non-convex problems for a large class of Adam-type optimizers. Used, for example, in the Apollo optimizer proof.
- Ward et al 2020 - A Simple Convergence Proof of Adam and Adagrad. Shows that Adam converges with appropriate choices of hyperparameters.
- Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD. Discusses the effect of momentum on Adam, and why Adam doesn't need gradient clipping.
- Adaptive gradient optimizers can be bad (Adam, Adagrad, etc.): Wilson et al 2020 - The Marginal Value of Adaptive Gradient Methods in Machine Learning
- RMSProp: introduced in Hinton's Coursera lecture slides (slide 29)
- AdamW: Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization. Fixes weight decay in Adam by decoupling it from the adaptive update, since folding it into the gradient as L2 regularization does not behave as intended (sketched after this list). Applied to the Transformer in Yao 2020.
- RAdam: Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond. Rectified Adam (RAdam), a variant of Adam with a better estimate of the adaptive learning rate's variance early in training. It doesn't need a warmup period, and compares favorably to Adam with heuristic warmup across a wide range of tasks.
- AdaBelief: Zhuang et al 2021. Don't use. From the Apollo github: “Thus, we suspect that the improvements of AdaBelief over Adam or RAdam mainly come from the fine-tuning of epsilon.”
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. Actually a quasi-Newton method. Proof shows an O(1/sqrt(T)) convergence rate (same as Adam; compare to the Adam paper).
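To make the Adam vs. AdamW distinction above concrete, here is a minimal NumPy sketch of a single parameter update (function and hyperparameter names are mine, not from the papers). The point is that an L2 term folded into the gradient gets rescaled by Adam's adaptive denominator, whereas AdamW applies the decay directly to the weights.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step; 'L2 regularization' is folded into the gradient,
    so it gets divided by sqrt(v_hat) like everything else."""
    g = g + weight_decay * w                    # L2 term enters the gradient
    m = beta1 * m + (1 - beta1) * g             # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2          # second-moment EMA
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step; weight decay is decoupled from the adaptive update."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```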
Provably Linearly-Convergent Optimizers
Adam and related methods based on exponential moving averages (RMSProp, Adadelta, Nadam) can provably fail to converge, even on 1-dimensional convex problems (see Reddi et al 2018 - On the Convergence of Adam and Beyond; follow-up here: Ward et al 2020). The methods below attempt to improve this situation.
- Stochastic L-BFGS (SLBFGS): Moritz et al 2016 - A Linearly-Convergent Stochastic L-BFGS Algorithm. A competing method: Zhao et al 2017 (search for “stochastic lbfgs”)
- SAGA Defazio et al 2014
- AMSGrad: Reddi et al 2018 - On the Convergence of Adam and Beyond. Points out a flaw in the proof for Adam; the one-line fix is sketched after this list.
- JacSketch: Gower et al 2018 (talk)
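The AMSGrad fix mentioned above amounts to a one-line change to Adam: keep a running maximum of the second-moment estimate and normalize by that, so the effective per-coordinate step size can never grow. A minimal sketch (names are mine; the original paper omits bias correction, which is kept here only for symmetry with the Adam sketch above):

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_max, t, lr=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)        # the AMSGrad change: non-decreasing denominator
    m_hat = m / (1 - beta1**t)          # bias correction (not in the original paper)
    w = w - lr * m_hat / (np.sqrt(v_max) + eps)
    return w, m, v, v_max
```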
Variance Reduction Techniques
Distributed Optimizers
See also Distributed Training
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Distributed training on a single TPU Pod; reduces BERT training time from 3 days to 76 minutes. A rough per-layer sketch of the update is below.
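A rough per-layer sketch of the LAMB update, under my reading of the paper: form an Adam-style direction, add decoupled weight decay, then scale each layer's step by the ratio of the weight norm to the update norm (the "trust ratio"). Names and the handling of zero norms are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def lamb_layer_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=1e-2):
    """One LAMB step for a single layer's weight tensor (sketch)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    # Layer-wise trust ratio: large layers take proportionally larger steps.
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust_ratio * update
    return w, m, v
```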
Learned Optimizers
See also Meta-Learning
- Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent Learns a black-box optimizer (gradient-free optimizer). Can be applied to hyperparameter tuning.
Other Optimizers
Older Optimizers
For a history of SGD, etc see ML History - Optimization.
- Quasi-gradient method: Nurminskii 1973
- Stochastic quasi-gradient method, see Ermoliev & Gaivoronski 1984. The most general version of SGD (that Jeff knows about); works with constrained problems and biased gradient estimates. Reduces to SGD under normal circumstances.
- Quickprop: Fahlman's paper has a bunch of variants of SGD
- Rprop: Only uses the sign of the gradient when taking steps in SGD, which actually works! Paper: Riedmiller & Braun 1992 - Rprop - A Fast Adaptive Learning Algorithm. See also Generalized SignSGD in this paper: Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD
- SGD with momentum (update rule sketched after this list)
- Nesterov's accelerated gradient method (also sketched below)
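Since the last two entries have no reference attached, here is a minimal sketch of both update rules: heavy-ball momentum, and the common "look-ahead gradient" form of Nesterov momentum (variable names are mine).

```python
def sgd_momentum_step(w, grad_fn, velocity, lr=0.01, momentum=0.9):
    """Heavy-ball momentum: accumulate a velocity, then step along it."""
    velocity = momentum * velocity - lr * grad_fn(w)
    return w + velocity, velocity

def nesterov_step(w, grad_fn, velocity, lr=0.01, momentum=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    lookahead = w + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return w + velocity, velocity
```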
Second-Order Optimizers
Second-order optimizers enjoy much faster local convergence rates than first-order optimizers: quadratic for Newton's method, superlinear for quasi-Newton methods.
- L-BFGS: Highly popular for training convex ML models such as logistic regression (see the comparison in Malouf 2002). A usage example follows this list.
- Apollo: Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. A diagonal quasi-Newton method.
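As a usage example for L-BFGS on the kind of convex model mentioned above (L2-regularized logistic regression), SciPy's minimize with method='L-BFGS-B' is a standard entry point; the data below is random and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                   # toy features
y = (X @ rng.normal(size=5) + 0.1 * rng.normal(size=200) > 0).astype(float)

def loss_and_grad(w, lam=1e-2):
    """L2-regularized logistic loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) \
           + 0.5 * lam * w @ w
    grad = X.T @ (p - y) / len(y) + lam * w
    return loss, grad

# jac=True tells SciPy that loss_and_grad returns (loss, gradient).
res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
print(res.x, res.fun)
```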
Gradient-Free Optimizers
Also known as black-box optimizers. See also Hyperparameter Tuning.
- MERT and hypergraph MERT: Gradient-free, but not black-box. Used for training statistical machine translation systems. Uses the structure of the linear model to attempt to directly optimize BLEU (can in principle be used for any metric).
- Li & Lin 2007 - Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent. Directly optimizes 0/1 loss. Looks very similar to MERT, but for 0/1 loss; need to check the difference. A simplified sketch of the random-coordinate-descent idea follows this list.
- Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent Learns a black-box optimizer (gradient-free optimizer). Can be applied to hyperparameter tuning.
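A much-simplified sketch of the random-coordinate-descent idea for 0/1 loss, not the exact Li & Lin procedure: pick a random coordinate of the weight vector, try a small grid of candidate values for it, and keep whichever gives the fewest misclassifications (the paper performs an exact line search rather than using a grid).

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Fraction of examples the linear classifier sign(X @ w) gets wrong; y in {-1, +1}."""
    return np.mean(np.sign(X @ w) != y)

def random_coordinate_descent(X, y, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best = zero_one_loss(w, X, y)
    for _ in range(n_iters):
        j = rng.integers(X.shape[1])                        # random coordinate
        for cand in w[j] + np.linspace(-1.0, 1.0, 21):      # crude line search over a grid
            trial = w.copy()
            trial[j] = cand
            loss = zero_one_loss(trial, X, y)
            if loss < best:
                best, w = loss, trial
    return w, best
```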