====== Optimizers ======

===== Survey Papers =====
  * Introduction: [[https://arxiv.org/pdf/1609.04747.pdf|Ruder 2016 - An Overview of Gradient Descent Optimization Algorithms]] [[https://ruder.io/optimizing-gradient-descent/index.html|blog post]]
  * **Overviews**
    * [[https://arxiv.org/pdf/1606.04838.pdf|Bottou et al 2016 - Optimization Methods for Large-Scale Machine Learning]]
    * **[[https://arxiv.org/pdf/1906.06821|Sun et al 2019 - A Survey of Optimization Methods from a Machine Learning Perspective]]** Very good
    * [[https://arxiv.org/pdf/2211.15596|Kashyap 2022 - A survey of deep learning optimizers - first and second order methods]]
  * **Book Chapters**
    * [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Chapter 8: Training Deep Models]]
  * **Blog posts**
    * [[https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2|Optimization For Training Deep Models Part I]]
    * Blog post about Adam, AdamW, and AMSGrad: [[https://www.fast.ai/2018/07/02/adam-weight-decay/|2018 - AdamW and Super-convergence is now the fastest way to train neural nets]]
===== First-Order Optimizers =====

==== Modern Deep Learning Optimizers ====
  * Adagrad: [[https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf|Duchi et al 2011]]
     * Adagrad-Norm: [[https://www.jmlr.org/papers/volume21/18-352/18-352.pdf|Ward et al 2018 - AdaGrad stepsizes: Sharp convergence over nonconvex landscapes]]  Proofs of convergence for Adagrad-Norm
  * [[https://arxiv.org/pdf/1212.5701.pdf|Adadelta]]
  * [[https://arxiv.org/pdf/1412.6980.pdf|Adam]] (gives O(1/sqrt(T)) theoretical convergence rate, but there is a flaw in the proof, see below) 
     * Theoretical Analysis of Adam, and its relation to other optimizers: [[https://openaccess.thecvf.com/content_CVPR_2019/papers/Zou_A_Sufficient_Condition_for_Convergences_of_Adam_and_RMSProp_CVPR_2019_paper.pdf|Zou et al 2019 - A Sufficient Condition for Convergences of Adam and RMSProp]]
     * [[https://arxiv.org/pdf/1705.07774.pdf|Balles & Hennig 2017 - Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients]]
     * [[https://arxiv.org/pdf/1808.02941.pdf|Chen et al 2019 - On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization]] Proves a O(log(T)/sqrt(T)) convergence rate on non-convex problems for a large class of Adam-type optimizers.  Used, for example in the Apollo optimizer proof.
     * [[https://arxiv.org/pdf/2003.02395.pdf|Ward et al 2020 - A Simple Convergence Proof of Adam and Adagrad]] Shows that Adam converges with appropriate choices of hyperparameters
     * [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]] Discusses the effect of momentum on Adam, and why Adam doesn't need gradient clipping.
  * **Adaptive gradients optimizers can be bad (Adam, Adagrad, etc)**: [[https://arxiv.org/pdf/1705.08292.pdf|Wilson et al 2020 - The Marginal Value of Adaptive Gradient Methods in Machine Learning]]
  * [[http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf|RMSProp]] (slide 29)
  * Nadam [[https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ|Dozat 2016 - Incorporating Nesterov Momentum into Adam]]
  * AdamW: [[https://arxiv.org/pdf/1711.05101.pdf|Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization]] Fixes weight decay in Adam so that it correctly implements L2 regularization.  Applied to the Transformer in [[https://arxiv.org/pdf/2006.00719.pdf|Yao 2020]]
  * [[https://arxiv.org/pdf/1907.08610.pdf|Zhang et al 2019 - Lookahead Optimizer: k steps forward, 1 step back]]
  * RAdam: [[https://arxiv.org/pdf/1908.03265.pdf|Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond]]. Rectified Adam (RAdam), a variant of Adam, has a better estimate of the variance at the beginning.  It doesn't need a warmup time, and shows a consistent improvement over Adam for a wide range of tasks.  Compares favorably to Adam with heuristic warmup (i.e. it's better).
  * EAdam: [[https://arxiv.org/pdf/2011.02150.pdf|Yuan 2020 - EAdam Optimizer: How \eps Impact Adam]]
  * AdaBelief: [[https://arxiv.org/pdf/2010.07468.pdf|Zhuang et al - 2021]] Don't use. From the [[https://github.com/XuezheMax/apollo|Apollo github]]: "Thus, we suspect that the improvements of AdaBelief over Adam or RAdam mainly come from the fine-tuning of epsilon."
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] Actually a quasi-Newton method. Proof shows O(1\sqrt(T)) learning rate (same as Adam, compare to Adam paper).
  * [[https://arxiv.org/pdf/2006.00719.pdf|Yao et al 2020 - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning]]
  * Generalized SignSGD: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]] Doesn't assume Lipschitz gradients, which is violated in many deep learning models
  * **Lion**: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
  * **Muon**: [[https://kellerjordan.github.io/posts/muon/|Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks]]. In its update, Muon implicitly uses a spectral norm of the matrices in the network, rather than the "max-of-max" norm of Adam. From [[https://arxiv.org/pdf/2502.16982|Liu 2025]]: "Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam."
    * Background on norms: [[https://arxiv.org/pdf/2409.20325|Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology]]
    * Applied to larger scale LLM training: [[https://arxiv.org/pdf/2502.16982|Liu et al 2025 - Muon is Scalable for LLM Training]]

==== Provably Linearly-Convergent Optimizers ====
Adam, and related methods that use exponential moving averages such as RMSProp, Adadelta, and Nadam, can be demonstrated not to converge, even for 1-dimensional convex problems (see [[https://openreview.net/pdf?id=ryQu7f-RZ|Reddi et al 2018 - On the Convergence of Adam and Beyond]], follow-up here: [[https://arxiv.org/pdf/2003.02395.pdf|Ward et al 2020]]).  The methods below attempt to improve this situation.
  * Stochastic LBFGS (SLBFGS) [[http://proceedings.mlr.press/v51/moritz16.pdf|Moritz et al 2016 - A Linearly-Convergent Stochastic L-BFGS Algorithm]]. A competing method: [[https://arxiv.org/pdf/1704.00116.pdf|Zhao et al 2017]] (Search for "stochastic lbfgs")
  * SAGA [[https://arxiv.org/pdf/1407.0202.pdf|Defazio et al 2014]]
  * AMSGrad [[https://openreview.net/pdf?id=ryQu7f-RZ|Reddi et al 2018 - On the Convergence of Adam and Beyond]] Points out a flaw in the proof for Adam
  * JacSketch [[https://arxiv.org/pdf/1805.02632.pdf|Gower et al 2018]] [[https://www.youtube.com/watch?v=gjgEck0zU7w|Talk]]


==== Variance Reduction Techniques =====
Summary: [[https://arxiv.org/pdf/2010.00892.pdf|Gower et al 2020 - Variance-Reduced Methods for Machine Learning]]
  * [[https://arxiv.org/pdf/1812.04529.pdf|Aaron & Bottou 2018 - On the Ineffectiveness of Variance Reduced Optimization for Deep Learning]]

==== Distributed Optimizers =====
See also [[Distributed Training]]
  * LAMB optimizer: [[https://arxiv.org/pdf/1904.00962.pdf|You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes]] Distributed on a single TPU Pod.  Reduces training time from 3 days to 76 minutes on a TPU Pod.

==== Learned Optimizers =====
See also [[Meta-Learning]]
  * [[https://arxiv.org/pdf/1606.04474.pdf|Andrychowicz et al 2016 - Learning to Learn by Gradient Descent by Gradient Descent]]
  * [[http://proceedings.mlr.press/v70/chen17e/chen17e.pdf|Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent]] Learns a black-box optimizer (gradient-free optimizer).  Can be applied to hyperparameter tuning.
  * [[https://arxiv.org/pdf/2012.14905.pdf|Kirsch & Schmidhuber 2021 - Meta Learning Backpropagation And Improving It]]

==== Other Optimizers =====
  * [[https://arxiv.org/pdf/1611.00756.pdf|Carmon et al 2017 - Accelerated Methods for Non-Convex Optimization]] Duchi's paper

==== Older Optimizers ====
For a history of SGD, etc see [[ml:history_of_ml#optimization_prior_to_1980|ML History - Optimization]].
  * [[https://en.wikipedia.org/wiki/Stochastic_gradient_descent|SGD]] [[https://projecteuclid.org/euclid.aoms/1177729586|Robins & Monroe 1952]] Classic overview in [[https://leon.bottou.org/publications/pdf/mlss-2003.pdf|Bottou 2003]]
  * Quasi-gradient method [[https://link.springer.com/article/10.1007/BF01068677|Nurminskii - 1973]]
  * [[http://pure.iiasa.ac.at/id/eprint/1759/1/WP-81-002.pdf|Stochastic quasi-gradient method]], see [[https://web.archive.org/web/20171126210238/https://core.ac.uk/download/pdf/33893934.pdf|Ermoliev & Gaivoronski - 1984]]  The most general version of SGD (that Jeff knows about), works with constrained problems and biased gradient estimates.  Reduces to SGD under normal circumstances.
  * [[https://en.wikipedia.org/wiki/Quickprop|Quickprop]] Fahlman's paper has a bunch of variants to SGD
  * [[https://en.wikipedia.org/wiki/Rprop|Rprop]] Only uses the sign of the gradient when taking steps in SGD, which actually works! Paper: {{papers:rprop_paper.pdf|Riedmiller & Braun 1992 - Rprop - A Fast Adaptive Learning Algorithm}}. See also Generalized SignSGD in this paper: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]]
  * SGD with momentum
  * Nesterov's accelerated gradient method

===== Second-Order Optimizers =====
Second-order optimizers such as Newton's method or quasi-Newton methods enjoy a much, much faster convergence rate than first-order optimizers (a "quadratic" convergence rate).  See [[https://sites.math.washington.edu/~burke/crs/408/lectures/L10-Rates-of-conv-Newton.pdf|here]] and [[https://en.wikipedia.org/wiki/Newton%27s_method#Proof_of_quadratic_convergence_for_Newton's_iterative_method|here]].
  * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression.  (See comparison [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]])
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
  * [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]]

===== Gradient-Free Optimizers =====
Also known as black-box optimizers.  See also [[Hyperparameter Tuning]].
  * [[https://www.aclweb.org/anthology/P03-1021.pdf|MERT]] and [[https://www.aclweb.org/anthology/P09-1019.pdf|hypergraph MERT]] Gradient-free, but not black-box. Used for training [[nlp:statistical machine translation]] systems. Uses the structure of the linear model to attempt to directly optimize BLEU (can in principle be used for any metric).
  * [[https://web.njit.edu/~usman/courses/cs732_spring16/ijcnn07rcd.pdf|Li & Lin 2007 - Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent]] Directly optimizes 0/1 loss.  Looks very similar to MERT but for 0/1 loss, need to check the difference.
  * [[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf|Golovin et al 2017 - Google Vizier: A Service for Black-Box Optimization]]
  * [[http://proceedings.mlr.press/v70/chen17e/chen17e.pdf|Chen 2017 - Learning to Learn without Gradient Descent by Gradient Descent]] Learns a black-box optimizer (gradient-free optimizer).  Can be applied to hyperparameter tuning.

===== Related Pages =====
  * [[NN Training]]
  * [[Gradient Clipping]]
  * [[Optimization]]