ml:optimizers

===== Survey Papers =====
  * Introduction: [[https://arxiv.org/pdf/1609.04747.pdf|Ruder 2016 - An Overview of Gradient Descent Optimization Algorithms]] ([[https://ruder.io/optimizing-gradient-descent/index.html|blog post]])
  * **Overviews**
    * [[https://arxiv.org/pdf/1606.04838.pdf|Bottou et al 2016 - Optimization Methods for Large-Scale Machine Learning]]
    * **[[https://arxiv.org/pdf/1906.06821|Sun et al 2019 - A Survey of Optimization Methods from a Machine Learning Perspective]]** Very good
    * [[https://arxiv.org/pdf/2211.15596|Kashyap 2022 - A Survey of Deep Learning Optimizers - First and Second Order Methods]]
  * **Book Chapters**
    * [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Chapter 8: Training Deep Models]]
  * **Blog posts**
    * [[https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2|Optimization for Training Deep Models Part I]]
    * On Adam, AdamW, and AMSGrad: [[https://www.fast.ai/2018/07/02/adam-weight-decay/|2018 - AdamW and Super-convergence is now the fastest way to train neural nets]]
===== First-Order Optimizers =====
  
  * Nadam: [[https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ|Dozat 2016 - Incorporating Nesterov Momentum into Adam]]
  * AdamW: [[https://arxiv.org/pdf/1711.05101.pdf|Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization]] Decouples weight decay from Adam's adaptive gradient update; with Adam, naive L2 regularization is not equivalent to weight decay, and the decoupled form works as intended.  Applied to the Transformer in [[https://arxiv.org/pdf/2006.00719.pdf|Yao 2020]]
  * [[https://arxiv.org/pdf/1907.08610.pdf|Zhang et al 2019 - Lookahead Optimizer: k steps forward, 1 step back]]
  * RAdam: [[https://arxiv.org/pdf/1908.03265.pdf|Liu et al 2020 - On the Variance of the Adaptive Learning Rate and Beyond]] Rectified Adam (RAdam) corrects the variance of the adaptive learning rate early in training, so it needs no warmup period and shows a consistent improvement over Adam, including Adam with heuristic warmup, across a wide range of tasks.
  * EAdam: [[https://arxiv.org/pdf/2011.02150.pdf|Yuan 2020 - EAdam Optimizer: How ε Impact Adam]]
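To make the decoupled-weight-decay distinction concrete, here is a minimal NumPy sketch of a single AdamW step (function name and hyperparameter defaults are illustrative, not taken from the papers above). In Adam with L2 regularization, the decay term is added to the gradient and therefore gets rescaled by the adaptive denominator; AdamW instead applies it directly to the weights:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step (sketch). Weight decay is applied directly to the
    weights, decoupled from the adaptive (moment-based) part of the update."""
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    # Adam-with-L2 would instead have folded weight_decay * theta into g above.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```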
  * [[https://arxiv.org/pdf/2006.00719.pdf|Yao et al 2020 - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning]]
  * Generalized SignSGD: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]] Does not assume Lipschitz gradients, an assumption violated by many deep learning models
  * **Lion**: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
  * **Muon**: [[https://kellerjordan.github.io/posts/muon/|Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks]]. In its update, Muon implicitly uses the spectral norm of the matrices in the network, rather than the "max-of-max" norm of Adam. From [[https://arxiv.org/pdf/2502.16982|Liu 2025]]: "Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam."
    * Background on norms: [[https://arxiv.org/pdf/2409.20325|Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology]]
    * Applied to larger-scale LLM training: [[https://arxiv.org/pdf/2502.16982|Liu et al 2025 - Muon is Scalable for LLM Training]]
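The spectral-norm point can be sketched briefly: Muon replaces the momentum matrix with (approximately) the nearest semi-orthogonal matrix, giving every singular direction equal magnitude, which is what steepest descent under a spectral-norm constraint produces. Muon approximates this with a Newton-Schulz iteration; the SVD below is a simplified stand-in, and the function name is illustrative:

```python
import numpy as np

def orthogonalized_update(G):
    """Replace a gradient/momentum matrix G by U V^T from its reduced SVD,
    i.e. the nearest semi-orthogonal matrix (all singular values set to 1).
    Muon approximates this cheaply with a Newton-Schulz iteration; the exact
    SVD here is only for clarity."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt
```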
  
==== Provably Linearly-Convergent Optimizers ====
  * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression (see the comparison in [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]])
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
  * [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]]
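As a rough illustration of the Sophia-style update, here is a minimal sketch (function name and constants are illustrative): an EMA of the gradient is preconditioned by a diagonal Hessian estimate, and each coordinate of the preconditioned step is clipped so that unreliable curvature estimates cannot produce huge steps.

```python
import numpy as np

def sophia_style_step(theta, m, h, lr=1e-4, rho=0.04, eps=1e-12):
    """One Sophia-style step (sketch after Liu et al 2023).
    m: EMA of gradients; h: diagonal Hessian estimate (elementwise).
    The preconditioned step is clipped coordinatewise to [-1, 1]."""
    update = np.clip(m / np.maximum(rho * h, eps), -1.0, 1.0)
    return theta - lr * update
```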
  
===== Gradient-Free Optimizers =====
ml:optimizers · Last modified: 2023/06/15 07:36 (external edit)
