--- ml:optimizers [2023/07/07 20:52] – [Modern Deep Learning Optimizers] jmflanig
+++ ml:optimizers [2025/03/26 20:02] (current) – [Second-Order Optimizers] jmflanig
@@ Line 3: / Line 3: @@
 ===== Survey Papers =====
   * Introduction: [[https://arxiv.org/pdf/1609.04747.pdf|Ruder 2016 - An Overview of Gradient Descent Optimization Algorithms]] [[https://ruder.io/optimizing-gradient-descent/index.html|blog post]]
-  * [[https://arxiv.org/pdf/1606.04838.pdf|Bottou et al 2016 - Optimization Methods for Large-Scale Machine Learning]]
+  * **Overviews**
-  * [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Chapter 8: Training Deep Models]]
+    * [[https://arxiv.org/pdf/1606.04838.pdf|Bottou et al 2016 - Optimization Methods for Large-Scale Machine Learning]]
-  * Blog post: [[https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2|Optimization For Training Deep Models Part I]]
+    * **[[https://arxiv.org/pdf/1906.06821|Sun et al 2019 - A Survey of Optimization Methods from a Machine Learning Perspective]]** Very good
-  * Blog post about Adam, AdamW, and AMSGrad: [[https://www.fast.ai/2018/07/02/adam-weight-decay/|2018 - AdamW and Super-convergence is now the fastest way to train neural nets]]
+    * [[https://arxiv.org/pdf/2211.15596|Kashyap 2022 - A survey of deep learning optimizers - first and second order methods]]
+  * **Book Chapters**
+    * [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Chapter 8: Training Deep Models]]
+  * **Blog posts**
+    * [[https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2|Optimization For Training Deep Models Part I]]
+    * Adam, AdamW, and AMSGrad: [[https://www.fast.ai/2018/07/02/adam-weight-decay/|2018 - AdamW and Super-convergence is now the fastest way to train neural nets]]
 ===== First-Order Optimizers =====
@@ Line 30: / Line 35: @@
   * [[https://arxiv.org/pdf/2006.00719.pdf|Yao et al 2020 - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning]]
   * Generalized SignSGD: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]] Doesn't assume Lipschitz-continuous gradients, an assumption that is violated in many deep learning models
-  * Lion: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
+  * **Lion**: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
+  * **Muon**: [[https://kellerjordan.github.io/posts/muon/|Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks]]. Muon's update implicitly measures each weight matrix in the spectral norm, rather than in the "max-of-max" norm implicit in Adam. From [[https://arxiv.org/pdf/2502.16982|Liu et al 2025]]: "Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam."
+    * Background on norms: [[https://arxiv.org/pdf/2409.20325|Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology]]
+    * Applied to larger scale LLM training: [[https://arxiv.org/pdf/2502.16982|Liu et al 2025 - Muon is Scalable for LLM Training]]
 ==== Provably Linearly-Convergent Optimizers ====
@@ Line 71: / Line 79: @@
   * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression. (See the comparison in [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]].)
   * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
+  * [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]]
 ===== Gradient-Free Optimizers =====