  * **Overviews**
    * [[https://arxiv.org/pdf/1606.04838.pdf|Bottou et al 2016 - Optimization Methods for Large-Scale Machine Learning]]
    * **[[https://arxiv.org/pdf/1906.06821|Sun et al 2019 - A Survey of Optimization Methods from a Machine Learning Perspective]]** Very good
    * [[https://arxiv.org/pdf/2211.15596|Kashyap 2022 - A survey of deep learning optimizers - first and second order methods]]
  * **Book Chapters**
    * [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Chapter 8: Training Deep Models]]
  * Generalized SignSGD: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]] Doesn't assume Lipschitz gradients, an assumption violated by many deep learning models
  * **Lion**: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
  * **Muon**: [[https://kellerjordan.github.io/posts/muon/|Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks]]. In its update, Muon implicitly uses the spectral norm of the network's weight matrices, rather than the "max-of-max" norm of Adam. From [[https://arxiv.org/pdf/2502.16982|Liu 2025]]: "Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam."
    * Background on norms: [[https://arxiv.org/pdf/2409.20325|Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology]]
    * Applied to larger-scale LLM training: [[https://arxiv.org/pdf/2502.16982|Liu et al 2025 - Muon is Scalable for LLM Training]]
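Muon's core update is simple enough to sketch: maintain a momentum buffer of the gradient, then approximately orthogonalize it (zeroth-power / semi-orthogonalize its singular values toward 1) with a Newton-Schulz iteration before applying it. A minimal NumPy sketch follows, using the quintic coefficients given in Jordan's post; the function names and hyperparameter values here are illustrative, not the reference implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G: push its singular values toward 1.

    Quintic Newton-Schulz iteration with the coefficients from the Muon
    post; it converges to a band around 1 rather than exactly to 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # keep X @ X.T the smaller product
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_update(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon step for a single weight matrix (sketch, no Nesterov)."""
    momentum = beta * momentum + grad
    weight = weight - lr * newton_schulz_orthogonalize(momentum)
    return weight, momentum
```

Because the update direction is (nearly) semi-orthogonal, every singular direction of the momentum contributes with roughly equal magnitude, which is the spectral-norm perspective quoted above.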
  
==== Provably Linearly-Convergent Optimizers ====
  * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression. (See comparison in [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]])
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
  * Sophia: [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]]
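The classic L-BFGS use case (fitting a convex model like logistic regression) takes only a few lines with SciPy's implementation. A sketch on synthetic data; the dataset and variable names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic binary-classification data from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.uniform(size=200)).astype(float)

def nll(w):
    # Negative log-likelihood; logaddexp(0, z) = log(1 + e^z), numerically stable.
    z = X @ w
    return np.sum(np.logaddexp(0, z) - y * z)

def grad(w):
    p = 1 / (1 + np.exp(-(X @ w)))   # predicted probabilities
    return X.T @ (p - y)

# L-BFGS needs only the objective and its gradient; it builds a
# low-rank curvature approximation from recent (step, gradient-change) pairs.
res = minimize(nll, np.zeros(3), jac=grad, method="L-BFGS-B")
```

On a smooth convex objective like this, convergence typically takes only tens of function/gradient evaluations, which is why L-BFGS remains the default for maximum-entropy and logistic-regression training.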
  
===== Gradient-Free Optimizers =====
ml/optimizers · Last modified: 2025/03/26 20:02 by jmflanig
