ml:optimizers
  * Generalized SignSGD: [[https://arxiv.org/pdf/2208.11195.pdf|Crawshaw et al 2022 - Robustness to Unbounded Smoothness of Generalized SignSGD]]. Does not assume Lipschitz gradients, an assumption violated by many deep learning models.
  * **Lion**: [[https://arxiv.org/pdf/2302.06675.pdf|Chen et al 2023 - Symbolic Discovery of Optimization Algorithms]]
  * **Muon**: [[https://kellerjordan.github.io/posts/muon/|Jordan et al 2024 - Muon: An optimizer for hidden layers in neural networks]]. In its update, Muon implicitly uses the spectral norm of the weight matrices, rather than Adam's "max-of-max" norm. From [[https://arxiv.org/pdf/2502.16982|Liu 2025]]: "Weights of neural networks are used as operators on the input space or the hidden space, which are usually (locally) Euclidean (Cesista 2024), so the norm constraint on weights should be an induced operator norm (or spectral norm for weight matrices). In this sense, the norm constraint offered by Muon is more reasonable than that offered by Adam."
    * Background on norms: [[https://arxiv.org/pdf/2409.20325|Bernstein & Newhouse 2024 - Old Optimizer, New Norm: An Anthology]]
    * Applied to larger-scale LLM training: [[https://arxiv.org/pdf/2502.16982|Liu et al 2025 - Muon is Scalable for LLM Training]]
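
The spectral-norm view of Muon can be made concrete: instead of an exact SVD, Muon approximately orthogonalizes each gradient matrix (pushing its singular values toward 1) with a few Newton-Schulz iterations. A minimal NumPy sketch, assuming the quintic coefficients from Jordan et al 2024; the function name and the pure-NumPy setting are illustrative, not Muon's actual implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest (semi-)orthogonal matrix,
    i.e. push its singular values toward 1, via a quintic
    Newton-Schulz iteration (coefficients from Jordan et al 2024)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate in the "wide" orientation so X @ X.T is small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic polynomial in X
    return X.T if transposed else X
```

Muon then applies this orthogonalized momentum as the update for each hidden-layer weight matrix; the iteration is cheap (a handful of matmuls) and tolerant of low precision, which is why it is used in place of an SVD.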
  
  * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression. (See comparison in [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]])
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
  * Sophia: [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]]
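
As a small illustration of the L-BFGS use case above, here is a toy logistic-regression fit using SciPy's L-BFGS-B implementation. The synthetic data, weights, and helper name are made up for the sketch; only function values and gradients are supplied, and the optimizer builds its limited-memory inverse-Hessian approximation internally:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic, linearly separable logistic-regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)

def nll_and_grad(w):
    """Mean negative log-likelihood of logistic regression and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    nll = -np.mean(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return nll, grad

# jac=True tells SciPy the objective returns (value, gradient) as a pair
res = minimize(nll_and_grad, x0=np.zeros(3), jac=True, method="L-BFGS-B")
```

Since the objective is convex, L-BFGS converges quickly here; the recovered weights point in the same direction as `true_w` (on separable data the norm keeps growing, so only the direction is identifiable).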
  
===== Gradient-Free Optimizers =====
ml/optimizers.1741286085.txt.gz · Last modified: 2025/03/06 18:34 by jmflanig
