====== ml:optimizers ======

Previous revision: ml:optimizers [2025/03/06 18:40] – [Modern Deep Learning Optimizers] jmflanig
Current revision: ml:optimizers [2025/03/26 20:02] – [Second-Order Optimizers] jmflanig

Line 79:
  * [[https://en.wikipedia.org/wiki/Limited-memory_BFGS|L-BFGS]] Highly popular for training convex ML models such as logistic regression. (See comparison [[https://dl.acm.org/doi/10.3115/1118853.1118871|Malouf 2002]])
  * Apollo: [[https://arxiv.org/pdf/2009.13586.pdf|Ma 2021 - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization]] A diagonal quasi-Newton method
  * Sophia: [[https://arxiv.org/pdf/2305.14342|Liu et al 2023 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training]] Pre-conditions updates with a cheap, clipped estimate of the diagonal Hessian
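As a concrete illustration of the first entry above, here is a minimal sketch of fitting a logistic regression with SciPy's L-BFGS-B implementation. The synthetic data, seed, and tolerances are illustrative assumptions, not from this page; L-BFGS itself only requires a function that returns the loss and its gradient.

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary classification data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

def nll_and_grad(w):
    """Negative log-likelihood of logistic regression and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12  # guard against log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# L-BFGS builds a low-memory approximation to the inverse Hessian from
# recent gradient differences -- no explicit second derivatives needed.
res = minimize(nll_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
acc = np.mean((1.0 / (1.0 + np.exp(-(X @ res.x))) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Because the logistic loss is convex, the quasi-Newton curvature estimate is well behaved and convergence is typically much faster than plain gradient descent on problems like this.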
  
===== Gradient-Free Optimizers =====