Table of Contents

Optimizers

Survey Papers

First-Order Optimizers

Modern Deep Learning Optimizers

Provably Linearly-Convergent Optimizers

Adam and related methods that use exponential moving averages of past squared gradients, such as RMSProp, Adadelta, and Nadam, can provably fail to converge even on one-dimensional convex problems (see Reddi et al. 2018 - On the Convergence of Adam and Beyond; follow-up here: Ward et al. 2020). The methods below attempt to improve this situation.
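
For intuition, the one-dimensional counterexample from Reddi et al. 2018 can be sketched roughly as follows; this is an illustration rather than a verbatim restatement of the paper's construction, and the exact constants and Adam hyperparameters used there differ:

```latex
% Sketch of the 1-D online convex counterexample (Reddi et al. 2018),
% on the feasible set x \in [-1, 1] with a constant C > 2:
f_t(x) =
  \begin{cases}
    C x & \text{if } t \bmod 3 = 1, \\
    -x  & \text{otherwise.}
  \end{cases}
% Over any three consecutive steps the losses sum to (C - 2)x, so the best
% fixed point is x = -1. For suitable hyperparameters, Adam's exponential
% moving average forgets the rare large gradient C too quickly, and the
% iterates drift toward the highly suboptimal point x = +1.
```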

Variance Reduction Techniques

Summary: Gower et al. 2020 - Variance-Reduced Methods for Machine Learning
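
For a concrete sense of what "variance reduction" means, here is a minimal SVRG-style sketch in the spirit of Johnson & Zhang 2013 (not code from the survey); the function names and the toy least-squares problem are illustrative assumptions:

```python
import numpy as np

def svrg(grad_i, x0, n, lr=0.05, epochs=30, inner_steps=None, seed=0):
    """Minimal SVRG sketch; grad_i(x, i) returns the gradient of the i-th loss term."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = inner_steps or n
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, recomputed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient: still unbiased, but its
            # variance shrinks as x and the snapshot approach the optimum.
            # (Vanilla SGD would just use g = grad_i(x, i) here.)
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x = x - lr * g
    return x

# Toy problem: least squares, minimize (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
x_true = rng.normal(size=5)
b = A @ x_true
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = svrg(grad_i, np.zeros(5), n=len(A))
print(np.linalg.norm(x_hat - x_true))  # should be close to zero
```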

Distributed Optimizers

See also Distributed Training

Learned Optimizers

See also Meta-Learning

Other Optimizers

Older Optimizers

For a history of SGD and related classical methods, see ML History - Optimization.

Second-Order Optimizers

Second-order optimizers enjoy much faster convergence rates than first-order optimizers: Newton's method converges quadratically near the optimum, and quasi-Newton methods converge superlinearly. See here and here.
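
As a minimal illustration of that rate (a sketch, not any particular library's implementation), here is Newton's method on a one-dimensional convex toy problem; the function names, objective, and starting point are chosen purely for the example:

```python
import math

def newton_minimize(grad, hess, x0, steps=6):
    """Newton's method for 1-D minimization: x <- x - f'(x) / f''(x)."""
    x = x0
    for k in range(steps):
        x = x - grad(x) / hess(x)
        print(f"step {k + 1}: x = {x:.15f}")
    return x

# Toy convex objective f(x) = exp(x) - 2x, minimized at x* = ln 2.
grad = lambda x: math.exp(x) - 2.0
hess = lambda x: math.exp(x)
x_star = newton_minimize(grad, hess, x0=2.0)
print("error:", abs(x_star - math.log(2.0)))
# Near the optimum the error is roughly squared at every step (quadratic
# convergence), versus a constant-factor reduction per step (linear
# convergence) for typical first-order methods on strongly convex problems.
```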

Gradient-Free Optimizers

Also known as black-box optimizers. See also Hyperparameter Tuning.