Learning Rate
Overviews
- Choosing the learning rate
- Learning rate schedules
- Blog: Learning Rate Schedules (warning: blog post; may contain errors or conceptual misunderstandings)
Learning Rate Schedule
- Convergence conditions: to guarantee convergence to a (local) optimum, the learning rate schedule should satisfy certain conditions; see Convergence Conditions below.
- Large-initial learning rate (see Regularization - Learning Rate Schedule)
- Learning rate annealing: halve (or otherwise reduce) the learning rate at prespecified intervals, or when certain conditions are met (discussed, for example, here)
- Linear decay with linear warm-up (slanted triangular). Proposed in Howard & Ruder 2018 as “slanted triangular learning rates”
- Cyclic learning rates
- Smith 2018 - A Disciplined Approach to Neural Network Hyper-parameters (see section 4 for an overview of cyclic learning rates)
- Plateau Learning Rate
- Decrease the learning rate when the objective reaches a plateau. See PyTorch - ReduceLROnPlateau
- Other Schedules
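As a sketch (function names and constants below are illustrative, not taken from the references above), the schedules listed here can be written as simple functions of the step count t:

```python
def step_decay(t, lr0=0.1, drop=0.5, every=10):
    """Annealing: multiply the rate by `drop` (e.g. halve it) every `every` steps."""
    return lr0 * drop ** (t // every)

def slanted_triangular(t, total=100, warmup_frac=0.1, lr_max=0.01):
    """Linear warm-up to lr_max, then linear decay to 0 (Howard & Ruder 2018 style)."""
    cut = int(total * warmup_frac)
    if t < cut:
        return lr_max * t / max(cut, 1)
    return lr_max * max(total - t, 0) / max(total - cut, 1)

def triangular_cyclic(t, lr_lo=1e-4, lr_hi=1e-2, half_period=10):
    """Cyclic: oscillate linearly between lr_lo and lr_hi."""
    cycle_pos = t % (2 * half_period)
    frac = cycle_pos / half_period if cycle_pos < half_period else 2 - cycle_pos / half_period
    return lr_lo + (lr_hi - lr_lo) * frac
```

The plateau schedule is stateful (it watches the loss), so it is usually taken from a library (e.g. PyTorch's ReduceLROnPlateau) rather than written as a pure function of t.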
Papers
Warm-up
Warm-up was originally proposed to handle training with very large batches for SGD (Goyal et al., 2017; Gotmare et al., 2019; Bernstein et al., 2018; Xiao et al., 2017).
- Liu et al 2019 argue that warm-up acts as a variance-reduction technique in the early stages of training, when the estimate of the gradients' second moment (which sets the adaptive learning rate) is still based on too few samples.
- Xiong et al 2020 argue that warm-up is necessary for post-norm transformers (the standard transformer) because post-norm transformers have unstable gradients early in training. They argue that pre-norm transformers don't have this problem, don't need warm-up, and are much easier to train, with comparable performance when trained without warm-up.
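A minimal sketch of linear warm-up, assuming the rate is simply held constant afterwards (in practice a decay phase would follow; the step counts and rates are illustrative):

```python
def warmup_lr(t, base_lr=1e-3, warmup_steps=4000):
    """Ramp the learning rate linearly from ~0 up to base_lr over
    warmup_steps steps, then hold it at base_lr."""
    if t < warmup_steps:
        return base_lr * (t + 1) / warmup_steps
    return base_lr
```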
Automatically Setting the Learning Rate
- Rolinek & Martius 2018 - L4: Practical Loss-based Stepsize Adaptation for Deep Learning Uses a linear approximation and a target loss value to pick the step size. For cross-entropy, 0 could be used as the target value. Similar to LRTuner.
- Chandra et al 2019 - Gradient Descent: The Ultimate Optimizer Stacked hyper-optimizers
- Iyer et al 2021 - LRTuner: A Learning Rate Tuner for Deep Neural Networks Uses a quadratic approximation in the direction of descent to pick the step size. Seems to work well. Similar to L4.
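The fit-a-local-model-to-pick-the-stepsize idea behind L4 and LRTuner can be sketched as a one-dimensional quadratic fit along the descent direction. This is an illustrative line search under simplifying assumptions, not the algorithm from either paper:

```python
def quadratic_stepsize(loss, x, g, trial=1e-2):
    """Fit a 1-D quadratic to the loss along the descent direction -g and
    return the stepsize minimizing the fit (illustrative sketch only).

    loss: callable mapping a parameter vector (list of floats) to a float.
    x: current parameters; g: gradient of loss at x.
    """
    d = [-gi for gi in g]                            # descent direction
    f0 = loss(x)
    slope = sum(gi * di for gi, di in zip(g, d))     # directional derivative (negative)
    f1 = loss([xi + trial * di for xi, di in zip(x, d)])
    # Model f(a) ~ f0 + slope*a + c*a^2; solve for c using the trial point.
    c = (f1 - f0 - slope * trial) / trial ** 2
    if c <= 0:             # no positive curvature detected; fall back to trial step
        return trial
    return -slope / (2 * c)  # minimizer of the fitted quadratic
```

On an exactly quadratic loss the fitted model is exact, so the returned step jumps straight to the minimum along the search direction.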
Parameter-Free Optimization
Optimization algorithms that don't require a stepsize or other hyperparameters.
Convergence Conditions
- SGD
- For stochastic gradient descent, optimization theory says the step sizes should satisfy the conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see for example Bottou 1991). A common choice that satisfies these conditions is stepsizes that decay as $1/t$. Stepsizes satisfying these conditions were once very common in machine learning and deep learning (for example here).
- For an analysis of stepsize schedules of SGD on non-convex problems, see Gower et al 2021 - SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation
- Adam
- In the original Adam paper, the proof of convergence assumes stepsizes that decay as $1/\sqrt{t}$ (see the sentence above Theorem 4.1). (Note: there is a flaw in the proof (see optimizers and Zou 2019); the proof was corrected here. So $1/\sqrt{t}$ should work for reasonable hyperparameters.)
- For the Transformer, people often use a different choice, such as linear decay with linear warmup, which was used in BERT.
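As a concrete illustration (the base rates are made up), the two decay rules above as code; the $1/t$ rule is the classic choice satisfying both conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$):

```python
def sgd_lr(t, lr0=0.1):
    """1/t decay: the sum of rates diverges, the sum of squares converges."""
    return lr0 / (t + 1)

def adam_lr(t, lr0=1e-3):
    """1/sqrt(t) decay, as assumed in the Adam convergence analysis."""
    return lr0 / (t + 1) ** 0.5
```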
Software
Related Pages
- Regularization - Learning Rate Schedule.
ml/learning_rate.txt · Last modified: 2024/02/06 00:31 by jmflanig