Learning Rate
Overviews
- Choosing the learning rate
- Learning rate schedules
- Blog: Learning Rate Schedules (warning: blog post; may contain errors or conceptual misunderstandings)
Learning Rate Schedule
- Convergence conditions: to guarantee convergence to a (local) optimum, the learning rate schedule should satisfy certain conditions; see Convergence Conditions below.
- Large-initial learning rate (see Regularization - Learning Rate Schedule)
- Learning rate annealing: halve (or otherwise reduce) the learning rate at prespecified intervals, or when certain conditions are met (discussed, for example, here)
- Linear decay with linear warm-up (slanted triangular). Proposed in Howard & Ruder 2018 as “slanted triangular learning rates”
- Cyclic learning rates
- Smith 2018 - A Disciplined Approach to Neural Network Hyper-parameters (see section 4 for an overview of cyclic learning rates)
- Plateau Learning Rate
- Decrease the learning rate when the objective reaches a plateau. See PyTorch - ReduceLROnPlateau
- Other Schedules
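As a sketch (function names and constants below are illustrative, not taken from the references above), the schedules listed here can be written as simple functions of the step count t:

```python
def step_decay(t, lr0=0.1, drop=0.5, every=10):
    """Annealing: multiply the rate by `drop` (e.g. halve it) every `every` steps."""
    return lr0 * drop ** (t // every)

def slanted_triangular(t, total=100, warmup_frac=0.1, lr_max=0.01):
    """Linear warm-up to lr_max, then linear decay to 0 (Howard & Ruder 2018 style)."""
    cut = int(total * warmup_frac)
    if t < cut:
        return lr_max * t / max(cut, 1)
    return lr_max * max(total - t, 0) / max(total - cut, 1)

def triangular_cyclic(t, lr_lo=1e-4, lr_hi=1e-2, half_period=10):
    """Cyclic: oscillate linearly between lr_lo and lr_hi."""
    cycle_pos = t % (2 * half_period)
    frac = cycle_pos / half_period if cycle_pos < half_period else 2 - cycle_pos / half_period
    return lr_lo + (lr_hi - lr_lo) * frac
```

The plateau schedule is stateful (it watches the loss), so it is usually taken from a library (e.g. PyTorch's ReduceLROnPlateau) rather than written as a pure function of t.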
Papers
Warm-up
Warm-up was originally proposed to handle training with very large batches for SGD (Goyal et al., 2017; Gotmare et al., 2019; Bernstein et al., 2018; Xiao et al., 2017).
- Liu et al 2019 argue that warm-up acts as a variance-reduction technique in the early stages of training, when the estimate of the gradients' second moment (which sets the adaptive learning rate) is still based on too few samples.
- Xiong et al 2020 argue that warm-up is necessary for post-norm transformers (the standard transformer) because post-norm transformers have unstable gradients early in training. They argue that pre-norm transformers don't have this problem, don't need warm-up, and are much easier to train, with comparable performance when trained without warm-up.
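A minimal sketch of linear warm-up, assuming the rate is simply held constant afterwards (in practice a decay phase would follow; the step counts and rates are illustrative):

```python
def warmup_lr(t, base_lr=1e-3, warmup_steps=4000):
    """Ramp the learning rate linearly from ~0 up to base_lr over
    warmup_steps steps, then hold it at base_lr."""
    if t < warmup_steps:
        return base_lr * (t + 1) / warmup_steps
    return base_lr
```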
Automatically Setting the Learning Rate
- Rolinek & Martius 2018 - L4: Practical Loss-based Stepsize Adaptation for Deep Learning Uses a linear approximation and a target loss value to pick the step size. For cross-entropy, 0 could be used as the target value. Similar to LRTuner.
- Chandra et al 2019 - Gradient Descent: The Ultimate Optimizer Stacked hyper-optimizers
- Iyer et al 2021 - LRTuner: A Learning Rate Tuner for Deep Neural Networks Uses a quadratic approximation in the direction of descent to pick the step size. Seems to work well. Similar to L4.
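The fit-a-local-model-to-pick-the-stepsize idea behind L4 and LRTuner can be sketched as a one-dimensional quadratic fit along the descent direction. This is an illustrative line search under simplifying assumptions, not the algorithm from either paper:

```python
def quadratic_stepsize(loss, x, g, trial=1e-2):
    """Fit a 1-D quadratic to the loss along the descent direction -g and
    return the stepsize minimizing the fit (illustrative sketch only).

    loss: callable mapping a parameter vector (list of floats) to a float.
    x: current parameters; g: gradient of loss at x.
    """
    d = [-gi for gi in g]                            # descent direction
    f0 = loss(x)
    slope = sum(gi * di for gi, di in zip(g, d))     # directional derivative (negative)
    f1 = loss([xi + trial * di for xi, di in zip(x, d)])
    # Model f(a) ~ f0 + slope*a + c*a^2; solve for c using the trial point.
    c = (f1 - f0 - slope * trial) / trial ** 2
    if c <= 0:             # no positive curvature detected; fall back to trial step
        return trial
    return -slope / (2 * c)  # minimizer of the fitted quadratic
```

On an exactly quadratic loss the fitted model is exact, so the returned step jumps straight to the minimum along the search direction.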
Parameter-Free Optimization
Optimization algorithms that don't require a stepsize or other hyperparameters.
Convergence Conditions
- SGD
- For stochastic gradient descent, optimization theory says the step sizes should satisfy the conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see for example Bottou 1991). A common choice that satisfies these conditions is stepsizes that decay as $1/t$. Stepsizes satisfying these conditions were once very common in machine learning and deep learning (for example here).
- For an analysis of stepsize schedules of SGD on non-convex problems, see Gower et al 2021 - SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation
- Adam
- In the original Adam paper, the proof of convergence assumes stepsizes that decay as $1/\sqrt{t}$ (see the sentence above Theorem 4.1). (Note: there is a flaw in the proof (see optimizers and Zou 2019); the proof was corrected here. So $1/\sqrt{t}$ should work for reasonable hyperparameters.)
- For the Transformer, people often use a different choice, such as linear decay with linear warmup, which was used in BERT.
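As a concrete illustration (the base rates are made up), the two decay rules above as code; the $1/t$ rule is the classic choice satisfying both conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$):

```python
def sgd_lr(t, lr0=0.1):
    """1/t decay: the sum of rates diverges, the sum of squares converges."""
    return lr0 / (t + 1)

def adam_lr(t, lr0=1e-3):
    """1/sqrt(t) decay, as assumed in the Adam convergence analysis."""
    return lr0 / (t + 1) ** 0.5
```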
Software
Related Pages
- Regularization - Learning Rate Schedule.
ml/learning_rate.txt · Last modified: 2024/02/06 00:31 by jmflanig