Blog: Learning Rate Schedules
Warning: blog post - may contain errors or conceptual misunderstandings.
Learning Rate Schedule
Convergence conditions
To guarantee convergence to a (local) optimum, the learning rate schedule should satisfy certain conditions; see the convergence conditions below.
Warm-up was originally proposed to handle training with very large batches for SGD (Goyal et al., 2017; Gotmare et al., 2019; Bernstein et al., 2018; Xiao et al., 2017).
Liu et al 2019 argues that warm-up is a variance reduction technique for the early stages of training, when the second moment of the gradients hasn't been estimated reliably yet.
Xiong et al 2020 argues that warm-up is necessary for post-norm transformers (the usual transformer) because post-norm transformers have unstable gradients. They argue that pre-norm transformers don't have this problem, don't need warm-up, and are much easier to train, reaching comparable performance when trained without warm-up.
For stochastic gradient descent, optimization theory says the step sizes should satisfy the conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see for example Bottou 1991). A common choice that satisfies these conditions is step sizes that decay as $1/t$. Step sizes satisfying these conditions used to be very common in machine learning and deep learning (for example here).
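As a quick numerical illustration (a sketch, not from the post; the base rate $\alpha_0 = 1$ is an arbitrary choice), a $1/t$ schedule shows both conditions at work: the partial sums of $\alpha_t$ keep growing without bound, while the partial sums of $\alpha_t^2$ level off (for $\alpha_t = 1/t$ they converge to $\pi^2/6$):

```python
def step_size(t, alpha0=1.0):
    """Step size decaying as 1/t; alpha0 is an assumed base rate."""
    return alpha0 / t

sum_alpha = 0.0     # partial sum of alpha_t: should grow without bound
sum_alpha_sq = 0.0  # partial sum of alpha_t^2: should level off
for t in range(1, 1_000_001):
    a = step_size(t)
    sum_alpha += a
    sum_alpha_sq += a * a

print(sum_alpha)     # grows like log(t) + Euler's constant, ~14.39 here
print(sum_alpha_sq)  # close to pi^2 / 6 ~ 1.6449
```

The first sum is the harmonic series (divergent), while the second is the Basel series (convergent), which is exactly the pair of behaviors the conditions require.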
In the original Adam paper, the proof of convergence assumes step sizes that decay as $1/\sqrt{t}$ (see the sentence above Theorem 4.1). (Note: there's a flaw in the proof, see optimizers and Zou 2019, where it was corrected. So $1/\sqrt{t}$ should work for reasonable hyperparameters.)
For the Transformer, people often use a different choice, such as linear decay with linear warm-up, which was used in BERT.
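A minimal sketch of this kind of schedule (hypothetical values, not taken from BERT's code): the learning rate rises linearly from 0 to a peak over the warm-up steps, then decays linearly back to 0 at the end of training.

```python
def linear_warmup_linear_decay(step, peak_lr=1e-4, warmup_steps=10_000,
                               total_steps=100_000):
    """Linear warm-up to peak_lr, then linear decay to 0 at total_steps.

    peak_lr, warmup_steps, and total_steps are illustrative values.
    """
    if step < warmup_steps:
        # Warm-up phase: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr down to 0, clamped at 0 past the end.
    remaining = total_steps - step
    return peak_lr * max(remaining, 0) / (total_steps - warmup_steps)

print(linear_warmup_linear_decay(0))        # 0.0
print(linear_warmup_linear_decay(10_000))   # peak: 1e-4
print(linear_warmup_linear_decay(55_000))   # halfway through decay: 5e-5
print(linear_warmup_linear_decay(100_000))  # 0.0
```

In practice this function would be evaluated once per optimizer step and the result assigned as the current learning rate (e.g. via a lambda-based scheduler in the training framework).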