Learning Rate

Overviews

Learning Rate Schedule

Papers

Warm-up

Warm-up was originally proposed to handle training with very large batches for SGD (Goyal et al., 2017; Gotmare et al., 2019; Bernstein et al., 2018; Xiao et al., 2017).

  • Liu et al. (2019) argue that warm-up acts as a variance-reduction technique in the early stages of training, when Adam's second-moment estimate (which sets the adaptive learning rate) is based on too few gradient samples to be reliable.
  • Xiong et al. (2020) argue that warm-up is necessary for post-norm transformers (the usual transformer) because post-norm transformers have unstable gradients. They argue that pre-norm transformers don't have this problem, don't need warm-up, and are much easier to train, with comparable performance when trained without warm-up.
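The warm-up discussed above is typically implemented as a linear ramp of the learning rate over the first few thousand steps. A minimal sketch (the function name and default values are illustrative, not from any particular library):

```python
def warmup_lr(step, peak_lr=1e-3, warmup_steps=1000):
    """Linear warm-up: ramp the learning rate from ~0 to peak_lr
    over the first `warmup_steps` steps, then hold it constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear ramp
    return peak_lr
```

In practice this would be combined with a decay schedule after the warm-up phase (see below).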

Automatically Setting the Learning Rate

Parameter-Free Optimization

Optimization algorithms that require no stepsize or other hyperparameters to be tuned.

Convergence Conditions

  • SGD
  • Adam
    • In the original Adam paper, the proof of convergence assumes stepsizes that decay as $1/\sqrt{t}$ (see the sentence above Theorem 4.1). (Note: there is a flaw in that proof; see optimizers and Zou (2019), where it was corrected. So $1/\sqrt{t}$ should work for reasonable hyperparameters.)
    • For the Transformer, people often use a different choice, such as linear decay with linear warm-up, which was used in BERT.
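The two schedules mentioned above can be sketched as follows (a rough illustration under assumed default values, not the exact settings used in any paper):

```python
import math

def inv_sqrt_lr(step, base_lr=1e-3):
    """Stepsize decaying as 1/sqrt(t), the rate assumed in the
    Adam convergence proof. (step + 1 avoids division by zero.)"""
    return base_lr / math.sqrt(step + 1)

def linear_warmup_linear_decay(step, peak_lr=1e-4,
                               warmup_steps=10_000, total_steps=1_000_000):
    """BERT-style schedule: linear warm-up to peak_lr over
    warmup_steps, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, frac)
```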

Software

ml/learning_rate.txt · Last modified: 2024/02/06 00:31 by jmflanig
