====== Learning Rate ======

===== Overviews =====

  * Choosing the learning rate
    * [[https://www.jeremyjordan.me/nn-learning-rate/|Blog Post: Setting the learning rate of your neural network]]
  * Learning rate schedules
    * Blog: [[https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1|Learning Rate Schedules]] Warning: blog post - may contain errors or conceptual misunderstandings.

===== Learning Rate Schedule =====

  * **Convergence conditions** To guarantee convergence to a (local) optimum, the learning rate schedule should satisfy certain conditions; see //Convergence Conditions// below.
  * **Large initial learning rate** (see Regularization - [[ml:regularization#Learning Rate Schedule]])
  * **Learning rate annealing** Halve (or otherwise reduce) the learning rate at prespecified intervals, or when certain conditions are met (discussed, for example, [[https://arxiv.org/pdf/1706.09733.pdf|here]])
  * **Linear decay with linear warm-up (slanted triangular)** Proposed in [[https://arxiv.org/pdf/1801.06146.pdf|Howard & Ruder 2018]] as "slanted triangular learning rates"
  * **Cyclic learning rates**
    * [[https://arxiv.org/pdf/1506.01186.pdf|Smith 2015 - Cyclical Learning Rates for Training Neural Networks]]
    * [[https://arxiv.org/pdf/1708.07120.pdf|Smith & Topin 2017 - Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates]]
    * [[https://arxiv.org/pdf/1803.09820.pdf|Smith 2018 - A Disciplined Approach to Neural Network Hyper-parameters]] (see section 4 for an overview of cyclic learning rates)
  * **Plateau Learning Rate**
    * Decrease the learning rate when the objective reaches a plateau.
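A minimal sketch of the slanted triangular shape (linear warm-up, then linear decay); the function and parameter names here are illustrative, not from the paper or any library:

```python
def slanted_triangular_lr(step, total_steps, max_lr=0.01, warmup_frac=0.1):
    """Linear warm-up to max_lr, then linear decay back to zero.

    Illustrative sketch of a slanted-triangular schedule; parameter
    names are hypothetical, not a library API.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # warm-up phase: rate rises linearly from 0 to max_lr
        return max_lr * step / warmup_steps
    # decay phase: rate falls linearly to 0 at total_steps
    remaining = total_steps - step
    return max_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

Over 100 steps with the default 10% warm-up fraction, the rate rises from 0 to ''max_lr'' at step 10 and falls linearly back to 0 at step 100.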
    * See [[https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html|PyTorch - ReduceLROnPlateau]]
  * **Warm restarts** [[https://arxiv.org/pdf/1608.03983.pdf|Loshchilov & Hutter 2016 - SGDR: Stochastic Gradient Descent with Warm Restarts]] Used by nanoGPT
  * **Batch size** [[https://arxiv.org/pdf/1711.00489.pdf|Smith et al 2017 - Don't Decay the Learning Rate, Increase the Batch Size]]

==== Papers ====

  * [[https://arxiv.org/pdf/2103.12682.pdf|Lewkowycz 2021 - How to decay your learning rate]]
  * [[https://arxiv.org/pdf/2202.04509.pdf|d’Ascoli et al 2022 - Optimal learning rate schedules in high-dimensional non-convex optimization problems]]
  * [[https://arxiv.org/pdf/1807.05031.pdf|Jastrzębski et al 2018 - On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length]]

==== Warm-up ====

Warm-up was originally proposed to handle training with very large batches in SGD (Goyal et al., 2017; Gotmare et al., 2019; Bernstein et al., 2018; Xiao et al., 2017).

  * [[https://arxiv.org/pdf/1908.03265.pdf|Liu et al 2019]] argues that warm-up acts as a variance reduction technique in the early stages of training, when the second-moment estimate is based on too few samples to be reliable.
  * [[https://arxiv.org/pdf/2002.04745.pdf|Xiong et al 2020]] argues that warm-up is necessary for post-norm transformers (the usual transformer) because post-norm transformers have unstable gradients. They argue that pre-norm transformers don't have this problem, don't need warm-up, and are much easier to train, with comparable performance when trained without warm-up.

==== Automatically Setting the Learning Rate ====

  * [[https://arxiv.org/pdf/1802.05074.pdf|Rolinek & Martius 2018 - L4: Practical Loss-based Stepsize Adaptation for Deep Learning]] Uses a linear approximation and a target loss value to pick the step size.
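As a sketch of the linear-approximation idea (closely related to the stochastic Polyak step size below), assuming hypothetical function and argument names: a first-order model of the loss along the negative gradient is solved for the step size that would reach the target loss.

```python
def linear_approx_step_size(loss, grad_sq_norm, target_loss=0.0, eps=1e-8):
    """Step size from a linear model of the loss (a sketch, not the full L4 algorithm).

    If L(theta - alpha * g) ~= L(theta) - alpha * ||g||^2, then
    alpha = (L(theta) - L_target) / ||g||^2 would reach L_target in one
    step if the loss were exactly linear along the gradient direction.
    """
    return max(0.0, loss - target_loss) / (grad_sq_norm + eps)
```

With ''loss=2.0'', ''grad_sq_norm=4.0'', and ''target_loss=0.0'' this gives a step size of 0.5; the actual L4 method adds further details (e.g. how the target value is maintained during training), so see the paper for the full algorithm.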
For cross-entropy, 0 could be used as the target value. Similar to LRTuner.
  * [[https://arxiv.org/pdf/1909.13371.pdf|Chandra et al 2019 - Gradient Descent: The Ultimate Optimizer]] Stacked hyper-optimizers
  * **[[https://arxiv.org/pdf/2002.10542.pdf|Loizou et al 2020 - Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence]]**
  * [[https://arxiv.org/pdf/2103.12623.pdf|Carvalho et al 2021 - Evolving Learning Rate Optimizers for Deep Neural Networks]]
  * [[https://arxiv.org/pdf/2105.10762.pdf|Jin et al 2021 - AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly]]
  * [[https://arxiv.org/pdf/2105.14526.pdf|Iyer et al 2021 - LRTuner: A Learning Rate Tuner for Deep Neural Networks]] Uses a quadratic approximation in the direction of descent to pick the step size. Seems to work well. Similar to L4.
  * [[https://arxiv.org/pdf/2111.15317.pdf|Teng et al 2021 - AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop]]
  * **[[https://arxiv.org/pdf/2306.00144.pdf|Cutkosky et al 2023 - Mechanic: A Learning Rate Tuner]]**

==== Parameter-Free Optimization ====

Optimization algorithms that don't require a step size or other hyperparameters.

  * [[https://arxiv.org/pdf/2302.12022.pdf|Ivgi et al 2023 - DoG is SGD’s Best Friend: A Parameter-Free Dynamic Step Size Schedule]]

==== Convergence Conditions ====

  * **SGD**
    * For stochastic gradient descent, optimization theory says the step sizes should satisfy the conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see for example [[https://leon.bottou.org/publications/pdf/nimes-1991.pdf|Bottou 1991]]). A common choice that satisfies these conditions is step sizes that decay as $1/t$. Step sizes satisfying these conditions were very common in machine learning and deep learning (for example [[https://arxiv.org/pdf/1603.01354.pdf|here]]).
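As a quick numerical illustration (not from the cited sources) of the two conditions for the choice $\alpha_t = 1/t$: the first partial sum grows without bound, while the second stays bounded.

```python
def lr_partial_sums(T):
    """Partial sums of alpha_t and alpha_t^2 for the schedule alpha_t = 1/t.

    Illustrates the two convergence conditions: the first sum diverges
    (it grows like log T), the second converges (to pi^2/6).
    """
    s_lr = sum(1.0 / t for t in range(1, T + 1))        # sum of alpha_t
    s_sq = sum(1.0 / t ** 2 for t in range(1, T + 1))   # sum of alpha_t^2
    return s_lr, s_sq
```

For example, ''lr_partial_sums(10**5)'' gives roughly ''(12.09, 1.6449)'': the first component keeps growing as $T$ increases, while the second approaches $\pi^2/6 \approx 1.645$.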
    * For an analysis of step size schedules for SGD on non-convex problems, see [[http://proceedings.mlr.press/v130/gower21a/gower21a.pdf|Gower et al 2021 - SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation]]
  * **Adam**
    * In the original [[https://arxiv.org/pdf/1412.6980.pdf|Adam paper]], the proof of convergence assumes step sizes that decay as $1/\sqrt{t}$ (see the sentence above Theorem 4.1). (Note: there is a flaw in the proof; see [[optimizers]] and [[https://openaccess.thecvf.com/content_CVPR_2019/papers/Zou_A_Sufficient_Condition_for_Convergences_of_Adam_and_RMSProp_CVPR_2019_paper.pdf|Zou 2019]], which was corrected [[https://arxiv.org/pdf/2003.02395.pdf|here]]. So $1/\sqrt{t}$ should work for reasonable hyperparameters.)
    * For the Transformer, a different choice is often used, such as linear decay with linear warm-up, which was used in BERT.

===== Software =====

  * [[https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#schedules|Huggingface - Learning Rate Schedules (PyTorch)]]

===== Related Pages =====

  * [[NN Training]]
  * Regularization - [[ml:regularization#Learning Rate Schedule]]