ml:learning_rate: differences between two revisions
  * Previous revision: 2023/10/20 22:40, [Learning Rate Schedule], by jmflanig
  * Current revision: 2024/02/06 00:31, [Automatically Setting the Learning Rate], by jmflanig
  * **Cyclic learning rates**
    * [[https://arxiv.org/pdf/1506.01186.pdf|Smith 2015 - Cyclical Learning Rates for Training Neural Networks]]
    * [[https://arxiv.org/pdf/1708.07120.pdf|Smith & Topin 2017 - Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates]]
    * [[https://arxiv.org/pdf/1803.09820.pdf|Smith 2018 - A Disciplined Approach to Neural Network Hyper-parameters]] (see section 4 for an overview of cyclic learning rates)
  * **Plateau learning rate**
    * Decrease the learning rate when the objective reaches a plateau. See [[https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html|PyTorch - ReduceLROnPlateau]]
  * **Other schedules**
    * [[https://arxiv.org/pdf/1608.03983.pdf|Loshchilov & Hutter 2016 - SGDR: Stochastic Gradient Descent with Warm Restarts]] (used by nanoGPT)
  * **Warm restarts** [[https://arxiv.org/pdf/1608.03983.pdf|Loshchilov & Hutter 2016 - SGDR: Stochastic Gradient Descent with Warm Restarts]]
  * **Batch size** [[https://arxiv.org/pdf/1711.00489.pdf|Smith et al 2017 - Don't Decay the Learning Rate, Increase the Batch Size]]
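The SGDR schedule listed above can be sketched in a few lines of plain Python. The formula is the cosine annealing with warm restarts from Loshchilov & Hutter 2016; the ''eta_min''/''eta_max'' values, first-cycle length ''t_0'' and multiplier ''t_mult'' below are illustrative defaults, not values from the paper or from nanoGPT:

```python
import math

def sgdr_lr(step, eta_min=1e-5, eta_max=1e-1, t_0=10, t_mult=2):
    """Cosine annealing with warm restarts (SGDR, Loshchilov & Hutter 2016).

    eta_min, eta_max, t_0 (length of the first cycle in steps) and t_mult
    (cycle-length multiplier) are illustrative values chosen here, not
    defaults from the paper.
    """
    t_i, t_cur = t_0, step
    while t_cur >= t_i:        # locate which restart cycle `step` falls in
        t_cur -= t_i
        t_i *= t_mult
    # Within a cycle the rate anneals from eta_max toward eta_min,
    # then snaps back to eta_max at the next restart.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

At every restart the rate jumps back to ''eta_max'' and the cycle length grows by ''t_mult''; PyTorch ships the same schedule as ''CosineAnnealingWarmRestarts''.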

==== Automatically Setting the Learning Rate ====
  * [[https://arxiv.org/pdf/2105.14526.pdf|Iyer et al 2021 - LRTuner: A Learning Rate Tuner for Deep Neural Networks]] Uses a quadratic approximation in the direction of descent to pick the step size. Seems to work well. Similar to L4.
  * [[https://arxiv.org/pdf/2111.15317.pdf|Teng et al 2021 - AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop]]
  * **[[https://arxiv.org/pdf/2306.00144.pdf|Cutkosky et al 2023 - Mechanic: A Learning Rate Tuner]]**
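The quadratic-approximation idea behind LRTuner can be illustrated with classic one-dimensional quadratic interpolation along the descent direction. This is a generic textbook sketch of that idea, not LRTuner's actual algorithm; the function name and the fallback behavior are made up here:

```python
def quadratic_step(phi0, dphi0, alpha1, phi1):
    """Pick a step size by quadratic interpolation along the descent direction.

    Fits q(a) = phi0 + dphi0*a + c*a**2 through the loss at a=0 (phi0),
    its directional derivative there (dphi0, negative along a descent
    direction) and one trial evaluation phi1 = phi(alpha1), then returns
    the minimizer of q. Illustrative sketch only, not LRTuner itself.
    """
    c = (phi1 - phi0 - dphi0 * alpha1) / alpha1**2
    if c <= 0:
        # Curvature estimate is non-positive: the quadratic has no
        # minimum, so fall back to the trial step (a made-up policy).
        return alpha1
    return -dphi0 / (2.0 * c)

# For phi(a) = (a - 2)**2: phi(0)=4, phi'(0)=-4, and phi(1)=1,
# so quadratic_step(4.0, -4.0, 1.0, 1.0) returns the exact minimizer 2.0.
```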
  
==== Parameter-Free Optimization ====
ml/learning_rate.1697841609.txt.gz · Last modified: 2023/10/20 22:40 by jmflanig
