ml:learning_rate
Current revision: 2024/02/06 00:31 by jmflanig (previous revision: 2023/03/08 22:19).
  * **Large-initial learning rate** (see Regularization - [[ml:
  * **Learning rate annealing**: halve (or otherwise reduce) the learning rate at prespecified intervals, or under certain conditions (discussed, for example, in [[https://
  * **Linear decay with linear warm-up (slanted triangular)**. Proposed in [[https://
  * **Cyclic learning rates**
    * [[https://
  * **Plateau Learning Rate**
    * Decrease the learning rate when the objective reaches a plateau.
  * **Other Schedules**
    * [[https://
  * **Warm restarts** [[https://
  * **Batch size** [[https://
  * [[https://
  * [[https://
  * [[https://
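The schedules listed above can be sketched as small pure functions. This is a minimal illustration, not code from any of the linked papers; the base rate, halving interval, warm-up fraction, and patience below are arbitrary example values.

```python
def step_decay(base_lr, step, halve_every=10):
    """Learning rate annealing: halve the rate every `halve_every` steps."""
    return base_lr * 0.5 ** (step // halve_every)

def slanted_triangular(base_lr, step, total_steps, warmup_frac=0.1):
    """Linear warm-up to base_lr, then linear decay back toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

def plateau_decay(lr, losses, patience=3, factor=0.5):
    """Plateau schedule: shrink the rate when the best recent loss is
    no better than the best loss seen before the window."""
    if len(losses) > patience and min(losses[-patience:]) >= min(losses[:-patience]):
        return lr * factor
    return lr

print(step_decay(0.1, 25))                               # 0.025 (halved twice)
print(slanted_triangular(0.1, 5, 100))                   # 0.05 (halfway through warm-up)
print(plateau_decay(0.1, [1.0, 0.9, 0.95, 0.96, 0.97]))  # 0.05 (loss plateaued)
```

In practice a framework scheduler (e.g. PyTorch's `torch.optim.lr_scheduler`) would wrap the same logic, but writing the schedules as functions of the step count makes their shapes easy to compare.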
==== Warm-up ====
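One well-known concrete warm-up schedule is the inverse-square-root schedule from "Attention Is All You Need": the rate rises linearly for `warmup` steps, then decays like $1/\sqrt{t}$. A minimal sketch, with the paper's commonly quoted constants used purely for illustration:

```python
def inverse_sqrt_lr(step, d_model=512, warmup=4000):
    """Linear warm-up for `warmup` steps, then 1/sqrt(step) decay.

    The rate peaks at step == warmup; d_model only scales the magnitude.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = inverse_sqrt_lr(4000)
print(peak)  # (512 * 4000) ** -0.5, roughly 7e-4
```

The `min` selects the warm-up branch before the peak and the decay branch after it, so the two pieces meet exactly at `step == warmup`.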
==== Automatically Setting the Learning Rate ====
  * [[https://
  * [[https://
  * **[[https://
  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * **[[https://
| + | |||
==== Parameter-Free Optimization ====
Optimization algorithms that do not require a stepsize or other hyperparameters.

  * [[https://
==== Convergence Conditions ====
  * **SGD**
    * For stochastic gradient descent, optimization theory says the step sizes should satisfy the conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (see, for example, [[https://
    * For an analysis of step-size schedules for SGD on non-convex problems, see [[http://
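The two conditions above can be checked numerically for the classic schedule $\alpha_t = 1/t$: the partial sums of $\alpha_t$ keep growing (the harmonic series diverges), while the partial sums of $\alpha_t^2$ stay bounded, converging to $\pi^2/6$. A quick illustration:

```python
import math

N = 1_000_000
partial = sum(1.0 / t for t in range(1, N + 1))           # grows like ln(N): unbounded
partial_sq = sum(1.0 / t ** 2 for t in range(1, N + 1))   # bounded: converges to pi^2 / 6

print(partial)                              # ~14.39 and still increasing with N
print(abs(partial_sq - math.pi ** 2 / 6))   # ~1e-6: essentially converged
```

So $\alpha_t = 1/t$ satisfies both conditions, whereas a constant step size violates the second and $\alpha_t = 1/t^2$ violates the first.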
  * **Adam**
    * In the original [[https://