ml:regularization
Last modified: 2024/03/14 07:38 by jmflanig
===== Regularization in Deep Learning =====

==== Dropout ====
  * Dropout: [[https://
  * DropConnect:
  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * LayerDrop: [[https://
  * [[https://
  * [[https://
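The variants linked above all build on the same masking idea; as a reference point, standard inverted dropout can be sketched in a few lines (a minimal NumPy sketch, not any particular paper's implementation — the function name and default rate are illustrative):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training and rescale survivors by 1/(1-p), so the expected value of
    the output matches the input and test time needs no rescaling."""
    if not training or p == 0.0:
        return x  # dropout is a no-op at evaluation time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)
```

Scaling the survivors up at train time (rather than scaling activations down at test time) is the convention modern frameworks use.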
==== Early-stopping ====
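Early stopping halts training once validation loss stops improving. A minimal patience-based sketch (function name and defaults are illustrative, not a specific library's API):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which to stop: the first epoch that is
    `patience` epochs past the best validation loss seen so far."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch  # new best: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1  # never triggered: train to the end
```

In practice one also checkpoints the weights from `best_epoch` and restores them after stopping.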
==== Max-Norm Regularization ====
==== $L_p$ Regularization ====

$L_p$ regularization adds an $L_p$ norm of the parameters to the training objective. Popular choices of $p$ are $L_2$ (weight decay), $L_1$ (lasso; induces sparsity while remaining convex), $L_0$ (counts the number of non-zero parameters; highly non-convex and difficult to optimize), and $L_\infty$ (max-norm regularization).
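In code, each choice of $p$ is just a different penalty term added to the data loss. A minimal NumPy sketch (the function name and $\lambda$ value are illustrative):

```python
import numpy as np

def lp_penalty(w, p, lam=1e-2):
    """Penalty term added to the training loss for a parameter vector w."""
    if p == 0:
        return lam * np.count_nonzero(w)   # L0: count of non-zeros (non-differentiable)
    if p == np.inf:
        return lam * np.max(np.abs(w))     # L_inf: penalizes the largest weight
    return lam * np.sum(np.abs(w) ** p)    # p=1: lasso, p=2: weight decay

# total objective = data loss + penalty, e.g.:
w = np.array([0.0, -2.0, 1.0])
total_loss = 0.5 + lp_penalty(w, p=2)
```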
| + | |||
| + | * $L_0$ regularization | ||
| + | * [[https:// | ||
| + | |||
==== Sparsity-Inducing Regularizers ====

  * **Structured Sparsity**
    * [[https://
    * [[https:// Networks]]
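A standard way to induce structured sparsity is a group ($\ell_{2,1}$) penalty: an $L_1$ norm over groups of an $L_2$ norm within each group, which drives entire groups (e.g. whole filters or neurons) to zero together rather than individual weights. A minimal sketch with rows as the groups (the grouping choice and $\lambda$ are illustrative):

```python
import numpy as np

def group_lasso_penalty(W, lam=1e-2):
    """Sum of the L2 norms of the rows of W (the l2,1 norm). Plain L1
    zeroes individual entries; this zeroes whole rows at once."""
    return lam * np.sum(np.linalg.norm(W, axis=1))
```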
==== Label Smoothing ====
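Label smoothing mixes the one-hot target with the uniform distribution over classes, discouraging over-confident predictions. A minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Soft targets: the true class gets 1 - eps + eps/K and every other
    class gets eps/K, where K is the number of classes."""
    onehot = np.eye(num_classes)[y]
    return (1.0 - eps) * onehot + eps / num_classes
```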
==== Learning Rate Schedule ====
**Regularizing effect:** A large initial learning rate can have a regularizing effect and is commonly used for training Transformers: "Modern state-of-the-art architectures typically start with a large learning rate and anneal it at a point when the model's fit to the training data plateaus [25, 32, 17, 42]. Meanwhile, models trained using only small learning rates have been found to generalize poorly despite enjoying faster optimization of the training loss."
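The "large then annealed" recipe quoted above can be sketched as a step schedule with linear warmup (all constants are illustrative; frameworks ship equivalents such as milestone/step-decay schedulers):

```python
def lr_at(step, base_lr=1.0, warmup=100, milestones=(1000, 2000), gamma=0.1):
    """Linear warmup to base_lr, then multiply the rate by gamma at each
    milestone -- i.e. train large early, anneal when progress plateaus."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr
```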
</

=== Papers ===
  * [[http://
  * [[https://
  * [[https://
===== Related Pages =====

  * [[Ensembling]]
  * [[ml:
ml/regularization.1615458173.txt.gz · Last modified: 2023/06/15 07:36 (external edit)