ml:regularization (current revision 2024/03/14, jmflanig)
===== Regularization in Deep Learning =====
==== Dropout ====
  * Dropout: [[https://arxiv.org/pdf/1207.0580.pdf|Hinton et al 2012 - Improving Neural Networks by Preventing Co-adaptation of Feature Detectors]]
  * DropConnect: [[http://proceedings.mlr.press/v28/wan13.pdf|Wan et al 2013 - Regularization of Neural Networks using DropConnect]]
  * [[https://nlp.stanford.edu/pubs/sidaw13fast.pdf|Wang & Manning 2013 - Fast dropout training]] Shows that dropout is an approximation to an objective, and directly optimizes a fast approximation to this objective. "We show how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective. This approximation, justified by the central limit theorem and empirical evidence, gives an order of magnitude speedup and more stability"
  * [[https://papers.nips.cc/paper/2013/file/7b5b23f4aadf9513306bcd59afb6e4c9-Paper.pdf|Adaptive Dropout]] Learns a dropout network during training ([[https://paperswithcode.com/method/adaptive-dropout|summary]])
  * [[https://arxiv.org/pdf/1606.01305.pdf|Zoneout (Krueger et al 2016)]] (For regularizing RNNs) "At each timestep, zoneout stochastically forces some hidden units to maintain their previous values"
  * [[https://arxiv.org/pdf/1909.11299.pdf|Lee et al 2019 - Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models]]
  * LayerDrop: [[https://arxiv.org/pdf/1909.11556.pdf|Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout]] Regularizes networks to allow extraction of smaller networks of any depth at test time without needing to finetune.
  * [[https://arxiv.org/pdf/2101.01761.pdf|Pham & Le 2021 - AutoDropout: Learning Dropout Patterns to Regularize Deep Networks]] Shows an improvement of 1-2 BLEU for machine translation. Downside: computationally expensive.
  * [[https://arxiv.org/pdf/2106.14448.pdf|Liang et al 2021 - R-Drop: Regularized Dropout for Neural Networks]] Reported to work almost universally, though see the GitHub issues about reproducibility.
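The shared mechanism behind these variants can be illustrated with the standard "inverted dropout" formulation (a generic sketch, not taken from any one paper above): each unit is zeroed with probability $p$ at training time, and the survivors are rescaled so that no correction is needed at test time.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with probability p at train time,
    scaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones(10000)
y = dropout(x, p=0.5)
# roughly half the entries of y are zero, but y.mean() stays near 1.0
```

At test time (`training=False`) the input passes through unchanged, which is why this formulation is preferred over scaling activations by $1-p$ at inference.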
  
==== Early-stopping ====
  
==== Max-Norm Regularization ====

==== $L_p$ Regularization ====
$L_p$ regularization adds an $L_p$-norm penalty term to the objective. Popular choices of $p$ are $L_2$ (weight decay), $L_1$ (the Lasso, which induces sparsity while remaining convex), $L_0$ (which counts the number of non-zero parameters; very non-convex and difficult to optimize), and $L_\infty$ (max-norm regularization). Combining $L_1$ and $L_2$ regularization (adding both penalty terms to the objective) is called the elastic net, and it sometimes performs better than either one alone.

  * $L_0$ regularization
    * [[https://arxiv.org/pdf/1712.01312.pdf|Louizos et al 2018 - Learning Sparse Neural Networks through L0 Regularization]]

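As a minimal illustration (a generic sketch, not tied to any paper above), the $L_1$ and $L_2$ penalties are just extra terms added to the data loss; setting both coefficients non-zero gives the elastic net.

```python
import numpy as np

def penalized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Add L1 and L2 penalty terms to a data loss.
    l1 > 0 alone: Lasso; l2 > 0 alone: weight decay; both: elastic net."""
    w = np.concatenate([p.ravel() for p in weights])
    return data_loss + l1 * np.abs(w).sum() + l2 * np.square(w).sum()

weights = [np.array([0.5, -0.5]), np.array([[1.0]])]
loss = penalized_loss(2.0, weights, l1=0.1, l2=0.01)
# L1 term: 0.1 * (0.5 + 0.5 + 1.0) = 0.2; L2 term: 0.01 * 1.5 = 0.015
```

In most frameworks the $L_2$ case is implemented directly in the optimizer as weight decay rather than as an explicit loss term.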
==== Sparsity-Inducing Regularizers ====
  * **Structured Sparsity**
    * [[https://www.di.ens.fr/~fbach/STS394.pdf|Bach et al 2012 - Structured Sparsity through Convex Optimization]]
    * [[https://arxiv.org/pdf/1608.03665.pdf|Wen et al 2016 - Learning Structured Sparsity in Deep Neural Networks]]
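A common structured-sparsity penalty is the group lasso: the sum of $L_2$ norms over groups of parameters, which drives entire groups (rather than individual weights) to zero. This sketch assumes grouping by rows of a weight matrix; the papers above consider more general groupings.

```python
import numpy as np

def group_lasso(W):
    """Group lasso penalty: sum over rows of the row's L2 norm.
    Unlike a plain L1 penalty, it tends to zero out whole rows at once."""
    return np.linalg.norm(W, axis=1).sum()

W = np.array([[3.0, 4.0],    # row norm 5.0
              [0.0, 0.0]])   # row norm 0.0 (an entirely pruned group)
# group_lasso(W) == 5.0
```

If rows correspond to neurons or output channels, a zeroed group removes that unit entirely, which is what makes the resulting sparsity useful for speeding up inference.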
  
==== Label Smoothing ====
  
==== Learning Rate Schedule ====
**Regularizing effect:** A large initial learning rate can have a regularizing effect, and is commonly used for training Transformers. From [[https://arxiv.org/pdf/1907.04595.pdf|Li et al 2019]]: "... a large initial learning rate is required to successfully train a deep network even though it slows down optimization of the train loss. Modern state-of-the-art architectures typically start with a large learning rate and anneal it at a point when the model’s fit to the training data plateaus [25, 32, 17, 42]. Meanwhile, models trained using only small learning rates have been found to generalize poorly despite enjoying faster optimization of the training loss."
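The regime described above, namely warm up, hold a large learning rate, then anneal, can be sketched as follows (the step counts and decay factor here are illustrative placeholders, not values from the paper):

```python
def lr_schedule(step, base_lr=1.0, warmup=100, decay_at=1000, factor=0.1):
    """Large initial learning rate held until a fixed step, then annealed.
    In practice decay_at is often chosen where the training-loss fit plateaus."""
    if step < warmup:
        return base_lr * (step + 1) / warmup   # linear warm-up
    if step < decay_at:
        return base_lr                          # large-LR phase (regularizing)
    return base_lr * factor                     # annealed phase

# lr_schedule(0) -> 0.01, lr_schedule(500) -> 1.0, lr_schedule(2000) -> 0.1
```

Training only in the small-LR regime (i.e. skipping the large-LR phase) is exactly the setting Li et al observe to generalize worse despite faster training-loss descent.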
  
</blockquote>
=== Papers ===
  * [[http://proceedings.mlr.press/v28/wang13a.pdf|Wang & Manning 2013 - Fast Dropout Training]]
  * [[https://arxiv.org/pdf/1506.02142.pdf|Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning]] and slides [[https://tensorchiefs.github.io/bbs/files/dropouts-brownbag.pdf|here]].
  * [[https://arxiv.org/pdf/2002.12915.pdf|Wei et al 2020 - The Implicit and Explicit Regularization Effects of Dropout]]
  
===== Related Pages =====
  * [[Ensembling]]
  * [[ml:theory:Generalization in Deep Learning]]