ml:regularization (current revision 2024/03/14 07:38 by jmflanig)
===== Regularization in Deep Learning =====

==== Dropout ====
  * Dropout: [[https://
  * DropConnect:
  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * LayerDrop: [[https://
  * [[https://
  * [[https://
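The dropout variants listed above all build on the same basic operation. A minimal NumPy sketch of standard (inverted) dropout, with illustrative function and parameter names:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale
    the survivors by 1/(1-p), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if not training or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((2, 4))
y = dropout(x, p=0.5, rng=np.random.default_rng(0))
# surviving entries are rescaled to 2.0, dropped entries are 0.0
```

At test time (`training=False`) the input passes through unchanged, which is why the inverted form is preferred over rescaling at inference.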
==== Early-stopping ====
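The usual patience-based stopping rule can be sketched in a few lines (names and defaults here are illustrative; real training loops also restore the best checkpoint):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop: the first epoch
    at which the best validation loss has not improved for `patience`
    consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# improvement stalls after epoch 2, so training stops 3 epochs later
early_stopping([1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.73], patience=3)  # -> 5
```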
Although L2 regularization (weight decay) wasn't popular in early deep learning models in NLP, it has become popular in pre-trained transformer models.
  * AdamW: [[https://
==== Max-Norm Regularization ====
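Max-norm regularization constrains each weight vector to lie inside a ball of radius $c$, typically by projecting after every update. A minimal NumPy sketch (the radius $c$ and per-row grouping are illustrative choices):

```python
import numpy as np

def max_norm(W, c=1.0, axis=1):
    """Max-norm constraint: rescale any weight row whose L2 norm
    exceeds c back onto the ball of radius c; rows already inside
    the ball are left untouched."""
    norms = np.linalg.norm(W, axis=axis, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.array([[3.0, 4.0], [0.1, 0.2]])
W = max_norm(W, c=1.0)
# first row (norm 5) is rescaled to norm 1; second row is unchanged
```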
==== $L_p$ Regularization ====
$L_p$ regularization adds an $L_p$-norm penalty on the weights to the loss. Popular choices of $p$ are $L_2$ (weight decay), $L_1$ (the Lasso penalty, which induces sparsity while remaining convex), $L_0$ (which counts the number of non-zero parameters and is very non-convex, making it difficult to optimize), and $L_\infty$ (max-norm regularization).
  * $L_0$ regularization
  * [[https://
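The penalties listed above can be computed in a few lines. This sketch uses the $p$-th power of the norm for finite $p > 0$, as is conventional for $L_1$ and $L_2$ penalties (the regularization strength $\lambda$ is omitted, and names are illustrative):

```python
import numpy as np

def lp_penalty(w, p):
    """L_p penalty on a weight vector: p=2 gives weight decay (squared
    L2 norm), p=1 the Lasso, p=0 counts non-zero entries, and p=inf
    penalizes only the largest-magnitude weight (max norm)."""
    a = np.abs(w)
    if p == 0:
        return float(np.count_nonzero(w))
    if np.isinf(p):
        return float(a.max())
    return float((a ** p).sum())

w = np.array([0.0, -1.0, 2.0])
lp_penalty(w, 0)       # -> 2.0  (two non-zero entries)
lp_penalty(w, 1)       # -> 3.0  (|0| + |-1| + |2|)
lp_penalty(w, 2)       # -> 5.0  (0 + 1 + 4)
lp_penalty(w, np.inf)  # -> 2.0  (largest magnitude)
```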
==== Sparsity-Inducing Regularizers ====
  * **Structured Sparsity**
    * [[https://
    * [[https:// Networks]]
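One common way to induce structured sparsity is a group-lasso ($L_{2,1}$) penalty over groups of weights such as rows, filters, or layers, which drives entire groups to exactly zero. A minimal sketch (the row grouping is an illustrative choice):

```python
import numpy as np

def group_lasso_penalty(W, axis=1):
    """Group-lasso (L_{2,1}) penalty: the sum of L2 norms of weight
    groups (here: rows). Unlike a plain L1 penalty on entries, it
    zeroes out whole groups, which hardware can actually exploit."""
    return float(np.linalg.norm(W, axis=axis).sum())

W = np.array([[3.0, 4.0], [0.0, 0.0]])
group_lasso_penalty(W)  # -> 5.0 (the zeroed second row adds nothing)
```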
==== Label Smoothing ====
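Label smoothing replaces the one-hot training target with a mixture of the one-hot vector and the uniform distribution over classes. A minimal sketch ($\epsilon = 0.1$ is a common but illustrative choice):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Label smoothing: the true class gets 1 - eps + eps/K and every
    other class gets eps/K, so each target still sums to 1."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

t = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
# -> [[0.025, 0.025, 0.925, 0.025]]
```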
</code>
=== Papers ===
  * [[http://
  * [[https://
  * [[https://
ml/regularization.1635846399.txt.gz · Last modified: 2023/06/15 07:36 (external edit)