ml:regularization [2024/03/14 07:38] (current) – jmflanig
===== Regularization in Deep Learning =====
==== Dropout ====
  * Dropout: [[https://arxiv.org/pdf/1207.0580.pdf|Hinton et al 2012 - Improving Neural Networks by Preventing Co-adaptation of Feature Detectors]]
  * DropConnect: [[http://proceedings.mlr.press/v28/wan13.pdf|Wan et al 2013 - Regularization of Neural Networks using DropConnect]]
  * [[https://nlp.stanford.edu/pubs/sidaw13fast.pdf|Wang & Manning 2013 - Fast dropout training]] Shows that dropout approximates an underlying objective, and directly optimizes a fast approximation to that objective. "We show how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective. This approximation, justified by the central limit theorem and empirical evidence, gives an order of magnitude speedup and more stability"
  * [[https://papers.nips.cc/paper/2013/file/7b5b23f4aadf9513306bcd59afb6e4c9-Paper.pdf|Adaptive Dropout]] Learns a dropout network during training ([[https://paperswithcode.com/method/adaptive-dropout|summary]])
  * [[https://arxiv.org/pdf/1606.01305.pdf|Zoneout (Krueger et al 2016)]] (For regularizing RNNs) "At each timestep, zoneout stochastically forces some hidden units to maintain their previous values"
  * [[https://arxiv.org/pdf/1909.11299.pdf|Lee et al 2019 - Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models]]
  * LayerDrop: [[https://arxiv.org/pdf/1909.11556.pdf|Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout]] Regularizes networks so that smaller networks of any depth can be extracted at test time without fine-tuning.
  * [[https://arxiv.org/pdf/2101.01761.pdf|Pham & Le 2021 - AutoDropout: Learning Dropout Patterns to Regularize Deep Networks]] Shows an improvement of 1-2 BLEU for machine translation. Downside: computationally expensive.
  * [[https://arxiv.org/pdf/2106.14448.pdf|Liang et al 2021 - R-Drop: Regularized Dropout for Neural Networks]] Works broadly, though see the GitHub issues about reproducibility.
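For concreteness, the standard (inverted) dropout mechanism underlying the papers above can be sketched in a few lines of numpy. This is an illustrative sketch, not code from any of the cited papers; the function name and signature are mine:

```python
import numpy as np

def dropout(x, p_drop=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale survivors by 1/(1 - p_drop), so the expected activation is
    unchanged and the layer is the identity at test time."""
    if not train or p_drop == 0.0:
        return x  # no-op at test time
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= p_drop  # keep with probability 1 - p_drop
    return x * mask / (1.0 - p_drop)
```

The variants above mostly change //what// is masked (weights for DropConnect, hidden-state updates for Zoneout, whole layers for LayerDrop) rather than this basic mechanism.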
  
==== Early-stopping ====
Although L2 regularization (weight decay) wasn't popular in early deep learning models in NLP, it has become popular in pre-trained transformer models.  For example, BERT and RoBERTa both use a weight decay of 0.01 ([[https://arxiv.org/pdf/1907.11692.pdf|RoBERTa paper]]).  Although weight decay in Adam isn't equivalent to L2 regularization, this can be fixed with [[https://arxiv.org/pdf/1711.05101.pdf|AdamW]] (see below).
  * AdamW: [[https://arxiv.org/pdf/1711.05101.pdf|Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization]] Decouples the weight-decay step from Adam's adaptive gradient update so that the decay regularizes as intended.  Applied to the Transformer in [[https://arxiv.org/pdf/2006.00719.pdf|Yao 2020]]
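The decoupling is easiest to see in code. Here is a minimal single-step sketch of the AdamW update (hyperparameter defaults and names are mine; see the paper for the full algorithm):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step. The decay term lr * weight_decay * w is applied
    directly to the weights rather than being folded into grad, so it is
    not rescaled by Adam's per-parameter adaptive denominator (which is
    why naive L2-in-Adam behaves differently from weight decay)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient, the step reduces to pure multiplicative shrinkage of the weights, which is exactly what "decoupled" means here.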

==== Max-Norm Regularization ====
  
==== $L_p$ Regularization ====
    * [[https://arxiv.org/pdf/1712.01312.pdf|Louizos et al 2018 - Learning Sparse Neural Networks through L0 Regularization]]
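As a quick reference, the $L_p$ penalty added to the loss is $\lambda \sum_i |w_i|^p$. A small sketch (with $\lambda$ omitted, and $p=0$ taken as the count of nonzero weights, the quantity the L0 paper relaxes):

```python
import numpy as np

def lp_penalty(w, p):
    """Return sum_i |w_i|**p: p=2 is ridge/weight decay, p=1 is the
    lasso penalty (sparsity-inducing), and the p=0 limit simply counts
    the nonzero weights."""
    w = np.asarray(w, dtype=float)
    if p == 0:
        return float(np.count_nonzero(w))
    return float(np.sum(np.abs(w) ** p))
```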
  
==== Sparsity-Inducing Regularizers ====
  * **Structured Sparsity**
    * [[https://www.di.ens.fr/~fbach/STS394.pdf|Bach et al 2012 - Structured Sparsity through Convex Optimization]]
    * [[https://arxiv.org/pdf/1608.03665.pdf|Wen et al 2016 - Learning Structured Sparsity in Deep Neural Networks]]
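A common structured-sparsity penalty in this line of work is the group lasso: a sum of //unsquared// L2 norms over predefined weight groups, which drives whole groups (e.g. all incoming weights of a neuron, or an entire filter) to zero together. A minimal sketch under an assumed flat-vector layout (the grouping shown is illustrative):

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """Sum of L2 norms over index groups: sum_g ||w[g]||_2.
    Because each group norm is unsquared, minimizing it zeroes out
    whole groups at once instead of shrinking weights individually."""
    w = np.asarray(w, dtype=float)
    return float(sum(np.linalg.norm(w[np.asarray(g)]) for g in groups))
```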
  
==== Label Smoothing ====
</blockquote>
=== Papers ===
  * [[http://proceedings.mlr.press/v28/wang13a.pdf|Wang & Manning 2013 - Fast Dropout Training]]
  * [[https://arxiv.org/pdf/1506.02142.pdf|Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning]] and slides [[https://tensorchiefs.github.io/bbs/files/dropouts-brownbag.pdf|here]].
  * [[https://arxiv.org/pdf/2002.12915.pdf|Wei et al 2020 - The Implicit and Explicit Regularization Effects of Dropout]]
ml/regularization.1635846442.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
