  * [[https://papers.nips.cc/paper/2013/file/7b5b23f4aadf9513306bcd59afb6e4c9-Paper.pdf|Adaptive Dropout]] Learns a dropout network during training ([[https://paperswithcode.com/method/adaptive-dropout|summary]])
  * [[https://arxiv.org/pdf/1606.01305.pdf|Zoneout (Krueger et al 2016)]] (for regularizing RNNs) "At each timestep, zoneout stochastically forces some hidden units to maintain their previous values"
  * [[https://arxiv.org/pdf/1909.11299.pdf|Lee et al 2019 - Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models]]
  * LayerDrop: [[https://arxiv.org/pdf/1909.11556.pdf|Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout]] Regularizes networks so that smaller networks of any depth can be extracted at test time without finetuning.
  * [[https://arxiv.org/pdf/2101.01761.pdf|Pham & Le 2021 - AutoDropout: Learning Dropout Patterns to Regularize Deep Networks]] Shows an improvement of 1-2 BLEU for machine translation. Downside: computationally expensive.
  * [[https://arxiv.org/pdf/2106.14448.pdf|Liang et al 2021 - R-Drop: Regularized Dropout for Neural Networks]] Reported to work across many tasks, but see the GitHub issues about reproducibility
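The zoneout update quoted above is a one-line change to a recurrent cell: instead of zeroing units the way dropout does, each unit keeps its previous value with some probability. A minimal NumPy sketch (the 50% rate and the 8-unit state are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout(h_prev, h_new, z=0.5):
    """Zoneout (Krueger et al 2016): at each timestep, each hidden unit
    is stochastically forced to keep its previous value with probability
    z; otherwise it takes the freshly computed value."""
    keep_prev = rng.random(h_new.shape) < z  # True -> reuse previous value
    return np.where(keep_prev, h_prev, h_new)

h_prev = np.zeros(8)  # hidden state from the previous timestep
h_new = np.ones(8)    # candidate state computed at this timestep
h = zoneout(h_prev, h_new, z=0.5)
# each entry of h is either the old value (0.0) or the new value (1.0)
```

At test time (as with dropout) the stochastic mask is replaced by its expectation, i.e. the identity is applied deterministically.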
  * $L_0$ regularization
    * [[https://arxiv.org/pdf/1712.01312.pdf|Louizos et al 2018 - Learning Sparse Neural Networks through L0 Regularization]]

==== Sparsity-Inducing Regularizers ====
  * **Structured Sparsity**
    * [[https://www.di.ens.fr/~fbach/STS394.pdf|Bach et al 2012 - Structured Sparsity through Convex Optimization]]
    * [[https://arxiv.org/pdf/1608.03665.pdf|Wen et al 2016 - Learning Structured Sparsity in Deep Neural Networks]]
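A common structured-sparsity penalty (group lasso, the building block used in work like Wen et al 2016) sums the L2 norm of each group of weights, which pushes entire groups (e.g. filters or neurons) to zero together rather than individual weights. A minimal sketch, treating each row of a weight matrix as one group (the grouping and `lam` value are illustrative):

```python
import numpy as np

def group_lasso_penalty(W, lam=1e-3):
    """Group lasso: lam * sum over groups of ||w_g||_2, where each
    group here is one row of W. Because the L2 norm is not squared,
    the penalty zeroes out whole rows at once."""
    return lam * np.sum(np.linalg.norm(W, axis=1))

W = np.array([[3.0, 4.0],   # group norm 5.0
              [0.0, 0.0]])  # group norm 0.0 -> this row is already pruned
print(group_lasso_penalty(W, lam=1.0))  # 5.0
```

In training this term is simply added to the task loss; rows driven to zero can then be removed to shrink the network.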
  
==== Label Smoothing ====
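Label smoothing replaces the one-hot training target with a mixture of the one-hot vector and the uniform distribution: $(1-\epsilon)\,\mathbf{1}_{y} + \epsilon / K$ for $K$ classes. A minimal sketch (the $\epsilon = 0.1$ value is a common illustrative default):

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Mix one-hot targets with the uniform distribution:
    (1 - eps) * one_hot + eps / num_classes."""
    one_hot = np.eye(num_classes)[targets]
    return (1.0 - eps) * one_hot + eps / num_classes

y = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
# y[0] == [0.025, 0.025, 0.925, 0.025]; each row still sums to 1
```

The smoothed targets are then used in the usual cross-entropy loss, which discourages the model from producing overconfident logits.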