Regularization in Deep Learning
Dropout
- Wang & Manning 2013 - Fast dropout training Shows that dropout training is Monte Carlo optimization of an underlying objective, and directly optimizes a fast approximation to that objective instead. “We show how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective. This approximation, justified by the central limit theorem and empirical evidence, gives an order of magnitude speedup and more stability”
- Adaptive Dropout (Ba & Frey 2013) Learns the dropout probabilities themselves, via a companion network trained alongside the model (summary)
- Zoneout (Krueger et al 2016) (For regularizing RNNs) “At each timestep, zoneout stochastically forces some hidden units to maintain their previous values”
- LayerDrop: Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout Regularizes networks to allow extraction of smaller networks of any depth at test time without needing to finetune.
- Pham & Le 2021 - AutoDropout: Learning Dropout Patterns to Regularize Deep Networks Shows an improvement of 1-2 BLEU for machine translation. Downside: computationally expensive.
- Liang et al 2021 - R-Drop: Regularized Dropout for Neural Networks Forwards each input twice with dropout and penalizes the KL divergence between the two output distributions. Reported to help almost universally, but see the GitHub issues about reproducibility
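The variants above all build on the basic mechanism, inverted dropout: at training time, zero each unit with probability $p$ and rescale the survivors by $1/(1-p)$ so no rescaling is needed at test time. A minimal numpy sketch (function and variable names here are illustrative, not from any of the papers):

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero units with probability p, rescale survivors
    by 1/(1-p) so the expected activation matches test time."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, p=0.5, rng=rng)  # entries are 0.0 or 2.0; expectation stays 1.0
```

At test time (`train=False`) the input passes through unchanged, which is exactly why the training-time rescaling is done.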
Early-stopping
Weight decay
Although L2 regularization (weight decay) wasn't popular in early deep learning models for NLP, it has become standard in pre-trained transformer models: BERT and RoBERTa both use a weight decay of 0.01 (RoBERTa paper). Note that in adaptive optimizers like Adam, adding an L2 penalty to the gradient is not equivalent to weight decay; AdamW (see below) decouples the two.
- AdamW: Loshchilov & Hutter 2017 - Decoupled Weight Decay Regularization Decouples the weight-decay step from Adam's adaptive gradient update, so decay is applied directly to the weights instead of being rescaled as part of an L2 gradient term. Applied to the Transformer in Yao 2020
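A minimal sketch of one decoupled update step, in numpy (names and defaults are illustrative): the `weight_decay * w` term is subtracted from the weights directly, outside the $\hat{m}/\sqrt{\hat{v}}$ adaptive rescaling that an L2 gradient term would pass through.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW step. The weight-decay term bypasses the adaptive
    rescaling, unlike adding an L2 term lambda*w into the gradient g."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w = np.array([1.0, -2.0])
g = np.zeros(2)  # with zero gradient, only the decay term moves the weights
w2, m, v = adamw_step(w, g, np.zeros(2), np.zeros(2), t=1)
```

With a zero gradient the update shrinks the weights multiplicatively by `lr * weight_decay`, which is the defining behavior of decoupled weight decay.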
Max-Norm Regularization
$L_p$ Regularization
$L_p$ regularization adds an $L_p$-norm penalty on the weights to the objective. Popular choices are $p = 2$ (weight decay), $p = 1$ (the Lasso penalty, which induces sparsity while remaining convex), $p = 0$ (counts the number of non-zero parameters; very non-convex and difficult to optimize), and $p = \infty$ (max-norm regularization). Adding both the $L_1$ and $L_2$ penalties to the objective is called an elastic net, and sometimes performs better than either one separately.
- $L_0$ regularization
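The penalties above can be written down in a few lines of numpy (function names and the example coefficients are illustrative):

```python
import numpy as np

def lp_penalties(w):
    """The common L_p regularization terms for a weight vector w."""
    return {
        "l2": np.sum(w ** 2),       # weight decay (squared L2 norm)
        "l1": np.sum(np.abs(w)),    # Lasso: convex, induces sparsity
        "l0": np.count_nonzero(w),  # non-convex count of non-zeros
        "linf": np.max(np.abs(w)),  # max-norm
    }

def elastic_net(w, lam1=1e-4, lam2=1e-4):
    """Elastic net: L1 and L2 penalty terms added to the objective together."""
    p = lp_penalties(w)
    return lam1 * p["l1"] + lam2 * p["l2"]

w = np.array([0.0, 3.0, -4.0])
p = lp_penalties(w)
```

Note that only the $L_1$ and $L_2$ terms are differentiable enough to drop into a gradient-based objective directly; $L_0$ requires relaxations such as the stochastic gates of the $L_0$ regularization work cited above.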
Sparsity-Inducing Regularizers
- Structured Sparsity
Label Smoothing
Introduced here: Szegedy et al 2015 - Rethinking the Inception Architecture for Computer Vision, and used in BERT.
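The mechanism from Szegedy et al 2015: replace the one-hot target with a mixture of the one-hot distribution and the uniform distribution. A short numpy sketch (names illustrative):

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Label smoothing (Szegedy et al 2015): mix (1 - eps) of the one-hot
    target with eps of the uniform distribution over classes."""
    one_hot = np.eye(num_classes)[targets]
    return (1 - eps) * one_hot + eps / num_classes

t = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
# the true class gets (1 - eps) + eps/K = 0.925, every other class eps/K = 0.025
```

Training against the smoothed targets with cross-entropy discourages the model from becoming over-confident in the correct class.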
Batch Size
Smaller batch sizes usually generalize better, and batch size is a hyperparameter worth tuning. From Wei et al 2020: “The implicit regularization effect of stochasticity in SGD has been empirically studied in the context of small v.s. large batch training Keskar et al. (2016), where it is observed that noisier small-batch SGD converges to “flatter” local minima which generalize better, whereas large-batch SGD converges “sharper” local minima which generalize more poorly.” There is also theory on this effect going back to the 1990s.
Batch Normalization
Batch normalization has been observed to have a regularizing effect.
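One proposed source of that effect is the noise injected by normalizing with per-batch statistics: each example's activations depend on which other examples share its batch. A training-mode forward pass, sketched in numpy (names illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm: normalize each feature over the batch,
    then apply a learned scale (gamma) and shift (beta). The dependence on
    batch statistics is a source of stochasticity during training."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(3.0, 2.0, size=(64, 5))
y = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
# each output feature has mean ~0 and std ~1 over the batch
```

At test time the batch statistics are replaced by running averages, so this noise (and the regularizing effect) is a training-time phenomenon.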
Learning Rate Schedule
Regularizing effect: A large initial learning rate can have a regularizing effect, and is commonly used for training Transformers. From Li et al 2019: “… a large initial learning rate is required to successfully train a deep network even though it slows down optimization of the train loss. Modern state-of-the-art architectures typically start with a large learning rate and anneal it at a point when the model’s fit to the training data plateaus [25, 32, 17, 42]. Meanwhile, models trained using only small learning rates have been found to generalize poorly despite enjoying faster optimization of the training loss.”
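A common concrete instance of the large-then-annealed pattern for Transformers is the warmup-plus-inverse-square-root schedule from Vaswani et al 2017, restated here as a sketch:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Schedule from 'Attention Is All You Need': linear warmup for
    `warmup` steps up to a large peak rate, then decay proportional
    to 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate peaks at `step == warmup` and is annealed afterward, matching the "start large, anneal when the fit plateaus" prescription quoted above.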
Theory
Dropout
From Ma et al 2017:
In their pioneering work, Hinton et al. (2012) and Srivastava et al. (2014) interpreted dropout as an extreme form of model combination (aka. model ensemble) with extensive parameter/weight sharing, and they proposed to learn the combination through minimizing an appropriate expected loss. Interestingly, they also pointed out that for a single logistic neural unit, the output of dropout is in fact the geometric mean of the outputs of the model ensemble with shared parameters. Subsequently, many theoretical justifications of dropout have been explored, and we can only mention a few here due to space limits. Building on the weight sharing perspective, Baldi & Sadowski (2013; 2014) analyzed the ensemble averaging property of dropout in deep non-linear logistic networks, and supported the view that dropout is equivalent to applying stochastic gradient descent on some regularized loss function. Wager et al. (2013) treated dropout as an adaptive regularizer for generalized linear models (GLMs). Helmbold & Long (2016) discussed the differences between dropout and traditional weight decay regularization. In terms of statistical learning theory, Gao & Zhou (2014) studied the Rademacher complexity of different types of dropout, showing that dropout is able to reduce the Rademacher complexity polynomially for shallow neural networks (with one or no hidden layers) and exponentially for deep neural networks. This latter work (Gao & Zhou, 2014) formally demonstrated that dropout, due to its regularizing effect, contributes to reducing the inherent model complexity, in particular the variance component in the generalization error.
Seen as a model combination technique, it is intuitive that dropout contributes to reducing the variance of the model performance. Surprisingly, dropout has also been shown to play some role in reducing the model bias. For instance, Jain et al. (2015) studied the ability of dropout training to escape local minima, hence leading to reduced model bias. Other studies (Chen et al., 2014; Helmbold & Long, 2014; Wager et al., 2014) focus on the effect of the dropout noise on models with shallow architectures. We noted in passing that there are also some work (Kingma et al., 2015; Gal & Ghahramani, 2015; 2016) trying to understand dropout from the Bayesian perspective.
Papers
Learning Rate Schedule
L2 Regularization
- L2 Regularization is almost equivalent to early-stopping. See slides 18-21 here (2020 version).
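A sketch of the standard argument behind this (see e.g. the Deep Learning book, §7.8): take a quadratic approximation of the loss around the optimum $w^*$ with Hessian $H = Q \Lambda Q^\top$. Gradient descent from $w_0 = 0$ with step size $\epsilon$ gives, after $\tau$ steps,

$$Q^\top w_\tau = \left[I - (I - \epsilon\Lambda)^\tau\right] Q^\top w^*,$$

while the $L_2$-regularized solution with coefficient $\alpha$ is

$$Q^\top \tilde{w} = \left[I - \alpha(\Lambda + \alpha I)^{-1}\right] Q^\top w^*.$$

The two agree when $(I - \epsilon\Lambda)^\tau \approx \alpha(\Lambda + \alpha I)^{-1}$, which for small $\epsilon\lambda_i$ yields the correspondence $\tau \approx 1/(\epsilon\alpha)$: stopping earlier behaves like a larger $L_2$ coefficient.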