
Regularization

Regularization in Deep Learning

Dropout

Early-stopping

Weight decay

Although L2 regularization (weight decay) was not popular in early deep learning models in NLP, it has become standard in pre-trained transformer models. For example, BERT and RoBERTa both use a weight decay of 0.01 (RoBERTa paper). Note that Adam with an L2 penalty added to the gradient is not equivalent to true weight decay; this is fixed by decoupling the decay from the adaptive update, as in AdamW (see below).
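The difference between the two can be sketched in NumPy. This is a minimal single-step sketch (bias-corrected Adam, one parameter vector); the hyperparameter values are illustrative, not the ones used by BERT or RoBERTa:

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 added to the gradient: the decay term passes through
    the adaptive denominator, so it is NOT plain weight decay."""
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the decay is decoupled and applied directly to the weights."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    return w - lr * (mhat / (np.sqrt(vhat) + eps) + wd * w), m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
m = v = np.zeros(2)
w_l2, _, _ = adam_l2_step(w, g, m, v, t=1)
w_dw, _, _ = adamw_step(w, g, m, v, t=1)
# The two updates differ: Adam's normalization rescales the L2 term,
# so larger-gradient parameters are effectively decayed less.
```

In practice one would simply use `torch.optim.AdamW` with `weight_decay=0.01` rather than hand-rolling the update.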

Max-Norm Regularization

$L_p$ Regularization

$L_p$ regularization is regularization with an $L_p$ norm penalty term. Popular choices of $p$ are $L_2$ (weight decay), $L_1$ (Lasso, which induces sparsity while remaining convex), $L_0$ (which counts the number of non-zero parameters and is very non-convex and difficult to optimize), and $L_\infty$ (max-norm regularization). Combining $L_1$ and $L_2$ regularization (adding both penalty terms to the objective) is called the elastic net, which sometimes performs better than either one separately.
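The penalties above are easy to compute directly. A small sketch with an illustrative weight vector and made-up regularization strengths `lam1`, `lam2`:

```python
import numpy as np

w = np.array([0.0, -2.0, 0.5, 0.0, 1.5])  # illustrative weight vector

l2 = np.sum(w ** 2)        # (squared) L2 penalty, i.e. weight decay
l1 = np.sum(np.abs(w))     # L1 penalty (Lasso), encourages exact zeros
l0 = np.count_nonzero(w)   # L0 "norm": number of non-zero parameters
linf = np.max(np.abs(w))   # L-infinity norm, used in max-norm constraints

# Elastic net: add both the L1 and L2 terms to the training objective.
lam1, lam2 = 0.01, 0.001
elastic_net_penalty = lam1 * l1 + lam2 * l2
```

Each penalty would be added to the training loss, scaled by its regularization coefficient.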

Sparsity-Inducing Regularizers

Label Smoothing

Batch Size

Smaller batch sizes usually generalize better, and batch size is a hyperparameter that can be tuned. From Wei et al. 2020: “The implicit regularization effect of stochasticity in SGD has been empirically studied in the context of small v.s. large batch training Keskar et al. (2016), where it is observed that noisier small-batch SGD converges to “flatter” local minima which generalize better, whereas large-batch SGD converges [to] “sharper” local minima which generalize more poorly.” There is also theory about this going back to the 1990s.

Batch Normalization

Batch normalization has been observed to have a regularizing effect: the normalization statistics are computed per mini-batch, so each example's activations depend on the other examples in its batch, which injects noise during training.

Learning Rate Schedule

Regularizing effect: A large initial learning rate can have a regularizing effect, and is commonly used for training Transformers. From Li et al. 2019: “… a large initial learning rate is required to successfully train a deep network even though it slows down optimization of the train loss. Modern state-of-the-art architectures typically start with a large learning rate and anneal it at a point when the model’s fit to the training data plateaus [25, 32, 17, 42]. Meanwhile, models trained using only small learning rates have been found to generalize poorly despite enjoying faster optimization of the training loss.”
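One common concrete instance of "large then annealed" for Transformers is the warmup-plus-inverse-square-root schedule from the original Transformer paper (Vaswani et al. 2017); the `d_model` and `warmup` values below are illustrative defaults, not tuned settings:

```python
def lr_schedule(step, d_model=512, warmup=4000):
    """'Noam' schedule: linear warmup to a large peak learning rate at
    step == warmup, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The learning rate rises during warmup, peaks at `warmup` steps, then anneals, so most of training is spent at a comparatively large, slowly decaying rate.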

Theory

Dropout

From Ma et al. 2017:

In their pioneering work, Hinton et al. (2012) and Srivastava et al. (2014) interpreted dropout as an extreme form of model combination (aka. model ensemble) with extensive parameter/weight sharing, and they proposed to learn the combination through minimizing an appropriate expected loss. Interestingly, they also pointed out that for a single logistic neural unit, the output of dropout is in fact the geometric mean of the outputs of the model ensemble with shared parameters. Subsequently, many theoretical justifications of dropout have been explored, and we can only mention a few here due to space limits. Building on the weight sharing perspective, Baldi & Sadowski (2013; 2014) analyzed the ensemble averaging property of dropout in deep non-linear logistic networks, and supported the view that dropout is equivalent to applying stochastic gradient descent on some regularized loss function. Wager et al. (2013) treated dropout as an adaptive regularizer for generalized linear models (GLMs). Helmbold & Long (2016) discussed the differences between dropout and traditional weight decay regularization. In terms of statistical learning theory, Gao & Zhou (2014) studied the Rademacher complexity of different types of dropout, showing that dropout is able to reduce the Rademacher complexity polynomially for shallow neural networks (with one or no hidden layers) and exponentially for deep neural networks. This latter work (Gao & Zhou, 2014) formally demonstrated that dropout, due to its regularizing effect, contributes to reducing the inherent model complexity, in particular the variance component in the generalization error.
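The geometric-mean property mentioned above can be checked numerically: for a single logistic unit with dropout on its inputs, the normalized geometric mean of the outputs over all dropout masks equals a single forward pass with the weights scaled by the keep probability. A small sketch with arbitrary random weights (the specific values are illustrative):

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)   # inputs to a single logistic unit
w = rng.normal(size=4)   # weights (illustrative values)
b = 0.3                  # bias
p_keep = 0.5             # each input is kept independently with prob 0.5

# Enumerate all 2^4 dropout masks and record the unit's output for each.
probs = np.array([
    sigmoid(w @ (np.array(mask) * x) + b)
    for mask in itertools.product([0, 1], repeat=4)
])

# Normalized geometric mean of the ensemble's outputs.
g1 = np.prod(probs) ** (1 / len(probs))          # geom. mean of p
g0 = np.prod(1 - probs) ** (1 / len(probs))      # geom. mean of 1 - p
norm_geo_mean = g1 / (g1 + g0)

# Equals one forward pass with weights scaled by p_keep ("weight scaling").
single_pass = sigmoid(p_keep * (w @ x) + b)
```

This is exactly the weight-scaling rule used at test time: dividing the log-odds sum over masks by the number of masks averages the logits, which for uniform masks scales each weight by the keep probability.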

Seen as a model combination technique, it is intuitive that dropout contributes to reducing the variance of the model performance. Surprisingly, dropout has also been shown to play some role in reducing the model bias. For instance, Jain et al. (2015) studied the ability of dropout training to escape local minima, hence leading to reduced model bias. Other studies (Chen et al., 2014; Helmbold & Long, 2014; Wager et al., 2014) focus on the effect of the dropout noise on models with shallow architectures. We noted in passing that there are also some work (Kingma et al., 2015; Gal & Ghahramani, 2015; 2016) trying to understand dropout from the Bayesian perspective.

Papers

Learning Rate Schedule

L2 Regularization

  • L2 Regularization is almost equivalent to early-stopping. See slides 18-21 here (2020 version).
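A sketch of the standard argument behind this equivalence (following, e.g., Goodfellow et al., *Deep Learning*, §7.8), for a quadratic approximation of the loss around the optimum $w^*$ with Hessian $H$:

```latex
% Quadratic approximation of the loss:
L(w) \approx L(w^*) + \tfrac{1}{2} (w - w^*)^\top H (w - w^*)

% Gradient descent with step size \eta, started from w_0 = 0, after t steps:
w_t = \left( I - (I - \eta H)^t \right) w^*

% L2-regularized (ridge) solution with penalty \tfrac{\lambda}{2} \|w\|^2:
\hat{w} = (H + \lambda I)^{-1} H \, w^*

% These coincide when (I - \eta H)^t = \lambda (H + \lambda I)^{-1};
% per eigenvalue h_i of H, taking logs and assuming \eta h_i and
% h_i / \lambda are small gives  t \eta h_i \approx h_i / \lambda,  i.e.
\lambda \approx \frac{1}{\eta t}
```

So stopping earlier (smaller $t$) behaves like a larger L2 penalty, and training longer like a smaller one.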
ml/regularization.txt · Last modified: 2024/03/14 07:38 by jmflanig
