Normalization
Batch Normalization
Batch normalization is popular in computer vision, but it is rarely used in NLP because it performs poorly there: batch statistics fluctuate heavily across minibatches of variable-length text. Layer normalization is usually used instead (see Shen et al. 2020, PowerNorm: Rethinking Batch Normalization in Transformers).
- Issues with RNNs: activation statistics differ at each timestep, so batch normalization needs separate statistics per timestep and copes badly with variable-length sequences.
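A minimal NumPy sketch (not from the original notes) of the batch-norm forward pass, showing the cross-example dependency the points above refer to: the mean and variance are computed across the batch dimension, one pair per feature.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are taken ACROSS the batch,
    # one mean/variance per feature -- each example's output depends
    # on the other examples in the minibatch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma/beta are the learned per-feature scale and shift.
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# Each feature column of y now has mean ~0 and std ~1.
```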
Layer Normalization
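A matching NumPy sketch (again an illustration, not from the original notes): layer normalization computes statistics per example across the feature dimension, so each example is normalized independently of the rest of the batch, which is why it works for variable-length sequences and RNNs where batch norm does not.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken per example, over the last (feature) axis;
    # no dependency between examples in the minibatch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# Each row of y now has mean ~0 and std ~1.
```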
Other Normalization Schemes
- Weight Normalization, an alternative to batch normalization. Salimans & Kingma 2016, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks: "…improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited." See section 2.2, Relation to batch normalization.
- RMSNorm (Zhang & Sennrich 2019). A simplification of layer normalization that drops the mean-centering step and rescales only by the root mean square of the activations. Computationally cheaper, while keeping the useful re-scaling invariance. Confirmed by Narang et al. 2021 to work well.
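The two schemes above can be sketched in a few lines of NumPy (my illustration, with hypothetical function names, assuming the standard formulations: weight norm reparameterizes w = g · v/‖v‖, RMSNorm divides by the root mean square instead of centering and dividing by the standard deviation):

```python
import numpy as np

def weight_norm(v, g):
    # Salimans & Kingma 2016: decouple the direction (v) from the
    # magnitude (g) of a weight vector: w = g * v / ||v||.
    return g * v / np.linalg.norm(v)

def rms_norm(x, gamma, eps=1e-8):
    # Zhang & Sennrich 2019: like layer norm but without the
    # mean-subtraction; rescale by the RMS of the activations only.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

w = weight_norm(np.array([3.0, 4.0]), g=2.0)   # ||w|| == g == 2
z = rms_norm(np.array([[1.0, 2.0, 2.0]]), gamma=np.ones(3))
# Each row of z now has RMS ~1, but is generally NOT zero-mean.
```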