====== Normalization ======

Normalization can improve the optimizer's ability to train a neural network. There are two main categories of normalization procedures: activation normalization and weight normalization ([[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020]]).

===== Overviews =====

  * Blog post: [[https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8|2019 - Normalization Techniques in Deep Neural Networks]]

===== Activation Normalization Schemes =====

==== Batch Normalization ====

Batch normalization is popular in computer vision, but it is rarely used in NLP: batch statistics are unstable for variable-length sequence data, so layer normalization is usually used instead (see [[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020]]).

  * [[https://arxiv.org/pdf/1502.03167.pdf|Ioffe & Szegedy 2015 - Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift]]
  * Issues with RNNs:
    * [[https://arxiv.org/pdf/1510.01378.pdf|Laurent et al 2015 - Batch Normalized Recurrent Neural Networks]]
    * [[https://arxiv.org/pdf/1603.09025.pdf|Cooijmans et al 2016 - Recurrent Batch Normalization]]
  * [[https://arxiv.org/pdf/1806.02375.pdf|Bjorck et al 2018 - Understanding Batch Normalization]]. See also section 3 of [[https://arxiv.org/pdf/2002.10444.pdf|De & Smith 2020 - Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks]] for a different perspective.
  * [[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020 - PowerNorm: Rethinking Batch Normalization in Transformers]]

==== Layer Normalization ====

  * [[https://arxiv.org/pdf/1607.06450.pdf|Layer Normalization]]
  * [[https://arxiv.org/pdf/1910.07467.pdf|RMSNorm]]. An improvement to layer normalization: it is computationally cheaper and has improved invariance properties. Shown to work well for Transformers by [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]].
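The key difference between these activation-normalization schemes is the axis and statistics used. The following is a minimal NumPy sketch for a 2D (batch, features) input; the learnable gain/bias parameters that the real layers carry are omitted for clarity, and the function names are my own:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Batch norm: normalize each feature using statistics
    # computed across the batch dimension (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Layer norm: normalize each example using statistics
    # computed across its own feature dimension (axis -1),
    # so there is no dependence on other examples in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: like layer norm but skips the mean subtraction,
    # rescaling only by the root-mean-square of the features.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms
```

Dropping the mean subtraction is what makes RMSNorm cheaper than layer normalization: one fewer reduction per example, at the cost of only re-scaling (not re-centering) invariance.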
===== Weight Normalization Schemes =====

==== Weight Normalization ====

Weight normalization is billed as an alternative to batch normalization.

  * [[https://arxiv.org/pdf/1602.07868.pdf|Salimans & Kingma 2016 - Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks]]: "...improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited." See section 2.2, "Relation to batch normalization".

===== Other or Uncategorized Schemes =====

===== Related Pages =====

  * [[NN Training#Training Setups in the Literature]]
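The reparameterization from Salimans & Kingma 2016, w = g · v/‖v‖, can be sketched in a few lines of NumPy. This is an illustration of the formula only, not the paper's implementation; the variable names and the per-output-row layout are my own assumptions:

```python
import numpy as np

def weight_norm(v, g):
    # Weight normalization: w = g * v / ||v|| per output unit.
    #   v : (out_features, in_features) unconstrained direction parameters
    #   g : (out_features,) learned scale, one scalar per output unit
    # Training optimizes v and g instead of w directly, decoupling the
    # scale of each weight vector from its direction.
    norm = np.linalg.norm(v, axis=1, keepdims=True)  # per-row Euclidean norm
    return g[:, None] * v / norm
```

After this reparameterization, each output unit's weight vector has norm exactly g regardless of v, which is what improves the conditioning of the optimization problem.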