====== Neural Network Initialization ======

===== Overviews =====

  * Blog post: [[https://pouannes.github.io/blog/initialization/|How to initialize deep neural networks? Xavier and Kaiming initialization]]
  * Section 8.4 in [[https://www.deeplearningbook.org/contents/optimization.html|Deep Learning Book Ch 8]]
  * Initialization section in [[https://ucsc.primo.exlibrisgroup.com/permalink/01CDL_SCR_INST/1kt68tt/alma991025070453104876|Chapter 11 of Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow (UCSC login required)]]
  * [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/nn-training.pdf#page=9|NLP 202 Winter 2022 slides]]

===== Papers =====

  * Glorot (Xavier) initialization: [[http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf|Glorot & Bengio 2010 - Understanding the Difficulty of Training Deep Feedforward Neural Networks]] (use with sigmoid activations)
    * Intuition: with random initialization, a neuron with many incoming connections receives contributions that tend to cancel, so it will not saturate. BUT, in a deep network, if the variance of the activations grows from layer to layer, you'll have a problem: neurons in the higher layers will saturate.
  * He initialization: [[https://arxiv.org/pdf/1502.01852v1.pdf|He et al 2015]] (use with ReLU activations)
  * ADMIN: [[https://arxiv.org/pdf/2004.08249.pdf|Liu et al 2020 - Understanding the Difficulty of Training Transformers]]
    * [[https://arxiv.org/pdf/2008.07772.pdf|Liu et al 2020 - Very Deep Transformers for Neural Machine Translation]]: ADMIN used to train very deep Transformers
  * SkipInit: [[https://arxiv.org/pdf/2002.10444.pdf|De & Smith 2020 - Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks]]. Very cool paper. An initialization strategy for deep residual networks that is billed as an alternative to batch normalization. Related to [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]].
    * From the ReZero paper: "The authors find that in deep ResNets without BatchNorm, a scalar multiplier is needed to ensure convergence."

===== Software Defaults =====

  * PyTorch 1.0 uses He initialization for most layers, such as Linear, RNN, Conv2d, etc. (see [[https://discuss.pytorch.org/t/whats-the-default-initialization-methods-for-layers/3157/20|this post]])

===== Resources =====

  * [[https://stats.stackexchange.com/questions/319323/whats-the-difference-between-variance-scaling-initializer-and-xavier-initialize]]
  * [[https://cs230.stanford.edu/section/4/|Stanford Xavier Initialization]]
  * Blog post about Glorot and He: [[https://pouannes.github.io/blog/initialization/|How to initialize deep neural networks? Xavier and Kaiming initialization]] - it has some good math derivations for the methods

===== Related Pages =====

  * [[NN Training]]
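===== Code Sketch =====

A minimal NumPy sketch of the variance-propagation argument behind the Glorot and He papers above. This is not code from either paper; the function names and the depth/width settings are my own. It initializes a stack of linear layers two ways and tracks the mean squared activation (the quantity the He derivation follows) through the depth of the network:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Glorot & Bengio 2010: Var(W) = 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # He et al 2015: Var(W) = 2 / fan_in, which compensates for ReLU
    # zeroing out (on average) half of each layer's pre-activations
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def relu(x):
    return np.maximum(x, 0.0)

def forward_second_moment(init_fn, activation, depth=20, width=256, seed=0):
    """Push a standard-normal batch through `depth` layers and return
    the mean squared activation at the top layer."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1024, width))
    for _ in range(depth):
        x = activation(x @ init_fn(width, width, rng))
    return float(np.mean(x ** 2))

if __name__ == "__main__":
    # Under ReLU, He init keeps the signal scale roughly constant with
    # depth, while Xavier init shrinks it by about a factor of 2 per
    # layer - the saturation/vanishing problem described above.
    print("He + ReLU:    ", forward_second_moment(he_init, relu))
    print("Xavier + ReLU:", forward_second_moment(xavier_init, relu))
```

Swapping in a sigmoid activation (Xavier's intended setting) instead of ReLU removes the factor-of-2 loss, which is why the two schemes are each matched to their activation.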