Neural Network Initialization
- Glorot (Xavier) initialization: Glorot & Bengio 2010 - Understanding the Difficulty of Training Deep Feedforward Neural Networks (use with sigmoid activations)
- Intuition: with random initialization, a neuron with many incoming connections receives contributions that tend to cancel, so it will not saturate. But in a deep network, if the variance of the activations grows from layer to layer, neurons in the higher layers will saturate.
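A minimal NumPy sketch of the variance argument above (the layer sizes and tanh network are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio 2010: W ~ U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), giving Var(W) = 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Push a unit-variance batch through several tanh layers; with this scaling
# the activation variance stays in a sane range instead of exploding or
# collapsing as depth grows.
x = rng.standard_normal((1000, 256))
for _ in range(5):
    x = np.tanh(x @ glorot_uniform(256, 256))
```

Repeating the loop with a much larger or smaller weight scale makes the activations saturate or vanish, which is exactly the failure mode described above.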
- He initialization: He et al 2015 - Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (use with ReLU activations)
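A NumPy sketch of the He scheme, assuming an illustrative stack of ReLU layers (not a setup from the paper). Because ReLU zeroes half the pre-activations, He et al scale the weight variance to 2/fan_in so the mean squared activation stays roughly constant with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He et al 2015: W ~ N(0, 2 / fan_in), compensating for ReLU
    # discarding the negative half of each pre-activation.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Ten ReLU layers: the mean squared activation stays near its initial value
# rather than halving at every layer (as it would with Var(W) = 1 / fan_in).
x = rng.standard_normal((1000, 512))
for _ in range(10):
    x = np.maximum(0.0, x @ he_normal(512, 512))
```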
- ADMIN initialization: Liu et al 2020 - Very Deep Transformers for Neural Machine Translation (ADMIN is used to train very deep Transformers)
Software Defaults
- PyTorch 1.0 uses He initialization by default for most layers, such as Linear, RNN, and Conv2d (see this post)
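If you want He initialization applied explicitly rather than relying on the layer defaults, a minimal sketch in PyTorch (the model architecture here is an arbitrary example):

```python
import torch.nn as nn

def init_he(module):
    # Explicit He (Kaiming) initialization via torch.nn.init; note that the
    # Linear/Conv2d default is kaiming_uniform_ with a=sqrt(5), which is a
    # related but not identical scheme.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_he)  # Module.apply recurses over all submodules
```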
Resources
- Blog post about Glorot and He initialization: How to initialize deep neural networks? Xavier and Kaiming initialization. It has good mathematical derivations for both methods.