Neural Network Initialization

Overviews

Glorot (Xavier) initialization: Glorot & Bengio 2010 - Understanding the Difficulty of Training Deep Feedforward Neural Networks (use with sigmoid activations)
- Intution: if you initialize the network randomly, then for a given neuron, if there are a lot of incoming connections, then they will tend to cancel and not saturate the neuron. BUT, if the network is deep, then if variance of the neurons goes up as you go to higher layers, then you'll have a problem because they will saturate as you go higher
He initialization: He et al 2015 (use with ReLu activations)
ADMIN: Liu et al 2020 - Understanding the Difficulty of Training Transformers
- Liu et al 2020 - Very Deep Transformers for Neural Machine Translation ADMIN used to train very deep Transformers
SkipInit: De & Smith 2020 - Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks Very cool paper. An initialization strategy for deep residual networks that is billed as an alternative to batch normalization. Related to ReZero. From the ReZero paper: “The authors find that in deep ResNets without BatchNorm, a scalar multiplier is needed to ensure convergence.”

PyTorch 1.0 uses He initialization for most layers such as Linear, RNN, Conv2d, etc (see this post)