Neural Network Training

Overviews

Topics

Training Setups in the Literature

Training setups have evolved over time. Here are some from the literature.

Each entry lists, where recorded: architecture; optimizer and its hyperparameters; initialization; normalization; regularizer; learning schedule; stopping criterion; activation function; tokenization; and extras.

Bahdanau et al. 2014
  Architecture: seq2seq BiLSTM + attention
  Optimizer: Adadelta + gradient clipping (on norm; see the sketch below)
  Optimizer hyperparameters: ε = 1e-6, ρ = 0.95, clipping threshold = 1
  Initialization: random orthogonal + Gaussian
  Normalization: none
  Regularizer: none
  Learning schedule: none (no learning rate; set by Adadelta)
  Stopping criterion: 5 days of training
  Activation function: LSTM & tanh
  Tokenization: Moses
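
Gradient clipping "on the norm" rescales the whole gradient when its global L2 norm exceeds a threshold, instead of clipping each coordinate separately. A minimal NumPy sketch (the function name and default threshold are illustrative); Bahdanau et al.'s threshold of 1 and Ma & Hovy's threshold of 5 both plug in as max_norm:

    import numpy as np

    def clip_by_global_norm(grads, max_norm=1.0):
        # Global L2 norm across all parameter gradients.
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        # One shared rescaling factor keeps the gradient's direction intact.
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads
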
Ma & Hovy 2016
  Architecture: BiLSTM + CNN word embeddings
  Optimizer: SGD with momentum + gradient clipping
  Optimizer hyperparameters: momentum = 0.9, clipping threshold = 5
  Initialization: GloVe & uniform [-sqrt(3/dim), +sqrt(3/dim)] (see the sketch below)
  Regularizer: dropout (0.5)
  Learning schedule: rate decay
  Stopping criterion: early stopping
  Activation function: sigmoid & tanh
  Tokenization: Tokens
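
The uniform range ±sqrt(3/dim) is chosen so each weight has variance 1/dim (the variance of U[-b, b] is b^2/3). A sketch, assuming dim means the layer's fan-in:

    import numpy as np

    def uniform_init(fan_in, fan_out):
        # U[-b, b] with b = sqrt(3/dim) has variance b^2 / 3 = 1 / dim.
        bound = np.sqrt(3.0 / fan_in)  # reading dim as fan-in is an assumption
        return np.random.uniform(-bound, bound, size=(fan_in, fan_out))
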
Gehring et al. 2017
  Architecture: CNN seq2seq
  Optimizer: Nesterov accelerated gradient + gradient clipping (on norm; see the sketch below)
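
Nesterov's accelerated gradient differs from classical momentum by evaluating the gradient at a look-ahead point. A sketch of one update; the learning rate and momentum defaults are illustrative, not values taken from Gehring et al.:

    def nesterov_step(w, v, grad_fn, lr=0.25, momentum=0.99):
        # The gradient is taken at the look-ahead position w + momentum * v,
        # which is what distinguishes Nesterov from plain momentum.
        g = grad_fn(w + momentum * v)
        v = momentum * v - lr * g
        return w + v, v
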
Vaswani et al. 2017
  Architecture: Transformer
  Optimizer: Adam
  Optimizer hyperparameters: β1 = 0.9, β2 = 0.98, ε = 1e-9
  Initialization: Glorot/fan_avg
  Normalization: layer normalization
  Regularizer: dropout, label smoothing
  Learning schedule: linear warm-up + 1/sqrt(step_number) decay (see the sketch below)
  Stopping criterion: 100,000 steps
  Activation function: ReLU
  Tokenization: WordPiece
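
The Transformer schedule raises the learning rate linearly over the warm-up steps, then decays it proportionally to 1/sqrt(step). The defaults below (d_model = 512, warmup_steps = 4000) are the paper's base-model values:

    def transformer_lr(step, d_model=512, warmup_steps=4000):
        step = max(step, 1)  # the formula is undefined at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
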
BERT (Devlin et al. 2019)
  Architecture: Transformer
  Optimizer: Adam
  Optimizer hyperparameters: β1 = 0.9, β2 = 0.999, ε = 1e-6
  Initialization: (Glorot/fan_avg?)
  Normalization: layer normalization
  Regularizer: L2 weight decay of 0.01
  Learning schedule: linear warm-up + linear decay (see the sketch below)
  Stopping criterion: 1,000,000 steps
  Activation function: GELU
  Tokenization: WordPiece
  Extras: trains on shorter sequences first
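
BERT's schedule warms up linearly to a peak rate, then decays linearly for the rest of training. The peak rate of 1e-4 and 10,000 warm-up steps are the paper's pre-training values; decaying all the way to zero is an assumption about what "linear decay" means here:

    def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
        if step < warmup_steps:
            return peak_lr * step / warmup_steps  # linear warm-up
        # Linear decay from peak_lr at the end of warm-up to 0 at total_steps.
        return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))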

Alternative Training Methods

See Neural Networks: Alternative Training Methods