Training setups have evolved considerably over time. The table below collects representative configurations from the literature; code sketches of a few recurring pieces (gradient-norm clipping, embedding initialization, the Transformer warm-up schedule) follow the table.
| Paper | Architecture | Optimizer | Optimizer Hyperparameters | Initialization | Normalization | Regularizer | Learning Schedule | Stopping Criterion | Activation Function | Tokenization | Extras |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bahdanau et al. 2014 | Seq2seq BiGRU + attention | Adadelta + gradient clipping (on norm) | ε = 1e-6, ρ = 0.95, gradient clipping = 1 | Random orthogonal + Gaussian | none | none | (no learning rate; set by Adadelta) | ~5 days | tanh (sigmoid gates) | Moses | |
| Ma & Hovy 2016 | BiLSTM +CNN word embds | SGD w/ momentum + gradient clipping | momentum = .9, gradient clipping = 5 | GloVe & Uniform [-sqrt(3/dim),sqrt(3/dim)] | Dropout (.5) | Rate decay | Early-stopping | Sigmoid & Tahn | Tokens | ||
| Gehring et al. 2017 | CNN seq2seq (ConvS2S) | Nesterov + gradient clipping (on norm) | | | | | | | | | |
| Vaswani et al. 2017 | Transformer | Adam | β1 = 0.9, β2 = 0.98, ε = 1e-9 | Glorot/fan_avg | Layer normalization | Dropout (0.1), label smoothing (0.1) | Linear warm-up (4,000 steps) + 1/sqrt(step) decay (sketch below) | 100,000 steps | ReLU | BPE (En-De) / WordPiece (En-Fr) | |
| BERT (Devlin et al. 2019) | Transformer | Adam | β1 = 0.9, β2 = 0.999, ε = 1e-6 | (Glorot/fan_avg?) | Layer normalization | Dropout (0.1), L2 weight decay of 0.01 | Linear warm-up (10,000 steps) + linear decay | 1,000,000 steps | GELU | WordPiece | Pretrains at sequence length 128 for the first 90% of steps, then 512 |
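
Gradient clipping on the norm shows up in most of the recurrent-era rows (Bahdanau clips at 1, Ma & Hovy at 5). A minimal sketch of the clip-then-step pattern, written in PyTorch for concreteness (the original papers used other frameworks, and the model and batch here are toy stand-ins, not anything from the papers):

```python
import torch

# Toy model and batch; only the clip-then-step pattern matters here.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6)

x = torch.randn(32, 10)
loss = model(x).pow(2).mean()
loss.backward()

# Rescale gradients so their global L2 norm is at most 1,
# as in Bahdanau et al. 2014 (Ma & Hovy 2016 use a threshold of 5).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```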
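Ma & Hovy's uniform range [-sqrt(3/dim), sqrt(3/dim)] is not arbitrary: a uniform distribution on [-a, a] has variance a²/3, so this choice gives each embedding entry variance exactly 1/dim. A sketch, with the vocabulary size and dimension picked for illustration:

```python
import math
import torch

dim = 30                      # embedding size; value here is illustrative
bound = math.sqrt(3.0 / dim)  # Var(U[-a, a]) = a**2 / 3, so variance = 1/dim

emb = torch.nn.Embedding(num_embeddings=100, embedding_dim=dim)
torch.nn.init.uniform_(emb.weight, -bound, bound)
```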
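The Vaswani et al. schedule has the closed form lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5)): the learning rate grows linearly for the first `warmup` steps (4,000 in the paper) and then decays proportionally to 1/sqrt(step). A sketch wiring it into PyTorch's `LambdaLR`; the base learning rate is set to 1.0 so the lambda's return value is the actual rate (again PyTorch is an assumption for illustration, not the paper's implementation):

```python
import torch

def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """lrate = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(10, 10)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, transformer_lr)

for step in range(5):  # training loop elided; only the schedule is shown
    optimizer.step()
    scheduler.step()
```

BERT's schedule in the last row swaps the 1/sqrt(step) decay for a linear decay to zero over the 1,000,000 steps, but the warm-up idea is the same.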