
Neural Network Training


Training Setups in the Literature

Training setups have evolved over time. Here are some from the literature.

  • Bahdanau 2014: Minibatch stochastic gradient descent (SGD) with Adadelta, trained for 5 days. Recurrent weight matrices were initialized as random orthogonal matrices; feedforward weight matrices were initialized by sampling each element from a Gaussian distribution with mean 0 and small variance (0.01² in the paper); biases were initialized to 0. (A sketch of this initialization scheme follows this list.)
  • Ma & Hovy 2016: Minibatch SGD (batch size 10) with momentum (.9), gradient clipping (5.0) and learning rate decay (this setup performed similarly to Adam for them). Early-stopping on validation set. Dropout with rate = .5.
  • Gehring et al 2017 - Convolutional Sequence to Sequence Learning. They used a lot of training tricks; worth taking a look.
  • Transformer (Vaswani et al 2017): Uses Adam with warmup, residual dropout, and label smoothing. Adam parameters β1 = 0.9, β2 = 0.98, and epsilon = 10^-9. The schedule increases "the learning rate linearly for the first warmup_steps training steps" and then decreases it "proportionally to the inverse square root of the step number." Trained for 100,000 steps (12 hours). (See the schedule and label-smoothing sketches after this list.)
  • BERT (Devlin et al 2019): Uses Adam with linear warmup and a linearly decaying learning rate. Regularization: dropout and weight decay (L2 regularizer). Does not use label smoothing like the Transformer. "BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, epsilon = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens." (from Liu et al 2020). A search over hyperparameters for training BERT is given in You et al 2020. As noted in Press 2020, BERT trains on shorter sequences (128 tokens) for the first 90% of training and on longer sequences (512 tokens) for the last 10%. (A sketch of the warmup-then-linear-decay schedule follows this list.)
  • TODO: GPT-1 (p. 5)
  • Low-resource NMT system: Sennrich & Zhang 2019 - Revisiting Low-Resource Neural Machine Translation: A Case Study. Uses a BiDeep RNN, label smoothing, different dropout rates for output word embeddings, input word embeddings, and hidden layers, tied embeddings, layer normalization, and a BPE vocabulary size tuned down from larger-data scenarios. Trained with Adam, with early stopping on a dev set using BLEU. (See the label-smoothing sketch after this list.)
  • TODO: BART
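
Below is a minimal numpy sketch of the Bahdanau-style initialization from the first bullet (orthogonal recurrent matrices, small-variance Gaussian feedforward matrices, zero biases). The layer sizes are illustrative assumptions, not values from this page.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Random n x n orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the columns are uniformly distributed

hidden_size, input_size = 1000, 620   # illustrative sizes (assumption)

W_rec = random_orthogonal(hidden_size)                         # recurrent weights: orthogonal
W_ff  = rng.normal(0.0, 0.01, size=(hidden_size, input_size))  # feedforward weights: N(0, 0.01^2)
b     = np.zeros(hidden_size)                                  # biases start at zero
```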
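
Gradient clipping on the norm, as in the Ma & Hovy and Gehring setups, rescales the gradients when their combined L2 norm exceeds a threshold; combined with momentum SGD one update step looks roughly like the sketch below. This is a generic plain-numpy illustration, not the papers' exact procedure, and the learning rate is an assumed placeholder.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a dict of gradient arrays if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return {name: g * scale for name, g in grads.items()}

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One in-place SGD-with-momentum update over a dict of parameter arrays."""
    for name in params:
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] += velocity[name]

# usage: grads = clip_by_global_norm(grads, 5.0); sgd_momentum_step(params, grads, velocity)
```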
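
The Transformer's learning-rate schedule (linear warmup, then decay proportional to the inverse square root of the step number) is short enough to write out; d_model = 512 and warmup_steps = 4000 are the base-model values from Vaswani et al 2017.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two terms cross at step = warmup_steps, which is where the peak learning rate is reached.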
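
BERT's schedule is simpler: linear warmup to a peak learning rate, then linear decay toward zero over the remaining updates. The defaults below (peak 1e-4, 10,000 warmup steps, 1,000,000 total steps) follow the quoted description above.

```python
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - progress)
```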
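
Label smoothing (used by the Transformer with ε = 0.1, and by Sennrich & Zhang) replaces the one-hot target with a mixture of the one-hot vector and the uniform distribution before taking the cross-entropy. A minimal numpy sketch, with the smoothing value as an assumption where a paper does not state it:

```python
import numpy as np

def label_smoothed_cross_entropy(log_probs, gold_index, eps=0.1):
    """Cross-entropy against a target that puts (1 - eps) on the gold token
    and spreads eps uniformly over the vocabulary."""
    vocab_size = log_probs.shape[-1]
    target = np.full(vocab_size, eps / vocab_size)
    target[gold_index] += 1.0 - eps
    return -np.sum(target * log_probs)
```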
Paper | Architecture | Optimizer | Optimizer Hyperparameters | Initialization | Normalization | Regularizer | Learning Schedule | Stopping Criterion | Activation Function | Tokenization | Extras
Bahdanau 2014 | Seq2seq BiLSTM + attention | Adadelta + gradient clipping (on norm) | epsilon = 1e-6, ρ = 0.95, gradient clipping = 1 | Random orthogonal + Gaussian | none | none | (no learning rate, set by Adadelta) | 5 days | LSTM & Tanh | Moses |
Ma & Hovy 2016 | BiLSTM + CNN word embeddings | SGD w/ momentum + gradient clipping | momentum = 0.9, gradient clipping = 5 | GloVe & Uniform [-sqrt(3/dim), sqrt(3/dim)] | | Dropout (0.5) | Rate decay | Early stopping | Sigmoid & Tanh | Tokens |
Gehring et al 2017 | CNN seq2seq | Nesterov + gradient clipping (on norm) | | | | | | | | |
Vaswani et al 2017 | Transformer | Adam | β1 = 0.9, β2 = 0.98, epsilon = 1e-9 | Glorot/fan_avg | Layer normalization | Dropout, label smoothing | Linear warmup + 1/sqrt(step_number) | 100,000 steps | ReLU | WordPiece |
BERT (Devlin et al 2019) | Transformer | Adam | β1 = 0.9, β2 = 0.999, epsilon = 1e-6 | (Glorot/fan_avg?) | Layer normalization | Dropout (0.1) + L2 weight decay of 0.01 | Linear warmup + linear decay | 1,000,000 steps | GELU | WordPiece | Trains on short sequences first

Alternative Training Methods
