ml:nn_training [2024/07/09 22:29] (current) – [Topics] jmflanig
  * [[https://arxiv.org/pdf/1803.09820.pdf|Smith 2018 - A Disciplined Approach to Neural Network Hyper-parameters]] Leslie's opinion, but has some good insights
  * [[http://karpathy.github.io/2019/04/25/recipe/|Karpathy 2019 - A Recipe for Training Neural Networks]] Good advice, especially looking at the data
  * [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/nn-training.pdf#page=9|NLP 202 Winter 2022 - Training Neural Networks (slides)]]
  
===== Topics =====
  * [[NN Initialization|Initialization]]
  * [[Normalization]]
  * [[Optimizers]]
  * [[Learning Rate]]
    * [[https://www.jeremyjordan.me/nn-learning-rate/|Blog Post: Setting the learning rate of your neural network]]
  * [[Loss Functions]]
  * [[Regularization]]
  * [[Fine-Tuning]] and [[nlp:Pretraining]]
  * **[[NN Tricks|Neural Network Tricks]]**
    * Tricks such as [[Curriculum Learning]], etc.
    * [[nlp:Transformers#Training|Transformer Training Tricks]]
    * Residual connections, [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]]
    * [[https://arxiv.org/pdf/1710.03740|Mixed Precision Training]] (also [[https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html|Train With Mixed Precision - NVIDIA Docs]]; see other papers as well)
  * [[Large-Scale]] and [[Distributed Training]]
  
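The warmup-then-decay learning-rate schedules discussed under [[Learning Rate]] (and used by BERT below) can be sketched as a simple function of the step count. This is an illustrative sketch in plain Python; the default values (linear warmup to a peak of 1e-4, then linear decay to zero) are common BERT-style choices, not a prescription from any single paper.

```python
def lr_at_step(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to 0.

    Defaults are illustrative BERT-style values, not a fixed recipe.
    """
    if step < warmup_steps:
        # warmup phase: ramp linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # decay phase: fall linearly from peak_lr (at warmup_steps) to 0 (at total_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

In frameworks this is usually expressed as a multiplicative schedule attached to the optimizer (e.g. a lambda over the step count) rather than a standalone function, but the shape is the same.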
  * Transformer: [[https://arxiv.org/pdf/1804.00247.pdf|Popel & Bojar 2018 - Training Tips for the Transformer Model]]
  * **[[https://arxiv.org/pdf/1810.04805.pdf|BERT (Devlin et al 2019)]]**: Uses Adam with linear warmup and a linearly decaying step size. Regularization: dropout and weight decay (L2 regularizer). Unlike the original Transformer, does not use label smoothing. "BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, epsilon = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens." (from [[https://arxiv.org/pdf/1907.11692.pdf|Liu et al 2020]]) A search over hyperparameters for training BERT is given in [[https://arxiv.org/pdf/1904.00962.pdf|You et al 2020]]. As noted in [[https://arxiv.org/pdf/2012.15832.pdf|Press 2020]], BERT trains on shorter sequences (<128 tokens) for the first 90% of training, then on longer sequences (<512) for the last 10%.
  * TODO: GPT-1 (p. 5)
  * Low-resource NMT system: [[https://arxiv.org/pdf/1905.11901.pdf|Sennrich & Zhang 2019 - Revisiting Low-Resource Neural Machine Translation: A Case Study]] Uses bideep RNN, label smoothing, different dropout rates for output word embeddings, input word embeddings and hidden layers, tied embeddings, layer normalization, tuned BPE vocabulary size (reduced from larger data scenarios). Trained with Adam with early-stopping on a dev set using BLEU.
  * TODO: BART
  * [[https://aclanthology.org/2021.emnlp-main.831.pdf|Academic Budget BERT]]
  * [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]]
  * [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
  
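Early stopping on a dev-set metric, as in Sennrich & Zhang above (early-stopping on dev BLEU), can be sketched generically. This is a hypothetical illustration: the class name, the patience default, and the higher-is-better convention are assumptions for the sketch, and real systems differ in whether they compare against the best-so-far or the previous score, and in how they tie stopping to checkpoint selection.

```python
class EarlyStopping:
    """Stop when a dev metric (e.g. BLEU, higher is better) has not
    improved on the best-so-far for `patience` consecutive evaluations.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")   # best dev score seen so far
        self.bad_evals = 0          # consecutive evaluations without improvement

    def should_stop(self, dev_score):
        if dev_score > self.best:
            self.best = dev_score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In practice one would call `should_stop` after each periodic dev-set evaluation and also save a checkpoint whenever `best` improves, so the final model is the best dev-set checkpoint rather than the last one.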
^ Paper ^ Architecture ^ Optimizer ^ Optimizer Hyperparameters ^ Initialization ^ Normalization ^ Regularizer ^ Learning Schedule ^ Stopping Criterion ^ Activation Function ^ Tokenization ^ Extras ^
ml/nn_training.1652573141.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
