ml:nn_training

  * [[Regularization]]
  * [[Fine-Tuning]] and [[nlp:Pretraining]]
  * **[[NN Tricks|Neural Network Tricks]]**
    * Tricks such as [[Curriculum Learning]], etc.
    * [[nlp:Transformers#Training|Transformer Training Tricks]]
    * Residual connections, [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]] (see the sketch after this list)
    * [[https://arxiv.org/pdf/1710.03740|Mixed Precision Training]] (also [[https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html|Train With Mixed Precision - NVIDIA Docs]]; see other papers as well, and the sketch after this list)
  * [[Large-Scale]] and [[Distributed Training]]
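
A minimal PyTorch sketch of a ReZero-style residual block, following the idea in the linked paper: each residual branch is gated by a learnable scalar initialized to zero, so every block starts as the identity. The two-layer MLP branch and the dimensions are illustrative assumptions, not the paper's exact setup.

<code python>
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block computing x + alpha * F(x), with alpha initialized to 0 (ReZero)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # F can be any sub-layer; a two-layer MLP is just an illustrative choice.
        self.f = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )
        # The ReZero gate: starts at zero, so the block is the identity at init.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.f(x)

x = torch.randn(8, 64)
block = ReZeroBlock(dim=64, hidden=256)
assert torch.allclose(block(x), x)  # identity at initialization
</code>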
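
For mixed precision, a minimal sketch of one automatic-mixed-precision training step using PyTorch's ''torch.cuda.amp'' (one implementation of the general recipe the NVIDIA docs above describe). The toy model, data, and hyperparameters are assumptions for illustration only.

<code python>
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

device = "cuda"  # AMP as shown here targets CUDA GPUs
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so small fp16 gradients do not underflow

for step in range(100):
    x = torch.randn(32, 256, device=device)          # stand-in batch
    y = torch.randint(0, 10, (32,), device=device)   # stand-in labels
    optimizer.zero_grad()
    with autocast():               # forward pass runs in fp16 where it is safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/NaN
    scaler.update()                # adapts the scale factor for the next step
</code>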
  
  * Low-resource NMT system: [[https://arxiv.org/pdf/1905.11901.pdf|Sennrich & Zhang 2019 - Revisiting Low-Resource Neural Machine Translation: A Case Study]]. Uses a bideep RNN, label smoothing, separate dropout rates for output word embeddings, input word embeddings, and hidden layers, tied embeddings, layer normalization, and a tuned BPE vocabulary size (reduced relative to larger-data scenarios). Trained with Adam, with early stopping on a dev set using BLEU. See the label-smoothing sketch after this list.
  * TODO: BART
  * [[https://aclanthology.org/2021.emnlp-main.831.pdf|Academic Budget BERT]]
  * [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]]
  * [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
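
As one concrete piece of the Sennrich & Zhang recipe, a minimal sketch of label smoothing using PyTorch's built-in ''CrossEntropyLoss''. The smoothing value and vocabulary size here are illustrative assumptions, not the paper's exact configuration.

<code python>
import torch
from torch import nn

# Label smoothing spreads a small amount of probability mass (here 0.1)
# over all classes instead of training against a hard one-hot target.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 32000)           # (batch, vocab): e.g. an NMT output layer
targets = torch.randint(0, 32000, (8,))  # gold token ids
loss = loss_fn(logits, targets)
print(loss.item())
</code>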
  
^ Paper ^ Architecture ^ Optimizer ^ Optimizer Hyperparameters ^ Initialization ^ Normalization ^ Regularizer ^ Learning Schedule ^ Stopping Criterion ^ Activation Function ^ Tokenization ^ Extras ^