====== Neural Network Tricks ======

===== Overviews =====
  * NLP 202 lecture: [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/nn-training.pdf|Training Deep Neural Networks (Winter 2022)]]

  * Training Tricks (see [[NN Training]])
    * [[Curriculum Learning]]
    * Overcoming [[Catastrophic Forgetting]]
    * Adjust the batch size, or use gradient accumulation to simulate larger batch sizes (see [[https://kozodoi.me/blog/20210219/gradient-accumulation|this blog]], for example, and the sketch after this list)
    * Try a different [[optimizers#modern_deep_learning_optimizers|optimizer]], such as [[https://arxiv.org/pdf/1908.03265.pdf|RAdam]] (see the optimizer sketch after this list)
    * Adjust [[https://arxiv.org/pdf/2011.02150.pdf|epsilon]] in Adam
  * Fine-tuning Specific Tricks
    * [[https://aclanthology.org/2022.acl-short.76/|NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better]]: before fine-tuning, adding a very small amount of uniform noise to each weight matrix (scaled by that matrix's standard deviation) can improve performance (see the sketch after this list)
  * Regularization Tricks (see [[Regularization]])
    * [[Regularization#Dropout]]
    * [[Ensembling]]
    * [[Knowledge Distillation]] (can improve performance partly through a regularizing effect; see the sketch after this list)
    * [[Regularization#Label Smoothing]]
  * Data Processing Tricks (see [[nlp:Data Preparation]])
    * [[nlp:Tokenization#Subword Units]] (BPE, wordpiece, subword regularization, BPE dropout; shared source and target vocabulary for subword units)
    * [[https://arxiv.org/pdf/1608.05859.pdf|Shared source and target embeddings]] (see the weight-tying sketch after this list)
  * Architecture Tricks (see [[NN Architectures]])
    * Residual connections
    * [[nlp:Attention Mechanisms|Attention]]
    * Copy mechanism
  * [[nlp:seq2seq|Seq2Seq]] and Generation Tricks
    * Try a different [[nlp:seq2seq#decoding_strategies|decoding method]]
      * [[https://arxiv.org/pdf/1904.09751.pdf|Nucleus sampling]] (see the sketch after this list)
      * [[https://arxiv.org/pdf/2010.02650.pdf|Uniform information density decoding]]
  * Reinforcement Learning Tricks
  * Efficiency Tricks
    * [[GPU Deep Learning]]
    * [[GPU Deep Learning#Memory Reduction Techniques|Gradient Checkpointing]] (see the sketch after this list) or [[https://arxiv.org/pdf/2202.08587.pdf|forward gradient]]
    * [[Model Compression]]
    * Tricks for [[Edge Computing]]
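
===== Code Sketches =====

The snippets below are minimal PyTorch sketches of some of the tricks listed above, not reference implementations; the models, data, and hyperparameter values are illustrative.

**Gradient accumulation.** A sketch of simulating a larger batch by summing gradients over several forward/backward passes before each optimizer step; the toy model and data are placeholders.

<code python>
import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model
loss_fn = nn.CrossEntropyLoss()
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

accum_steps = 4  # effective batch size = 8 * 4 = 32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average
    loss.backward()                            # grads accumulate in param.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
</code>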
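**Swapping the optimizer and adjusting epsilon.** RAdam ships with recent PyTorch versions; epsilon is the ''eps'' argument on Adam-family optimizers. The values here are illustrative.

<code python>
import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model

# A larger eps (e.g. 1e-6 instead of the default 1e-8) can noticeably
# change Adam-family behavior; see the epsilon paper linked above.
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, eps=1e-6)
</code>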
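**NoisyTune.** A sketch loosely following the paper's description: before fine-tuning, perturb each parameter matrix with uniform noise scaled by that matrix's standard deviation. The placeholder model and the noise intensity ''lam'' are illustrative.

<code python>
import torch
from torch import nn

model = nn.Linear(10, 2)  # stand-in for a pretrained model
lam = 0.15                # noise intensity (illustrative; tuned per task)

with torch.no_grad():
    for p in model.parameters():
        # uniform noise in [-lam/2, lam/2], scaled by the parameter's std
        p.add_((torch.rand_like(p) - 0.5) * lam * p.std())
</code>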
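**Knowledge distillation loss.** A common formulation (after Hinton et al.): blend hard-label cross-entropy with a KL term toward the teacher's temperature-softened outputs. ''T'' and ''alpha'' are illustrative.

<code python>
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
</code>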
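**Shared embeddings / weight tying.** A sketch of tying the output projection to the input embedding matrix, in the spirit of the Press & Wolf paper linked above; sharing source and target embeddings works the same way (reuse one ''nn.Embedding'' in encoder and decoder, which requires a shared vocabulary).

<code python>
import torch
from torch import nn

vocab_size, dim = 1000, 64
embedding = nn.Embedding(vocab_size, dim)
output_proj = nn.Linear(dim, vocab_size, bias=False)
output_proj.weight = embedding.weight  # one matrix for embeddings and softmax

x = torch.randint(0, vocab_size, (8,))
logits = output_proj(embedding(x))     # both uses share (and co-train) the weights
</code>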
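**Nucleus (top-p) sampling.** A sketch for a single decoding step over 1-D logits: sample from the smallest set of tokens whose cumulative probability exceeds ''p''.

<code python>
import torch
import torch.nn.functional as F

def nucleus_sample(logits, p=0.9):
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    keep = (cum - sorted_probs) < p      # mass before each token; first always kept
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()   # renormalize over the nucleus
    return sorted_idx[torch.multinomial(sorted_probs, 1)]

token = nucleus_sample(torch.randn(1000))
</code>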
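**Gradient checkpointing.** A sketch using ''torch.utils.checkpoint'': drop the block's activations during the forward pass and recompute them on backward, trading compute for memory.

<code python>
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)

# use_reentrant=False is the recommended mode in recent PyTorch versions
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # block activations are recomputed here
</code>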