====== Neural Network Tricks ======

===== Overviews =====

  * NLP 202 lecture: [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/nn-training.pdf|Training Deep Neural Networks (Winter 2022)]]
  * Training Tricks (see [[NN Training]])
    * [[NN Initialization|Initialization]]
    * [[Normalization]]
    * [[Learning Rate|Learning Rate Schedule]]
    * Gradient clipping [[https://arxiv.org/pdf/1211.5063.pdf|Pascanu et al. 2012]]
    * [[nlp:seq2seq#Scheduled Sampling]]
    * [[Curriculum Learning]]
    * Overcoming [[Catastrophic Forgetting]]
    * Adjust the batch size, or use gradient accumulation to simulate larger batch sizes (see [[https://kozodoi.me/blog/20210219/gradient-accumulation|this blog]], for example)
    * Try a different [[optimizers#modern_deep_learning_optimizers|optimizer]], such as [[https://arxiv.org/pdf/1908.03265.pdf|RAdam]]
    * Adjust [[https://arxiv.org/pdf/2011.02150.pdf|epsilon]] in Adam
  * Fine-Tuning Tricks
    * [[https://aclanthology.org/2022.acl-short.76/|NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better]]: before fine-tuning, adding a very small amount of uniform noise to each weight matrix (scaled by the standard deviation of that matrix) can improve performance
  * Regularization Tricks (see [[Regularization]])
    * [[Regularization#Dropout]]
    * [[Ensembling]]
    * [[Knowledge Distillation]] (can improve performance, likely through a regularization effect)
    * [[Regularization#Label Smoothing]]
  * Data Processing Tricks (see [[nlp:Data Preparation]])
    * [[nlp:Tokenization#Subword Units]] (BPE, WordPiece, subword regularization, BPE dropout; shared source and target vocabulary for subword units)
    * [[https://arxiv.org/pdf/1608.05859.pdf|Shared source and target embeddings]]
  * Architecture Tricks (see [[NN Architectures]])
    * Residual connections
    * [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]]
    * Weight sharing
    * [[nlp:Attention Mechanisms|Attention]]
    * Copy mechanism
  * [[nlp:seq2seq|Seq2Seq]] and Generation Tricks
    * Try a different [[nlp:seq2seq#decoding_strategies|decoding method]]
      * [[https://arxiv.org/pdf/1904.09751.pdf|Nucleus sampling]]
      * [[https://arxiv.org/pdf/2010.02650.pdf|Uniform information density decoding]]
  * Reinforcement Learning Tricks
  * Efficiency Tricks
    * [[GPU Deep Learning]]
    * [[GPU Deep Learning#Memory Reduction Techniques|Gradient Checkpointing]] or [[https://arxiv.org/pdf/2202.08587.pdf|forward gradients]]
    * [[Model Compression]]
    * Tricks for [[Edge Computing]]
    * [[Knowledge Distillation]]

==== Older NN Tricks ====

  * [[https://www.cs.utoronto.ca/~hinton/absps/sunspots.pdf|Nowlan & Hinton 1992 - Soft weight sharing]]

===== Related Pages =====

  * [[NN Training]]
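The gradient-accumulation trick listed above rests on a simple identity: for a loss that averages over examples, summing per-micro-batch gradients scaled by ''1/accum_steps'' reproduces the gradient of one large batch. A framework-free NumPy sketch, using linear regression with squared error as an illustrative stand-in for a real model (all names, shapes, and values here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # model weights (illustrative)
X = rng.normal(size=(16, 3))      # one "large batch" of 16 examples
y = rng.normal(size=16)

def grad(w, X, y):
    """Gradient of the mean squared error 0.5*mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

# Full large-batch gradient, computed in one shot.
g_full = grad(w, X, y)

# Same gradient accumulated over 4 micro-batches of 4 examples each,
# as one would when the large batch does not fit in memory.
accum_steps = 4
g_accum = np.zeros_like(w)
for Xm, ym in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    # Scale each micro-batch's mean gradient so the sum matches
    # the mean over the full large batch.
    g_accum += grad(w, Xm, ym) / accum_steps
```

In a framework like PyTorch the same pattern is usually written by calling ''(loss / accum_steps).backward()'' on each micro-batch and only stepping the optimizer every ''accum_steps'' iterations.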
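The NoisyTune trick listed above can be sketched in a few lines: each pretrained weight matrix is perturbed with uniform noise whose range is proportional to that matrix's standard deviation. This NumPy sketch uses plain dict-of-arrays "parameters"; the ''lam'' intensity value and the parameter names are illustrative assumptions, not taken from the paper or the page above:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_tune(params, lam=0.15):
    """Return a copy of each weight matrix with added U(-lam*std, lam*std)
    noise, where std is the standard deviation of that matrix's entries.
    lam is an illustrative noise-intensity hyperparameter."""
    noisy = {}
    for name, w in params.items():
        std = w.std()
        noise = rng.uniform(-lam * std, lam * std, size=w.shape)
        noisy[name] = w + noise
    return noisy

# Stand-in for pretrained weights; shapes and names are made up.
pretrained = {
    "layer1": rng.normal(size=(4, 4)),
    "layer2": rng.normal(size=(4, 2)),
}
perturbed = noisy_tune(pretrained)   # fine-tune starting from these
```

Because the noise range tracks each matrix's own scale, small-magnitude weight matrices receive proportionally small perturbations.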