Neural Network Tricks
- Training Tricks (see NN Training)
  - Gradient clipping (Pascanu et al., 2012)
  - Overcoming Catastrophic Forgetting
  - Adjust the batch size, or use gradient accumulation to simulate larger batch sizes
  - Adjust epsilon in Adam
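The clipping trick above can be sketched without any framework. This NumPy version rescales all gradient arrays so that their global L2 norm stays under a threshold; the function name and the small stabilizing constant are illustrative, not from the original page:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Global L2 norm across all gradient arrays
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale everything down only if the global norm exceeds max_norm
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]
```

Clipping by the global norm (rather than per-array) preserves the direction of the overall update, which is why it is the variant usually recommended for exploding gradients in RNNs.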
- Regularization Tricks (see Regularization)
  - Knowledge distillation (can improve performance, acting as a form of regularization)
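A minimal sketch of the standard distillation objective (KL divergence between temperature-softened teacher and student distributions, in the style of Hinton et al.); the function names and default temperature are assumed for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) over temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()) * T * T
```

In practice this term is mixed with the ordinary cross-entropy on the hard labels; the soft teacher targets carry extra information about class similarity, which is where the regularizing effect comes from.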
- Data Processing Tricks (see Data Preparation)
  - Subword units: BPE, WordPiece, subword regularization, BPE dropout; sharing the subword vocabulary between source and target
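A toy sketch of BPE merge learning in the style of Sennrich et al.: count adjacent symbol pairs over a frequency-weighted vocabulary and repeatedly merge the most frequent pair (all names here are illustrative):

```python
import collections

def get_pair_counts(vocab):
    # vocab maps a word (tuple of symbols) to its corpus frequency
    pairs = collections.Counter()
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the adjacent pair with one merged symbol
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

Sharing the learned merges between source and target simply means running this once over the concatenated corpora, so identical strings segment identically on both sides.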
- Architecture Tricks (see NN Architectures)
  - Residual connections
  - Weight sharing
  - Copy mechanism
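Residual connections can be illustrated in a few lines. This NumPy sketch adds the block input back onto the output of a hypothetical ReLU layer, so an identity path always exists for gradients to flow through:

```python
import numpy as np

def layer(x, W):
    # A stand-in ReLU layer; the real block could be anything
    return np.maximum(0.0, x @ W)

def residual_block(x, W):
    # Residual (skip) connection: output = input + F(input).
    # If the layer contributes nothing, the block reduces to the identity.
    return x + layer(x, W)
```

This identity path is what lets very deep stacks train: each block only has to learn a correction to its input rather than a full transformation.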
- Seq2Seq and Generation Tricks
  - Try a different decoding method
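Two common decoding alternatives, sketched over a single next-token distribution (greedy argmax versus top-k sampling; the function names are illustrative):

```python
import numpy as np

def greedy(probs):
    # Always pick the single most probable token
    return int(np.argmax(probs))

def top_k_sample(probs, k, rng):
    # Keep only the k most probable tokens, renormalize, and sample
    idx = np.argsort(probs)[-k:]
    p = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=p))
```

Greedy (and beam search) favor high-likelihood but repetitive output; truncated sampling such as top-k or nucleus sampling trades a little likelihood for diversity, which often matters for open-ended generation.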
- Reinforcement Learning Tricks
- Efficiency Tricks
- Tricks for Edge Computing
Older NN Tricks
Related Pages
ml/nn_tricks.1652297345.txt.gz · Last modified: 2023/06/15 07:36 (external edit)