Model Compression
See also Sparsity in Neural Networks.
Overviews
- General
- LLMs and Transformers
- Pruning
- Distillation
General Papers
Pruning & Sparsification
See also Sparsity.
- Giles & Omlin 1994 - Pruning Recurrent Neural Networks for Improved Generalization Performance An early 1990s paper. Prunes whole nodes by the magnitude of their total incoming weights, rather than pruning individual edges as more recent work does.
- Frankle & Carbin 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks After training, only a small subset of the edges matter. Prune the rest (keep only the largest-magnitude weights), reset the surviving weights to their original initialization values, and the resulting sparse subnetwork can be retrained to comparable accuracy on its own (see the sketch after this list).
- LayerDrop: Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout Regularizes the network during training so that smaller sub-networks of any depth can be extracted at test time without fine-tuning. As easy to implement as dropout.
- Sanh et al 2020 - Movement Pruning: Adaptive Sparsity by Fine-Tuning Prunes by how weights change during fine-tuning rather than by their magnitude, which makes it well suited to transfer learning from pre-trained models.
- LLM Pruning
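The lottery-ticket procedure is easy to sketch: train, prune the smallest-magnitude weights, rewind the survivors to their saved initialization, retrain. Below is a minimal NumPy illustration of one-shot magnitude pruning; the single-layer setup and the stand-in "training" step are purely illustrative, not the authors' code.
<code python>
import numpy as np

def magnitude_mask(weights, sparsity):
    """Keep the largest-magnitude weights; zero out the bottom `sparsity` fraction."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(256, 256))                  # saved initialization
w_trained = w_init + rng.normal(scale=0.05, size=w_init.shape)   # stand-in for real training

mask = magnitude_mask(w_trained, sparsity=0.8)   # prune 80% of edges after training
w_ticket = w_init * mask                         # rewind the survivors to their init values
# Retrain w_ticket with the mask held fixed; the hypothesis is that this sparse
# subnetwork reaches accuracy comparable to the dense network.
print(mask.mean())                               # ~0.2 of the weights remain
</code>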
Quantization
After Training
- Empirical Studies
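The notes here don't pin down a particular scheme, so as a reference point here is a minimal sketch of symmetric per-tensor post-training int8 quantization (scale taken from the maximum absolute weight); it is generic, not taken from any paper on this page.
<code python>
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float weights to int8 with one scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, scale))))  # worst-case rounding error, about scale/2
</code>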
During Training
- Hubara et al 2018 - Quantized Back-Propagation: Training Binarized Neural Networks with Quantized Gradients Interesting, but the paper was rejected in review.
Binarized Neural Networks
- Liu et al 2018 - Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm Binarizes the weights and activations of the CNN, but adds real-valued residual (shortcut) connections between layers to increase the expressivity of the network
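A rough sketch of that idea, assuming PyTorch: sign() binarization in the forward pass, a straight-through estimator in the backward pass, and a real-valued shortcut around each binarized convolution. The BiRealBlock below and its exact layer ordering are illustrative, not the paper's architecture.
<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(x):
    # sign() in the forward pass; identity (straight-through) gradient in the backward pass
    return x + (torch.sign(x) - x).detach()

class BiRealBlock(nn.Module):
    """One 1-bit conv block with a real-valued shortcut (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = binarize(x)                        # 1-bit activations
        w_bin = binarize(self.conv.weight)       # 1-bit weights
        out = self.bn(F.conv2d(out, w_bin, padding=1))
        return out + x                           # real-valued residual keeps expressivity

print(BiRealBlock(16)(torch.randn(1, 16, 8, 8)).shape)
</code>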
Model Distillation
See also Knowledge Distillation.
- Zhao et al 2021 - Extremely Small BERT Models from Mixed-Vocabulary Training In very small BERT models, most of the memory footprint is the input embedding table. This method combines distillation with a much smaller student vocabulary (mixed-vocabulary training) to compress BERT roughly an order of magnitude further than standard distilled models.
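For reference, the generic soft-target distillation loss that methods in this section build on, assuming PyTorch; the temperature T and mixing weight alpha are generic knobs, and nothing here is specific to the paper above.
<code python>
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's softened outputs."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale so both terms have comparable gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)  # toy logits
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
</code>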
Parameter Sharing
- HashedNets: Chen et al 2015 - Compressing Neural Networks with the Hashing Trick Randomly shares weights across the network by hashing each connection into a bucket of a small shared parameter vector (see the sketch below)
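The hashing trick is small enough to sketch in full: each virtual weight position is hashed to an index into a much smaller vector of shared parameters, so many connections reuse the same value. The hash function below is an arbitrary stand-in, not the one used in the paper.
<code python>
import numpy as np

def hashed_weight_matrix(shared_params, shape):
    """Expand a small shared parameter vector into a (virtual) dense weight matrix."""
    rows, cols = shape
    i, j = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Any deterministic hash of the position (i, j) works for this sketch.
    buckets = (i * 2654435761 + j * 40503) % shared_params.size
    return shared_params[buckets]

shared = np.random.default_rng(0).normal(size=64)   # 64 real parameters...
W = hashed_weight_matrix(shared, (256, 128))        # ...expanded to a 256x128 layer
print(W.shape, np.unique(W).size)                   # at most 64 distinct values
</code>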
Conferences and Workshops
People
- Daniel Soudry - Quantization and binarized NNs
Related Pages
- Conditional Computation Activates only certain parts of the network for each input instance (early exit, etc.)