Table of Contents
Model Compression
Overviews
General Papers
Pruning & Sparsification
Quantization
After Training
During Training
Binarized Neural Networks
Model Distillation
Parameter Sharing
Conferences and Workshops
People
Related Pages
Model Compression
See also Sparsity in Neural Networks.
Overviews
General
Cheng et al 2017 - A Survey of Model Compression and Acceleration for Deep Neural Networks
Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
2023 - Model Compression for Deep Neural Networks: A Survey
See chapter 4 of: 2023 - Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review
LLMs and Transformers
Xu & McAuley 2022 - A Survey on Model Compression and Acceleration for Pretrained Language Models
Zhu et al 2023 - A Survey on Model Compression for Large Language Models
Tang et al 2024 - A Survey on Transformer Compression
Pruning
Blalock et al 2020 - What is the State of Neural Network Pruning?
Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Distillation
Xu et al 2024 - A Survey on Knowledge Distillation of Large Language Models
General Papers
Li et al 2020 - Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Related:
Scaling laws
Ganesh et al 2020 - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
Pruning & Sparsification
See also Sparsity.
LeCun et al 1990 - Optimal Brain Damage
Giles & Omlin 1994 - Pruning Recurrent Neural Networks for Improved Generalization Performance
Old 1990s paper. Prunes nodes by magnitude of total incoming weights, rather than pruning edges like more recent work does.
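A toy sketch of that node-level criterion, assuming a plain PyTorch `nn.Linear` layer; the function is illustrative, not the paper's algorithm for recurrent networks:

```python
import torch
import torch.nn as nn

def prune_units_by_incoming_magnitude(layer: nn.Linear, n_prune: int):
    """Zero out the n_prune output units whose incoming weights have the
    smallest total magnitude, i.e. prune whole nodes rather than edges."""
    with torch.no_grad():
        scores = layer.weight.abs().sum(dim=1)       # one score per output unit
        prune_idx = torch.argsort(scores)[:n_prune]  # weakest units first
        layer.weight[prune_idx] = 0.0                # remove the whole node
        if layer.bias is not None:
            layer.bias[prune_idx] = 0.0
    return prune_idx
```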
See et al 2016 - Compression of Neural Machine Translation Models via Pruning
Frankle & Carbin 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
After training, only a subset of the edges in the network matters: keep the highest-magnitude weights, prune the rest, and the resulting sparse subnetwork, reset to its original initialization weights, can be retrained to match the accuracy of the full network.
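A minimal sketch of this prune-then-rewind recipe, assuming a generic PyTorch model and an externally supplied `train_fn`; the function name and the one-shot 80% sparsity are illustrative, not the authors' code:

```python
import copy
import torch
import torch.nn as nn

def find_lottery_ticket(model: nn.Module, train_fn, sparsity: float = 0.8):
    """One-shot magnitude pruning with rewind to the original initialization."""
    init_state = copy.deepcopy(model.state_dict())  # remember the initialization
    train_fn(model)                                 # ordinary dense training

    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                         # skip biases / norm parameters
            continue
        k = int(param.numel() * sparsity)
        threshold = param.abs().flatten().kthvalue(max(k, 1)).values
        masks[name] = (param.abs() > threshold).float()   # keep largest weights

    model.load_state_dict(init_state)               # rewind survivors to init
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return masks   # reapply these masks after each optimizer step when retraining
```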
LayerDrop:
Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout
Regularizes networks so that smaller sub-networks of any depth can be extracted at test time without finetuning; as easy to implement as dropout.
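A sketch of the idea (not the fairseq implementation): during training each layer is skipped with probability p, and a shallower model can be read off at test time by keeping, say, every other layer:

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Stack of layers where whole layers are randomly dropped while training."""

    def __init__(self, layers: nn.ModuleList, p: float = 0.2):
        super().__init__()
        self.layers = layers
        self.p = p

    def forward(self, x, keep_every=None):
        for i, layer in enumerate(self.layers):
            if self.training:
                if torch.rand(1).item() < self.p:   # drop this whole layer
                    continue
            elif keep_every is not None and i % keep_every != 0:
                continue            # extract a shallower network at test time
            x = layer(x)
        return x
```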
Sanh et al 2020 - Movement Pruning: Adaptive Sparsity by Fine-Tuning
Works for transfer-learning of pre-trained models
Lagunas et al 2021 - Block Pruning For Faster Transformers
LLM Pruning
See the overview in section 2.1.2 of Wan 2024 or section 4 of Zhu 2023.
Frantar & Alistarh 2023 - SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Ma et al 2023 - LLM-Pruner: On the Structural Pruning of Large Language Models
github
Sun et al 2023 - A Simple and Effective Pruning Approach for Large Language Models
SIMPLE:
Tao et al 2023 - Structured Pruning for Efficient Generative Pre-trained Language Models
Xia et al 2023 - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Kim et al 2024 - Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
Men et al 2024 - ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Quantization
After Training
Shen et al 2019 - Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Aji & Heafield 2020 - Compressing Neural Machine Translation Models with 4-bit Precision
Kim et al 2020 - I-BERT: Integer-only BERT Quantization
Empirical Studies
Zheng et al 2025 - An Empirical Study of Qwen3 Quantization
Liu et al 2024 - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Liu et al 2025 - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
During Training
Zhou et al 2016 - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
Hubara et al 2018 - Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Hubara et al 2018 - Quantized Back-Propagation: Training Binarized Neural Networks with Quantized Gradients
Interesting, but was rejected; see here.
Zafrir et al 2019 - Q8BERT: Quantized 8Bit BERT
Software
Chmiel et al 2021 - Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning
Sun et al 2025 - Scaling Laws for Floating Point Quantization Training
Binarized Neural Networks
Courbariaux et al 2016 - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
Rastegari et al 2016 - XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
Anderson & Berg 2017 - The High-Dimensional Geometry of Binary Neural Networks
Liu et al 2018 - Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm
Binarizes the weights and activations of the CNN, but adds real-valued residual connections between layers to increase the representational capacity of the network
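A sketch of one such block, assuming a sign activation with a clipped straight-through gradient; this is illustrative, not the paper's full architecture (which also binarizes weights and uses a tailored training schedule):

```python
import torch
import torch.nn as nn

class BinarySign(torch.autograd.Function):
    """Sign activation with a clipped straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()  # pass gradient only near 0

class BiRealBlock(nn.Module):
    """Binarized activations feed the convolution, while the real-valued input
    is added back through the shortcut to preserve representational capacity."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = BinarySign.apply(x)      # 1-bit activations
        out = self.bn(self.conv(out))  # weights could be binarized here as well
        return out + x                 # real-valued residual connection
```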
Li & Su 2020 - Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs
Kurtz & Bah 2020 - An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions
Xu et al 2021 - Learning Frequency Domain Approximation for Binary Neural Networks
Model Distillation
See also Knowledge Distillation.
Sanh et al 2019 - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Zhao et al 2021 - Extremely Small BERT Models from Mixed-Vocabulary Training
Most of BERT's memory footprint comes from the input embeddings. This method uses mixed-vocabulary training and distillation to compress BERT-Large by an order of magnitude more than standard distilled models.
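For context, the standard distillation objective that DistilBERT-style methods build on (softened teacher/student KL mixed with cross-entropy on the hard labels); this is the generic loss, not the mixed-vocabulary procedure of the paper above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL between teacher and student, mixed with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```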
Parameter Sharing
HashedNets:
Chen et al 2015 - Compressing Neural Networks with the Hashing Trick
Randomly shares weights in the neural network via a hash function: many virtual weight positions map to a single stored parameter
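A tiny sketch of the trick, with a fixed random mapping standing in for the hash function; names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class HashedLinear(nn.Module):
    """Linear layer whose virtual (out x in) weight matrix is backed by a much
    smaller shared parameter vector; each position is assigned by a fixed hash."""

    def __init__(self, in_features, out_features, n_real_weights, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        idx = torch.randint(n_real_weights, (out_features, in_features), generator=g)
        self.register_buffer("hash_idx", idx)          # the "hash" of each position
        self.weight = nn.Parameter(torch.randn(n_real_weights) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.weight[self.hash_idx]                 # expand shared weights
        return x @ w.t() + self.bias

# e.g. a 512x512 layer stored with only 4096 real weights
layer = HashedLinear(512, 512, n_real_weights=4096)
```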
Conferences and Workshops
MLSys
People
Daniel Soudry - Quantization and binarized NNs
Related Pages
Conditional Computation
Invokes only certain parts of the network for each input instance; early exit, etc.
Edge Computing
Efficient NNs
Knowledge Distillation
Sparsity in Neural Networks