Table of Contents
Model Compression
Overviews
General Papers
Pruning & Sparsification
Quantization
After Training
During Training
Binarized Neural Networks
Model Distillation
Parameter Sharing
Conferences and Workshops
People
Related Pages
Model Compression
See also Sparsity in Neural Networks.
Overviews
General
Cheng et al 2017 - A Survey of Model Compression and Acceleration for Deep Neural Networks
Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
2023 - Model Compression for Deep Neural Networks: A Survey
See chapter 4 of: 2023 - Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review
LLMs and Transformers
Xu & McAuley 2022 - A Survey on Model Compression and Acceleration for Pretrained Language Models
Zhu et al 2023 - A Survey on Model Compression for Large Language Models
Tang et al 2024 - A Survey on Transformer Compression
Pruning
Blalock et al 2020 - What is the State of Neural Network Pruning?
Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Distillation
Xu et al 2024 - A Survey on Knowledge Distillation of Large Language Models
General Papers
Li et al 2020 - Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Related:
Scaling laws
Ganesh et al 2020 - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
Pruning & Sparsification
See also Sparsity.
LeCun et al 1990 - Optimal Brain Damage
Giles & Omlin 1994 - Pruning Recurrent Neural Networks for Improved Generalization Performance
Old 1990s paper. Prunes nodes by magnitude of total incoming weights, rather than pruning edges like more recent work does.
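A toy sketch of that node-level criterion, assuming a plain PyTorch `nn.Linear` layer; the function is illustrative, not the paper's algorithm for recurrent networks:

```python
import torch
import torch.nn as nn

def prune_units_by_incoming_magnitude(layer: nn.Linear, n_prune: int):
    """Zero out the n_prune output units whose incoming weights have the
    smallest total magnitude, i.e. prune whole nodes rather than edges."""
    with torch.no_grad():
        scores = layer.weight.abs().sum(dim=1)       # one score per output unit
        prune_idx = torch.argsort(scores)[:n_prune]  # weakest units first
        layer.weight[prune_idx] = 0.0                # remove the whole node
        if layer.bias is not None:
            layer.bias[prune_idx] = 0.0
    return prune_idx
```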
See et al 2016 - Compression of Neural Machine Translation Models via Pruning
Frankle & Carbin 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
After training, only a subset of the edges in the network matters: keep the highest-magnitude weights, prune the rest, and the resulting sparse subnetwork, reset to its original initialization weights, can be retrained to match the accuracy of the full network.
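A minimal sketch of this prune-then-rewind recipe, assuming a generic PyTorch model and an externally supplied `train_fn`; the function name and the one-shot 80% sparsity are illustrative, not the authors' code:

```python
import copy
import torch
import torch.nn as nn

def find_lottery_ticket(model: nn.Module, train_fn, sparsity: float = 0.8):
    """One-shot magnitude pruning with rewind to the original initialization."""
    init_state = copy.deepcopy(model.state_dict())  # remember the initialization
    train_fn(model)                                 # ordinary dense training

    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                         # skip biases / norm parameters
            continue
        k = int(param.numel() * sparsity)
        threshold = param.abs().flatten().kthvalue(max(k, 1)).values
        masks[name] = (param.abs() > threshold).float()   # keep largest weights

    model.load_state_dict(init_state)               # rewind survivors to init
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return masks   # reapply these masks after each optimizer step when retraining
```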
LayerDrop:
Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout
Regularizes networks so that smaller sub-networks of any depth can be extracted at test time without finetuning; as easy to implement as dropout.
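A sketch of the idea (not the fairseq implementation): during training each layer is skipped with probability p, and a shallower model can be read off at test time by keeping, say, every other layer:

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Stack of layers where whole layers are randomly dropped while training."""

    def __init__(self, layers: nn.ModuleList, p: float = 0.2):
        super().__init__()
        self.layers = layers
        self.p = p

    def forward(self, x, keep_every=None):
        for i, layer in enumerate(self.layers):
            if self.training:
                if torch.rand(1).item() < self.p:   # drop this whole layer
                    continue
            elif keep_every is not None and i % keep_every != 0:
                continue            # extract a shallower network at test time
            x = layer(x)
        return x
```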
Sanh et al 2020 - Movement Pruning: Adaptive Sparsity by Fine-Tuning
Works for transfer-learning of pre-trained models
Lagunas et al 2021 - Block Pruning For Faster Transformers
LLM Pruning
See the overview in section 2.1.2 of Wan 2024 or section 4 of Zhu 2023.
Frantar & Alistarh 2023 - SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Ma et al 2023 - LLM-Pruner: On the Structural Pruning of Large Language Models
github
Sun et al 2023 - A Simple and Effective Pruning Approach for Large Language Models
SIMPLE:
Tao et al 2023 - Structured Pruning for Efficient Generative Pre-trained Language Models
Xia et al 2023 - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Kim et al 2024 - Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
Men et al 2024 - ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Quantization
After Training
Shen et al 2019 - Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Aji & Heafield 2020 - Compressing Neural Machine Translation Models with 4-bit Precision
Kim et al 2020 - I-BERT: Integer-only BERT Quantization
Empirical Studies
Zheng et al 2025 - An Empirical Study of Qwen3 Quantization
Liu et al 2024 - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Liu et al 2025 - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
During Training
Zhou et al 2016 - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
Hubara et al 2018 - Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Hubara et al 2018 - Quantized Back-Propagation: Training Binarized Neural Networks with Quantized Gradients
Interesting, but was rejected; see here.
Zafrir et al 2019 - Q8BERT: Quantized 8Bit BERT
Software
Chmiel et al 2021 - Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning
Sun et al 2025 - Scaling Laws for Floating Point Quantization Training
Binarized Neural Networks
Courbariaux et al 2016 - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
Rastegari et al 2016 - XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
Anderson & Berg 2017 - The High-Dimensional Geometry of Binary Neural Networks
Liu et al 2018 - Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm
Binarizes the weights and activations of the CNN, but adds real-valued residual connections between layers to increase the representational capacity of the network
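A sketch of one such block, assuming a sign activation with a clipped straight-through gradient; this is illustrative, not the paper's full architecture (which also binarizes weights and uses a tailored training schedule):

```python
import torch
import torch.nn as nn

class BinarySign(torch.autograd.Function):
    """Sign activation with a clipped straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()  # pass gradient only near 0

class BiRealBlock(nn.Module):
    """Binarized activations feed the convolution, while the real-valued input
    is added back through the shortcut to preserve representational capacity."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = BinarySign.apply(x)      # 1-bit activations
        out = self.bn(self.conv(out))  # weights could be binarized here as well
        return out + x                 # real-valued residual connection
```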
Li & Su 2020 - Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs
Kurtz & Bah 2020 - An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions
Xu et al 2021 - Learning Frequency Domain Approximation for Binary Neural Networks
Model Distillation
See also Knowledge Distillation.
Sanh et al 2019 - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Zhao et al 2021 - Extremely Small BERT Models from Mixed-Vocabulary Training
Most of BERT's memory footprint comes from the input embeddings. This method uses mixed-vocabulary training and distillation to compress BERT-Large by an order of magnitude more than standard distilled models.
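For context, the standard distillation objective that DistilBERT-style methods build on (softened teacher/student KL mixed with cross-entropy on the hard labels); this is the generic loss, not the mixed-vocabulary procedure of the paper above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL between teacher and student, mixed with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```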
Parameter Sharing
HashedNets:
Chen et al 2015 - Compressing Neural Networks with the Hashing Trick
Randomly shares weights in the neural network via a hash function: many virtual weight positions map to a single stored parameter
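A tiny sketch of the trick, with a fixed random mapping standing in for the hash function; names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class HashedLinear(nn.Module):
    """Linear layer whose virtual (out x in) weight matrix is backed by a much
    smaller shared parameter vector; each position is assigned by a fixed hash."""

    def __init__(self, in_features, out_features, n_real_weights, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        idx = torch.randint(n_real_weights, (out_features, in_features), generator=g)
        self.register_buffer("hash_idx", idx)          # the "hash" of each position
        self.weight = nn.Parameter(torch.randn(n_real_weights) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.weight[self.hash_idx]                 # expand shared weights
        return x @ w.t() + self.bias

# e.g. a 512x512 layer stored with only 4096 real weights
layer = HashedLinear(512, 512, n_real_weights=4096)
```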
Conferences and Workshops
MLSys
People
Daniel Soudry - Quantization and binarized NNs
Related Pages
Conditional Computation
Invokes only certain parts of the network for each input instance; early exit, etc.
Edge Computing
Efficient NNs
Knowledge Distillation
Sparsity in Neural Networks