====== Model Compression ======

See also [[nn_sparsity|Sparsity in Neural Networks]].

===== Overviews =====

  * **General**
    * [[https://arxiv.org/pdf/1710.09282|Cheng et al 2017 - A Survey of Model Compression and Acceleration for Deep Neural Networks]]
    * [[https://arxiv.org/pdf/2102.00554|Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks]]
    * [[https://www.mdpi.com/2073-431X/12/3/60|2023 - Model Compression for Deep Neural Networks: A Survey]]
    * See chapter 4 of [[https://arxiv.org/pdf/2311.11883|2023 - Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review]]
  * **LLMs and Transformers**
    * [[https://arxiv.org/pdf/2202.07105|Xu & McAuley 2022 - A Survey on Model Compression and Acceleration for Pretrained Language Models]]
    * **[[https://arxiv.org/pdf/2308.07633|Zhu et al 2023 - A Survey on Model Compression for Large Language Models]]**
    * [[https://arxiv.org/pdf/2402.05964|Tang et al 2024 - A Survey on Transformer Compression]]
  * **Pruning**
    * [[https://arxiv.org/pdf/2003.03033.pdf|Blalock et al 2020 - What is the State of Neural Network Pruning?]]
    * [[https://arxiv.org/pdf/2102.00554|Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks]]
  * **Distillation**
    * [[https://arxiv.org/pdf/2402.13116|Xu et al 2024 - A Survey on Knowledge Distillation of Large Language Models]]

===== General Papers =====

  * [[https://arxiv.org/pdf/2002.11794.pdf|Li et al 2020 - Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers]] Related: [[Scaling laws]]
  * [[https://arxiv.org/pdf/2002.11985|Ganesh et al 2020 - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT]]

===== Pruning & Sparsification =====

See also [[nn_sparsity|Sparsity]].
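Most of the papers in this section build on, or compare against, unstructured magnitude pruning: rank weights by absolute value and zero out the smallest fraction. A minimal NumPy sketch of that baseline (illustrative only, not any particular paper's method):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights.

    A minimal illustration of unstructured magnitude pruning; real
    methods typically prune iteratively, interleaved with retraining.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep strictly above it
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned, mask = magnitude_prune(W, sparsity=0.75)
print(int(mask.sum()))  # 4 of 16 weights survive
```

Structured variants (block pruning, Sheared LLaMA, depth pruning) remove whole rows, columns, attention heads, or layers instead of individual entries, which translates more directly into wall-clock speedups on real hardware.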
  * [[https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf|LeCun et al 1990 - Optimal Brain Damage]]
  * [[https://clgiles.ist.psu.edu/papers/IEEE.TNN.Pruning.94.pdf|Giles & Omlin 1994 - Pruning Recurrent Neural Networks for Improved Generalization Performance]] Early 1990s work. Prunes whole nodes by the magnitude of their total incoming weights, rather than pruning individual edges as more recent work does.
  * [[https://arxiv.org/pdf/1606.09274.pdf|See et al 2016 - Compression of Neural Machine Translation Models via Pruning]]
  * [[https://arxiv.org/abs/1803.03635|Frankle & Carbin 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks]] After training, only a small subset of the network's edges matters. The rest can be pruned (keep only the highest-magnitude weights), and the resulting sparse subnetwork can be retrained to full accuracy as long as it starts from the same initialization weights.
  * LayerDrop: [[https://arxiv.org/pdf/1909.11556.pdf|Fan et al 2019 - Reducing Transformer Depth on Demand with Structured Dropout]] Regularizes networks so that smaller subnetworks of any depth can be extracted at test time without fine-tuning. As easy to implement as dropout.
  * [[https://arxiv.org/pdf/2005.07683.pdf|Sanh et al 2020 - Movement Pruning: Adaptive Sparsity by Fine-Tuning]] Works for transfer learning of pre-trained models.
  * [[https://arxiv.org/pdf/2109.04838.pdf|Lagunas et al 2021 - Block Pruning For Faster Transformers]]
  * **LLM Pruning**
    * See the overview in section 2.1.2 of [[https://arxiv.org/pdf/2312.03863|Wan 2024]] or section 4 of [[https://arxiv.org/pdf/2308.07633|Zhu 2023]].
    * [[https://arxiv.org/pdf/2301.00774|Frantar & Alistarh 2023 - SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot]]
    * [[https://arxiv.org/pdf/2305.11627|Ma et al 2023 - LLM-Pruner: On the Structural Pruning of Large Language Models]] [[https://github.com/horseee/LLM-Pruner|github]]
    * [[https://arxiv.org/pdf/2306.11695|Sun et al 2023 - A Simple and Effective Pruning Approach for Large Language Models]]
    * SIMPLE: [[https://aclanthology.org/2023.findings-acl.692.pdf|Tao et al 2023 - Structured Pruning for Efficient Generative Pre-trained Language Models]]
    * [[https://arxiv.org/pdf/2310.06694|Xia et al 2023 - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning]]
    * [[https://arxiv.org/pdf/2402.02834|Kim et al 2024 - Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods]]
    * [[https://arxiv.org/pdf/2403.03853|Men et al 2024 - ShortGPT: Layers in Large Language Models are More Redundant Than You Expect]]

===== Quantization =====

==== After Training ====

  * [[https://arxiv.org/pdf/1909.05840.pdf|Shen et al 2019 - Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT]]
  * [[https://www.aclweb.org/anthology/2020.ngt-1.4.pdf|Aji & Heafield 2020 - Compressing Neural Machine Translation Models with 4-bit Precision]]
  * [[https://arxiv.org/pdf/2101.01321.pdf|Kim et al 2021 - I-BERT: Integer-only BERT Quantization]]
  * **Empirical Studies**
    * [[https://arxiv.org/pdf/2505.02214|Zheng et al 2025 - An Empirical Study of Qwen3 Quantization]]
    * [[https://aclanthology.org/2024.lrec-main.461.pdf|Liu et al 2024 - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study]]
    * [[https://arxiv.org/pdf/2504.04823|Liu et al 2025 - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models]]

==== During Training ====

  * [[https://arxiv.org/pdf/1606.06160.pdf|Zhou et al 2016 - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients]]
  * [[https://www.jmlr.org/papers/volume18/16-456/16-456.pdf|Hubara et al 2018 - Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations]]
  * [[https://openreview.net/pdf?id=Bye10KkwG|Hubara et al 2018 - Quantized Back-Propagation: Training Binarized Neural Networks with Quantized Gradients]] Interesting, but was rejected; see [[https://openreview.net/forum?id=Bye10KkwG|here]].
  * [[https://arxiv.org/pdf/1910.06188.pdf|2019 - Q8BERT: Quantized 8Bit BERT]] [[https://intellabs.github.io/nlp-architect/quantized_bert.html|Software]]
  * [[https://arxiv.org/pdf/2112.10769.pdf|Chmiel et al 2021 - Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning]]
  * [[https://arxiv.org/pdf/2501.02423|Sun et al 2025 - Scaling Laws for Floating Point Quantization Training]]

==== Binarized Neural Networks ====

  * [[https://arxiv.org/pdf/1602.02830.pdf|Courbariaux et al 2016 - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1]]
  * [[https://arxiv.org/pdf/1603.05279.pdf|Rastegari et al 2016 - XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks]]
  * [[https://arxiv.org/pdf/1705.07199.pdf|Anderson & Berg 2017 - The High-Dimensional Geometry of Binary Neural Networks]]
  * [[https://arxiv.org/pdf/1808.00278.pdf|Liu et al 2018 - Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm]] Binarizes the CNN's weights and activations, but adds real-valued residual connections between layers to increase the network's expressivity.
  * [[https://arxiv.org/pdf/2006.16578.pdf|Li & Su 2020 - Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs]]
  * [[http://www.optimization-online.org/DB_FILE/2020/07/7883.pdf|Kurtz & Bah 2020 - An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions]]
  * [[https://arxiv.org/pdf/2103.00841.pdf|Xu et al 2021 - Learning Frequency Domain Approximation for Binary Neural Networks]]

===== Model Distillation =====

See also [[Knowledge Distillation]].

  * [[https://arxiv.org/pdf/1910.01108.pdf|Sanh et al 2019 - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter]]
  * [[https://www.aclweb.org/anthology/2021.eacl-main.238.pdf|Zhao et al 2021 - Extremely Small BERT Models from Mixed-Vocabulary Training]] Most of BERT's memory footprint comes from the input embeddings. This method uses model distillation to compress BERT Large by an order of magnitude more than prior distilled models.

===== Parameter Sharing =====

  * HashedNets: [[https://arxiv.org/pdf/1504.04788|Chen et al 2015 - Compressing Neural Networks with the Hashing Trick]] Randomly shares weights within the network by hashing each weight position into a smaller pool of shared parameters.

===== Conferences and Workshops =====

  * [[https://mlsys.org/|MLSys]]

===== People =====

  * [[https://scholar.google.com/citations?user=AEBWEm8AAAAJ&hl=en|Daniel Soudry]] - Quantization and binarized NNs

===== Related Pages =====

  * [[Conditional Computation]] Invokes only certain parts of the network for each instance, early exit, etc.
  * [[Edge Computing]]
  * [[Efficient NNs]]
  * [[Knowledge Distillation]]
  * [[nn_sparsity|Sparsity in Neural Networks]]