====== Model Compression ======
See also [[nn_sparsity|Sparsity in Neural Networks]].

===== Overviews =====
  * **General**
    * [[https://arxiv.org/pdf/1710.09282|Cheng et al 2017 - A Survey of Model Compression and Acceleration for Deep Neural Networks]]
    * [[https://arxiv.org/pdf/2102.00554|Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks]]
    * [[https://www.mdpi.com/2073-431X/12/3/60|2023 - Model Compression for Deep Neural Networks: A Survey]]
    * See chapter 4 of [[https://arxiv.org/pdf/2311.11883|2023 - Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review]]
    * [[https://arxiv.org/pdf/2003.03033.pdf|Blalock et al 2020 - What is the State of Neural Network Pruning?]]
    * [[https://arxiv.org/pdf/2102.00554|Hoefler et al 2021 - Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks]]
  * **Distillation**
    * [[https://arxiv.org/pdf/2402.13116|Xu et al 2024 - A Survey on Knowledge Distillation of Large Language Models]]
  
===== General Papers =====
  * [[https://arxiv.org/pdf/2002.11794.pdf|Li et al 2020 - Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers]] Related: [[Scaling laws]]
  * [[https://arxiv.org/pdf/2002.11985|Ganesh et al 2020 - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT]]
  
  * [[https://arxiv.org/pdf/2109.04838.pdf|Lagunas et al 2021 - Block Pruning For Faster Transformers]]
  * **LLM Pruning**
    * See overview in section 2.1.2 of [[https://arxiv.org/pdf/2312.03863|Wan 2024]] or section 4 of [[https://arxiv.org/pdf/2308.07633|Zhu 2023]].
    * [[https://arxiv.org/pdf/2301.00774|Frantar & Alistarh 2023 - SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot]]
    * [[https://arxiv.org/pdf/2305.11627|Ma et al 2023 - LLM-Pruner: On the Structural Pruning of Large Language Models]] [[https://github.com/horseee/LLM-Pruner|github]]
    * [[https://arxiv.org/pdf/2310.06694|Xia et al 2023 - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning]]
    * [[https://arxiv.org/pdf/2402.02834|Kim et al 2024 - Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods]]
    * [[https://arxiv.org/pdf/2403.03853|Men et al 2024 - ShortGPT: Layers in Large Language Models are More Redundant Than You Expect]]
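Many of the methods above refine the classic magnitude-pruning baseline: rank weights by absolute value and zero out the smallest fraction. A minimal NumPy sketch of that baseline (illustrative only; ''magnitude_prune'' is a hypothetical helper, not any listed paper's implementation):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero the smallest-|w| entries.

    sparsity is the fraction of weights to remove (0.0 - 1.0).
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value; prune everything at or below it.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned = magnitude_prune(w, sparsity=0.9)
print(f"sparsity achieved: {np.mean(pruned == 0):.2f}")
```

Structured variants (as in Block Pruning or LLM-Pruner above) instead remove whole blocks, heads, or channels so the speedup is realizable on dense hardware.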
  
===== Quantization =====
  * [[https://www.aclweb.org/anthology/2020.ngt-1.4.pdf|Aji & Heafield 2020 - Compressing Neural Machine Translation Models with 4-bit Precision]]
  * [[https://arxiv.org/pdf/2101.01321.pdf|Kim et al 2021 - I-BERT: Integer-only BERT Quantization]]
  * **Empirical Studies**
    * [[https://arxiv.org/pdf/2505.02214|Zheng et al 2025 - An Empirical Study of Qwen3 Quantization]]
    * [[https://aclanthology.org/2024.lrec-main.461.pdf|Liu et al 2024 - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study]]
    * [[https://arxiv.org/pdf/2504.04823|Liu et al 2025 - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models]]
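In its simplest form, post-training quantization maps float weights to low-bit integers with a single scale factor. A minimal symmetric per-tensor int8 round-to-nearest sketch (an illustrative baseline only, not the calibration- or solver-based methods surveyed above):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
# Round-to-nearest keeps every in-range value within half a quantization step.
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs round-trip error: {err:.4f} (scale={scale:.4f})")
```

Per-channel scales, activation calibration, and outlier handling are the usual next steps beyond this per-tensor baseline.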
  
==== During Training ====
  * [[https://arxiv.org/pdf/1910.01108.pdf|Sanh et al 2019 - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter]]
  * [[https://www.aclweb.org/anthology/2021.eacl-main.238.pdf|Zhao et al 2021 - Extremely Small BERT Models from Mixed-Vocabulary Training]] Most of BERT's memory footprint comes from the input embeddings; this method uses distillation with a mixed vocabulary to compress BERT-Large roughly an order of magnitude further than prior distilled models.

===== Parameter Sharing =====
  * HashedNets: [[https://arxiv.org/pdf/1504.04788|Chen et al 2015 - Compressing Neural Networks with the Hashing Trick]] Randomly shares weights by hashing each connection to a slot in a small parameter array, so many connections reuse the same stored value.
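The hashing trick above can be sketched in a few lines: every entry of a large "virtual" weight matrix is fetched from a small shared parameter vector via a cheap hash of its indices. A rough NumPy illustration (the integer hash constants here are arbitrary stand-ins, not the paper's hash family, and HashedNets' sign hash is omitted):

```python
import numpy as np

def hashed_weight_matrix(shape, shared_params: np.ndarray) -> np.ndarray:
    """Build a virtual weight matrix whose entries all come from a small
    shared parameter vector, selected by hashing each (i, j) index."""
    rows, cols = shape
    i, j = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Cheap deterministic integer hash of (i, j) into the shared array.
    idx = (i * 2654435761 + j * 97) % shared_params.size
    return shared_params[idx]

shared = np.random.default_rng(1).normal(size=100)  # only 100 real parameters
W = hashed_weight_matrix((64, 64), shared)          # 4096 virtual weights
print(W.shape, np.unique(W).size)  # at most 100 distinct values
```

During training, the gradient of each shared slot is the sum of gradients over all connections hashed to it, which is what makes the compressed parameterization trainable.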
  
===== Conferences and Workshops =====
  
===== Related Pages =====
  * [[Conditional Computation]] Invokes only certain parts of the network for each instance (e.g. early exit)
  * [[Edge Computing]]
  * [[Efficient NNs]]
  * [[Knowledge Distillation]]
  * [[nn_sparsity|Sparsity in Neural Networks]]
  
ml/model_compression.txt · Last modified: 2025/05/12 09:00 by jmflanig