====== Knowledge Distillation ======

Various papers related to distillation.

From [[https://arxiv.org/pdf/2006.11316.pdf|Iandola 2020]]: "While the term 'knowledge distillation' was coined by Hinton et al. 2015 to describe a specific method and equation, the term 'distillation' is now used in reference to a diverse range of approaches where a 'student' network is trained to replicate a 'teacher' network." (A minimal code sketch of the original formulation is at the bottom of this page.)

===== Overviews =====

  * Section 4.2.2 of [[https://arxiv.org/pdf/2006.11316.pdf|Iandola 2020]]
  * [[https://arxiv.org/pdf/2402.13116|Xu et al 2024 - A Survey on Knowledge Distillation of Large Language Models]]

===== Papers =====

  * [[https://arxiv.org/pdf/1503.02531.pdf|Hinton et al 2015 - Distilling the Knowledge in a Neural Network]] (The paper that introduced **knowledge distillation**.)
  * [[https://arxiv.org/pdf/1606.07947.pdf|Kim & Rush 2016 - Sequence-Level Knowledge Distillation]] (The first paper to apply knowledge distillation to seq2seq models.)
  * Multi-step KD: [[https://arxiv.org/pdf/1902.03393.pdf|Mirzadeh et al 2019 - Improved Knowledge Distillation via Teacher Assistant]]
  * [[https://arxiv.org/pdf/2104.06457.pdf|Inaguma et al 2021 - Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation]]

===== Related Pages =====

  * [[Ensembling]]
  * [[Model Compression]]
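
===== Example: Hinton-style Distillation Loss =====

A minimal sketch of the Hinton et al. 2015 formulation referenced above, assuming a PyTorch setup. The function name distillation_loss and the defaults T=2.0 and alpha=0.5 are illustrative choices, not values prescribed by the paper. The idea: the student matches the teacher's temperature-softened class probabilities (soft term) while also training on the ground-truth labels (hard term).

<code python>
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (illustrative sketch).

    Soft term: KL divergence between the teacher's and the student's
    temperature-softened output distributions, scaled by T^2 so its
    gradient magnitude stays comparable to the hard term.
    Hard term: ordinary cross-entropy against the ground-truth labels.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a batch of 4 examples with 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)   # in practice: the teacher's (detached) outputs
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
</code>

The later approaches listed above (e.g. sequence-level or teacher-assistant distillation) change what the student matches or add intermediate distillation steps, but keep this basic student-imitates-teacher structure.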