====== ml:efficient_nns ======

//Created 2025/03/26 20:10 by jmflanig; current revision 2025/05/07 06:17 by jmflanig.//
  
===== Overviews =====
  * **General**
    * [[https://arxiv.org/pdf/1703.09039|Sze et al 2017 - Efficient Processing of Deep Neural Networks: A Tutorial and Survey]]
  * **For LLMs**
    * [[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]]
    * [[https://arxiv.org/pdf/2404.14294|Zhou et al 2024 - A Survey on Efficient Inference for Large Language Models]]
    * **Reasoning LLMs**
      * [[https://arxiv.org/pdf/2503.24377|Wang et al 2025 - Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models]]
===== Efficient Transformers =====
  * [[https://arxiv.org/pdf/2211.05102|Pope et al 2022 - Efficiently Scaling Transformer Inference]] Analyzes the memory and latency costs of serving large transformers, including the KV cache, and how to partition models across accelerators.
  * [[https://arxiv.org/pdf/2311.04934|Gim et al 2023 - Prompt Cache: Modular Attention Reuse for Low-Latency Inference]] Precomputes and reuses attention states for text segments that recur across prompts.
  * [[https://arxiv.org/pdf/2306.14048|Zhang et al 2023 - H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models]] Evicts tokens from the KV cache, keeping only the most important ones (the heavy hitters, H2s).
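The heavy-hitter idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's exact algorithm): it assumes we track a cumulative attention score per cached token, always keep the most recent tokens, and fill the remaining cache budget with the highest-scoring "heavy hitters". The function name ''evict_kv_cache'' and the ''keep_recent'' heuristic are assumptions for the sake of the example.

```python
# Sketch of heavy-hitter KV-cache eviction, in the spirit of H2O.
# Assumption: attn_scores[i] is the cumulative attention mass that
# cached token i has received so far (higher = more important).
import numpy as np

def evict_kv_cache(keys, values, attn_scores, budget, keep_recent=2):
    """Keep at most `budget` cached tokens: the newest `keep_recent`
    tokens plus the highest-scoring older tokens (the heavy hitters)."""
    n = keys.shape[0]
    if n <= budget:
        return keys, values, attn_scores
    recent = np.arange(n - keep_recent, n)              # always keep newest tokens
    older = np.argsort(attn_scores[: n - keep_recent])  # older tokens, ascending score
    heavy = older[-(budget - keep_recent):]             # top-scoring heavy hitters
    keep = np.sort(np.concatenate([heavy, recent]))     # preserve token order
    return keys[keep], values[keep], attn_scores[keep]

# Toy example: 6 cached tokens, budget of 4.
keys = np.arange(6, dtype=float).reshape(6, 1)
vals = keys.copy()
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.3, 0.2])  # cumulative attention mass
k2, v2, s2 = evict_kv_cache(keys, vals, scores, budget=4)
# Tokens 4 and 5 are kept for recency; tokens 0 and 2 as heavy hitters.
```

The real method also has to decide *when* to score (H2O accumulates attention online during decoding) and operates per attention head, but the budget-plus-recency structure is the core of the eviction policy.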
  
===== Related Pages =====
  * [[Edge Computing]]
  * [[GPU Deep Learning]]
  * [[Model Compression]]
  * [[Systems & ML]]
ml/efficient_nns.1743019828.txt.gz · Last modified: 2025/03/26 20:10 by jmflanig
