  
===== Overviews =====
  * **General**
    * [[https://arxiv.org/pdf/1703.09039|Sze et al 2017 - Efficient Processing of Deep Neural Networks: A Tutorial and Survey]]
  * **For LLMs**
    * [[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]]

===== Efficient Transformers =====
  * [[https://arxiv.org/pdf/2211.05102|Pope et al 2022 - Efficiently Scaling Transformer Inference]] Introduced the idea of the KV cache.
  * [[https://arxiv.org/pdf/2311.04934|Gim et al 2023 - Prompt Cache: Modular Attention Reuse for Low-Latency Inference]]
  * [[https://arxiv.org/pdf/2306.14048|Zhang et al 2023 - H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models]] Evicts tokens from the KV cache, keeping only the most important ones (the heavy hitters, H2s).
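The two caching ideas above can be sketched together in a toy example: a single-head KV cache that appends each new token's key/value, and an H2O-style eviction rule that accumulates each cached token's attention mass and drops the lightest token once the cache exceeds a budget. This is a minimal illustration, not the papers' implementations; in particular, the real H2O policy also protects a window of recent tokens, which is omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Toy single-head KV cache with H2O-style eviction (illustrative sketch)."""

    def __init__(self, budget):
        self.budget = budget        # max number of tokens kept in the cache
        self.K = np.empty((0, 0))   # cached keys,   shape (t, d)
        self.V = np.empty((0, 0))   # cached values, shape (t, d)
        self.scores = np.empty(0)   # accumulated attention mass per cached token

    def step(self, q, k, v):
        """Append one token's (k, v), attend with query q, evict if over budget."""
        d = k.shape[-1]
        self.K = np.vstack([self.K.reshape(-1, d), k[None]])
        self.V = np.vstack([self.V.reshape(-1, d), v[None]])
        self.scores = np.append(self.scores, 0.0)

        attn = softmax(self.K @ q / np.sqrt(d))  # attention over cached tokens, shape (t,)
        self.scores += attn                      # running "heavy hitter" statistic
        out = attn @ self.V                      # attention output, shape (d,)

        if len(self.scores) > self.budget:
            # Evict the token with the least accumulated attention mass
            # (the opposite of a heavy hitter). Real H2O also always keeps
            # a window of the most recent tokens.
            drop = int(np.argmin(self.scores))
            self.K = np.delete(self.K, drop, axis=0)
            self.V = np.delete(self.V, drop, axis=0)
            self.scores = np.delete(self.scores, drop)
        return out
```

With a budget of 4, decoding 10 tokens leaves exactly 4 entries in the cache, so attention cost per step stays bounded regardless of sequence length.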
  
===== Related Pages =====
ml:efficient_nns · Last modified: 2025/05/07 06:17 by jmflanig
