ml:efficient_nns

  * [[https://arxiv.org/pdf/2211.05102|Pope 2022 - Efficiently Scaling Transformer Inference]] Introduced the idea of the KV cache.
  * [[https://arxiv.org/pdf/2311.04934|Gim et al 2023 - Prompt Cache: Modular Attention Reuse for Low-Latency Inference]]
  * [[https://arxiv.org/pdf/2306.14048|Zhang et al 2023 - H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models]] Evicts tokens from the KV cache, keeping only the most important ones (the heavy hitters, H2s).
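The heavy-hitter idea in the Zhang et al paper can be sketched roughly as follows: under a fixed cache budget, keep the positions that have accumulated the most attention mass, plus a small window of recent positions, and drop the rest. This is a minimal illustrative sketch, not the paper's implementation; the function name, the `(key, value)` pair representation, and the `recent` window parameter are all assumptions made for the example.

```python
def evict_heavy_hitters(kv_cache, attn_scores, budget, recent=4):
    """Heavy-hitter-style KV cache eviction (illustrative sketch).

    kv_cache:    list of (key, value) pairs, one per cached token position
    attn_scores: accumulated attention mass each position has received
    budget:      how many older positions (heavy hitters) to retain
    recent:      how many of the newest positions to always retain
    """
    n = len(kv_cache)
    if n <= budget + recent:
        return kv_cache, attn_scores  # nothing to evict yet

    # Always keep the most recent `recent` positions.
    recent_idx = set(range(n - recent, n))

    # Rank the older positions by accumulated attention; keep the top `budget`.
    older = range(n - recent)
    heavy = sorted(older, key=lambda i: attn_scores[i], reverse=True)[:budget]

    keep = sorted(set(heavy) | recent_idx)
    return ([kv_cache[i] for i in keep],
            [attn_scores[i] for i in keep])
```

Keeping the indices sorted preserves the original token order, so subsequent attention over the pruned cache still sees positions in sequence.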
  
===== Related Pages =====
ml/efficient_nns.1746598612.txt.gz · Last modified: 2025/05/07 06:16 by jmflanig
