====== Efficient Neural Networks ======

Methods for improving the efficiency of neural networks.

===== Overviews =====

  * **General**
    * [[https://arxiv.org/pdf/1703.09039|Sze et al 2017 - Efficient Processing of Deep Neural Networks: A Tutorial and Survey]]
  * **For LLMs**
    * [[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]]
    * [[https://arxiv.org/pdf/2404.14294|Zhou et al 2024 - A Survey on Efficient Inference for Large Language Models]]
  * **Reasoning LLMs**
    * [[https://arxiv.org/pdf/2503.24377|Wang et al 2025 - Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models]]

===== Efficient Transformers =====

  * [[https://arxiv.org/pdf/2211.05102|Pope et al 2022 - Efficiently Scaling Transformer Inference]] Analyzes inference scaling trade-offs, including the memory footprint of the KV cache.
  * [[https://arxiv.org/pdf/2311.04934|Gim et al 2023 - Prompt Cache: Modular Attention Reuse for Low-Latency Inference]]
  * [[https://arxiv.org/pdf/2306.14048|Zhang et al 2023 - H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models]] Evicts tokens from the KV cache, keeping only the most important ones (the heavy hitters, H2s).

===== Related Pages =====

  * [[Edge Computing]]
  * [[GPU Deep Learning]]
  * [[Model Compression]]
  * [[Systems & ML]]
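The heavy-hitter eviction idea behind H2O can be sketched in a few lines: score each cached token by its accumulated attention mass, always keep a small window of recent tokens, and evict the lowest-scoring older tokens once the cache exceeds a budget. This is a toy sketch under those assumptions; the function name, data layout, and scores below are illustrative, not the paper's implementation.

```python
# Toy sketch of heavy-hitter KV-cache eviction (after Zhang et al. 2023, H2O).
# Names, data shapes, and scores here are illustrative assumptions,
# not the paper's code.

def evict_to_budget(cache, budget, recent_window=2):
    """Shrink `cache` to at most `budget` entries.

    `cache` is a list of (token, accumulated_attention_score) pairs in
    positional order. We always keep the last `recent_window` tokens,
    then fill the remaining budget with the highest-scoring older
    tokens (the "heavy hitters")."""
    if len(cache) <= budget:
        return cache
    recent = cache[-recent_window:]          # always keep a local window
    older = cache[:-recent_window]
    # Heavy hitters: older tokens with the largest accumulated attention mass.
    keep_n = budget - recent_window
    heavy = sorted(older, key=lambda kv: kv[1], reverse=True)[:keep_n]
    # Restore positional order among the survivors.
    kept = sorted(heavy, key=lambda kv: cache.index(kv)) + recent
    return kept

# Example: a 6-token cache squeezed to a budget of 4.
cache = [("the", 0.9), ("cat", 0.1), ("sat", 0.7),
         ("on", 0.05), ("a", 0.2), ("mat", 0.3)]
print(evict_to_budget(cache, budget=4))
# → [('the', 0.9), ('sat', 0.7), ('a', 0.2), ('mat', 0.3)]
```

In the real method the scores come from summing attention weights over decoding steps, and eviction happens per layer and per head; the point of the sketch is only the selection rule: recent window plus top-scoring survivors.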