Table of Contents
Efficient Neural Networks
Overviews
Efficient Transformers
Related Pages
Efficient Neural Networks
Methods for improving the computational and memory efficiency of neural networks.
Overviews
General
Sze et al 2017 - Efficient Processing of Deep Neural Networks: A Tutorial and Survey
For LLMs
Wan et al 2023 - Efficient Large Language Models: A Survey
Zhou et al 2024 - A Survey on Efficient Inference for Large Language Models
Reasoning LLMs
Wang et al 2025 - Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
Efficient Transformers
Pope et al 2022 - Efficiently Scaling Transformer Inference
Introduced the idea of the KV cache: storing the attention keys and values of earlier tokens so they are not recomputed at every decoding step.
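A minimal sketch of the idea, assuming single-head attention with toy random weights (all names here are illustrative, not from the paper): each decode step projects only the new token and appends its key/value to the cache, so attention over the full prefix never recomputes old projections.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    # q: (d,), K and V: (t, d) stacked over cached positions.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_step(x, W_q, W_k, W_v, cache):
    # Project only the NEW token; reuse all earlier keys/values
    # from the cache instead of recomputing them.
    q, k, v = W_q @ x, W_k @ x, W_v @ x
    cache["K"].append(k)
    cache["V"].append(v)
    return attention(q, np.stack(cache["K"]), np.stack(cache["V"]))

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):  # autoregressive decode loop over 5 toy tokens
    out = decode_step(rng.standard_normal(d), W_q, W_k, W_v, cache)
```

The trade-off is memory: the cache grows linearly with sequence length, which is what the eviction methods below address.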
Gim et al 2023 - Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Zhang et al 2023 - H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Evicts tokens from the KV cache, keeping only the most important ones (the heavy hitters, H2s), identified by their accumulated attention scores.
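A rough sketch of the selection rule, under the assumption (consistent with the paper's description, but simplified here) that each cached position carries an accumulated attention score and that a recency window is always retained; function and parameter names are hypothetical:

```python
import numpy as np

def h2o_keep(acc_scores, recent, budget):
    """Return the cache positions to keep: the `recent` newest positions
    plus the highest accumulated-attention ('heavy hitter') positions,
    up to `budget` total."""
    n = len(acc_scores)
    keep = set(range(max(0, n - recent), n))   # always keep the recency window
    heavy = np.argsort(acc_scores)[::-1]       # positions by accumulated score, descending
    for i in heavy:
        if len(keep) >= budget:
            break
        keep.add(int(i))
    return sorted(keep)

# Toy accumulated attention mass per cached token position.
scores = np.array([0.9, 0.1, 0.05, 0.8, 0.02, 0.3, 0.01, 0.4])
kept = h2o_keep(scores, recent=2, budget=4)
print(kept)  # [0, 3, 6, 7]: positions 6-7 are recent, 0 and 3 are heavy hitters
```

Everything outside `kept` would be dropped from the cache, bounding memory at `budget` entries regardless of sequence length.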
Related Pages
Edge Computing
GPU Deep Learning
Model Compression
Systems & ML