====== Attention Mechanisms ======

===== Overviews =====

  * [[https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html|Attention? Attention! Blog post by Weng]]
  * [[https://towardsdatascience.com/attention-networks-c735befb5e9f|Brief Introduction to Attention Models]]
  * [[https://arxiv.org/pdf/1904.02874.pdf|Chaudhari et al 2019 - An Attentive Survey of Attention Models]]

===== Summary of Attention Mechanisms =====

From [[https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html|Attention? Attention! Blog post by Weng]]:

{{media:attention-mechanisms.png}}

===== Key Papers =====

  * [[https://arxiv.org/pdf/1308.0850.pdf|Graves 2013 - Generating Sequences With Recurrent Neural Networks]] Uses an alignment mechanism for handwriting generation, similar to the attention mechanism. The [[https://www.deeplearningbook.org/contents/rnn.html|Deep Learning Book]] (p. 415, end of Ch. 10) notes: "The idea of attention mechanisms for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with an attention mechanism that was constrained to move only forward in time through the sequence."
  * **[[https://arxiv.org/pdf/1409.0473.pdf|Bahdanau et al 2014 - Neural Machine Translation by Jointly Learning to Align and Translate]]** The paper that started it all. Introduced the attention mechanism, initiated the deep learning revolution in NLP, and got neural machine translation to actually work.
  * [[https://arxiv.org/pdf/1412.7755.pdf|Ba et al 2014 - Multiple Object Recognition with Visual Attention]]
  * **[[https://arxiv.org/pdf/1508.04025.pdf|Luong et al 2015 - Effective Approaches to Attention-based Neural Machine Translation]]** Introduced dot-product (multiplicative) attention.
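Dot-product attention is simple enough to sketch in a few lines: score each key against the query by a dot product, softmax the scores, and return the weighted sum of the values. A minimal, illustrative pure-Python version (not from any of the papers' codebases; the 1/sqrt(d) scaling later added by Vaswani et al is omitted):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_product_attention(query, keys, values):
    """Luong-style multiplicative attention: score(q, k_i) = q . k_i.
    Illustrative sketch only."""
    weights = softmax([dot(query, k) for k in keys])
    # Output is the attention-weighted sum of the value vectors.
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
```

Because the weights are a softmax, each output is a convex combination of the value vectors, with more mass on values whose keys align with the query.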
  * **[[https://arxiv.org/pdf/1601.06733.pdf|Cheng et al 2016 - Long Short-Term Memory-Networks for Machine Reading]]** Introduced self-attention under the name "intra-attention" (section 3.2, in the context of Long Short-Term Memory-Networks, LSTMNs). See Fig. 1 for a picture.
  * [[https://arxiv.org/pdf/1606.01933.pdf|Parikh et al 2016 - A Decomposable Attention Model for Natural Language Inference]] Uses the intra-attention of [[https://arxiv.org/pdf/1601.06733.pdf|Cheng 2016]] and, according to the Transformer paper, inspired the self-attention in the Transformer.
  * **[[https://arxiv.org/pdf/1703.03130.pdf|Lin et al 2017 - A Structured Self-Attentive Sentence Embedding]]** Introduces the term "self-attention," which they say is slightly different from Cheng et al's intra-attention.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]] Introduced multi-head attention and the Transformer architecture.
  * **LSH Attention**
    * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses locality-sensitive hashing (LSH) to speed up attention.
  * **Linearized Attention**
    * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] Misleading name. Linearizes the softmax in the attention layers, which makes attention O(n) to compute.
    * [[https://arxiv.org/pdf/2006.04768.pdf|Wang et al 2020 - Linformer: Self-Attention with Linear Complexity]]
    * Random Feature Attention: [[https://openreview.net/forum?id=QtTKTdVrFBB|Peng et al 2021 - Random Feature Attention]] Uses random features to approximate the softmax, making attention linear in sequence length (with constant per-step cost for autoregressive decoding). Drop-in replacement for standard attention. Shows that it works in the Transformer and is twice as fast.
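The linearization trick behind these papers can be made concrete: instead of materializing the n×n softmax score matrix, a positive feature map φ lets the key–value summary be accumulated once, after which each query is answered against that summary. A minimal pure-Python sketch using the φ(x) = elu(x) + 1 feature map from Katharopoulos et al (illustrative only, not their implementation):

```python
import math

def phi(x):
    # Positive feature map from Katharopoulos et al: elu(x) + 1.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(queries, keys, values):
    """O(n) attention sketch: out_i = phi(q_i) . S / (phi(q_i) . z),
    where S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    d = len(keys[0])
    d_v = len(values[0])
    # Accumulate the (d x d_v) summary S and normalizer z in one O(n) pass.
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    # Each query now costs O(d * d_v), independent of sequence length n.
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out
```

Since φ is positive, the implied weights are positive and sum to 1, so each output is still a weighted average of the values; the causal/autoregressive case updates S and z incrementally, which is the "Transformers are RNNs" observation.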
  * **[[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]**
    * Early related work: [[https://arxiv.org/pdf/2112.05682|Rabe & Staats 2021 - Self-attention Does Not Need $O(n^2)$ Memory]]
  * Single-Headed Gated Attention (SHGA): [[https://arxiv.org/pdf/2209.10655.pdf|Ma et al 2022 - Mega: Moving Average Equipped Gated Attention]] Shows that single-headed gated attention can simulate multi-head attention and is more expressive (see section 3.3 and Theorem 1).
  * **Sparse Attention**
    * Longformer: [[https://arxiv.org/pdf/2004.05150|Beltagy et al 2020 - Longformer: The Long-Document Transformer]]
    * BigBird: [[https://arxiv.org/pdf/2007.14062|Zaheer et al 2020 - Big Bird: Transformers for Longer Sequences]]
    * Hierarchical Attention Transformers (HAT): [[https://arxiv.org/pdf/2210.05529|Chalkidis et al 2022 - An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification]]
  * **MoE Sparse Attention**
    * [[https://arxiv.org/pdf/2210.05144|Zhang et al 2022 - Mixture of Attention Heads: Selecting Attention Heads Per Token]]
    * [[https://arxiv.org/pdf/2312.07987|Csordas et al 2023 - SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention]]
    * [[https://arxiv.org/pdf/2410.11842|Jin et al 2024 - MoH: Multi-Head Attention as Mixture-of-Head Attention]]

===== Papers =====

  * [[https://arxiv.org/pdf/2305.14380.pdf|Ni et al 2023 - Finding the Pillars of Strength for Multi-Head Attention]]

===== Related Pages =====

  * [[ml:nn_architectures|Neural Network Architectures]]
  * [[Transformers]]