nlp:attention_mechanisms [2025/04/04 23:53] (current) – [Key Papers] jmflanig
  * **[[https://arxiv.org/pdf/1703.03130.pdf|Lin et al 2017 - A Structured Self-Attentive Sentence Embedding]]** Introduces the term "self-attention," which they say is slightly different from Cheng et al's intra-attention.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]] Introduced multi-head attention and the Transformer architecture.
  * **LSH Attention**
    * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses locality-sensitive hashing (LSH) to speed up attention.
  * **Linearized Attention**
    * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] The name is somewhat misleading: the paper linearizes the softmax in the attention layers, which makes attention O(n) to compute.
  * Random Feature Attention: [[https://openreview.net/forum?id=QtTKTdVrFBB|Peng et al 2021 - Random Feature Attention]] Uses random features to approximate the softmax, yielding linear-time attention with constant memory per decoding step. Drop-in replacement for standard attention. Shows that it works in the Transformer and runs about twice as fast.
  * **[[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]**
    * Early related work: [[https://arxiv.org/pdf/2112.05682|Rabe & Staats 2021 - Self-attention Does Not Need $O(n^2)$ Memory]]
  * Single-Headed Gated Attention (SHGA): [[https://arxiv.org/pdf/2209.10655.pdf|Ma et al 2022 - Mega: Moving Average Equipped Gated Attention]] Shows that single-headed gated attention can simulate multi-head attention, and is more expressive (see section 3.3 and Theorem 1).
  * **Sparse Attention**
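The linearization trick behind Katharopoulos et al is compact enough to sketch: replacing the softmax with a positive feature map φ lets matrix associativity compute φ(Q)(φ(K)ᵀV) instead of (QKᵀ)V, turning the O(n²) cost into O(n). A minimal NumPy sketch (illustrative only, not the authors' code; the feature map φ(x) = elu(x) + 1 follows the paper):

```python
import numpy as np

def feature_map(x):
    # phi(x) = elu(x) + 1, a positive map standing in for exp() in softmax
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Softmax-free attention: phi(Q) @ (phi(K).T @ V), normalized per row.
    # Computing K^T V first costs O(n d^2) instead of O(n^2 d).
    Qp, Kp = feature_map(Q), feature_map(K)      # each (n, d)
    KV = Kp.T @ V                                # (d, d), summed over positions
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T     # (n, 1) row normalizer
    return (Qp @ KV) / Z

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

The result matches the explicit O(n²) computation with weights φ(q_i)·φ(k_j); only the order of matrix multiplications changes.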
