====== Attention Mechanisms ======

===== Key Papers =====
  * [[https://arxiv.org/pdf/1606.01933.pdf|Parikh et al 2016 - A Decomposable Attention Model for Natural Language Inference]] Uses intra-attention from [[https://arxiv.org/pdf/1601.06733.pdf|Cheng 2016]] and, according to the Transformer paper, inspired the Transformer's self-attention.
  * **[[https://arxiv.org/pdf/1703.03130.pdf|Lin et al 2017 - A Structured Self-Attentive Sentence Embedding]]** Introduces the term "self-attention," which they say is slightly different from Cheng et al's intra-attention.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]] Introduced multi-head attention and the Transformer architecture.
  * **LSH Attention**
    * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses locality-sensitive hashing (LSH) to speed up attention.
  * **Linearized Attention**
    * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] Misleading name: this paper linearizes the softmax in the attention layers, which makes attention O(n) to compute.
    * [[https://arxiv.org/pdf/2006.04768.pdf|Wang et al 2020 - Linformer: Self-Attention with Linear Complexity]]
  * **Random Feature Attention**
    * [[https://openreview.net/forum?id=QtTKTdVrFBB|Peng et al 2021 - Random Feature Attention]] Uses random features to approximate the softmax, making attention linear in sequence length (constant cost per decoding step). A drop-in replacement for standard attention; shown to work in the Transformer and to be about twice as fast.
  * **[[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]**
    * Early related work: [[https://arxiv.org/pdf/2112.05682|Rabe & Staats 2021 - Self-attention Does Not Need $O(n^2)$ Memory]]
  * Single-Headed Gated Attention (SHGA): [[https://arxiv.org/pdf/2209.10655.pdf|Ma et al 2022 - Mega: Moving Average Equipped Gated Attention]] Shows that single-headed gated attention can simulate multi-head attention and is more expressive (see Section 3.3 and Theorem 1).
  * **Sparse Attention**
    * Longformer
    * BigBird
    * Hierarchical Attention Transformers (HAT): [[https://arxiv.org/pdf/2210.05529|Chalkidis et al 2022 - An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification]]
  * **MoE Sparse Attention**
    * [[https://arxiv.org/pdf/2210.05144|Zhang et al 2022 - Mixture of Attention Heads: Selecting Attention Heads Per Token]]
    * [[https://arxiv.org/pdf/2312.07987|Csordas et al 2023 - SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention]]
    * [[https://arxiv.org/pdf/2410.11842|Jin et al 2024 - MoH: Multi-Head Attention as Mixture-of-Head Attention]]
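The linearization trick in Katharopoulos et al 2020 can be sketched in a few lines of NumPy. This is a minimal non-causal sketch, not the paper's implementation: it uses their elu(x)+1 feature map and re-associates the matrix products so cost grows linearly in sequence length.

```python
import numpy as np

def elu_feature(x):
    # elu(x) + 1: the positive feature map phi from Katharopoulos et al 2020.
    return np.where(x > 0.0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linearized attention.

    Replacing exp(q.k) with phi(q).phi(k) lets us regroup
    (phi(Q) phi(K)^T) V  as  phi(Q) (phi(K)^T V),
    so the cost is O(n d^2) instead of O(n^2 d) for sequence length n.
    """
    Qp, Kp = elu_feature(Q), elu_feature(K)
    KV = Kp.T @ V                # (d, d_v): one summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)      # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

By associativity this is exactly the row-normalized phi(Q) phi(K)^T attention without ever forming the n-by-n matrix; the causal case in the paper replaces the global sums with running prefix sums, which is what makes the Transformer behave like an RNN.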
  
===== Papers =====
  
===== Related Pages =====
  * [[ml:nn_architectures|Neural Network Architectures]]
  * [[Transformers]]
  
nlp/attention_mechanisms · Last modified: 2025/04/04 23:53 by jmflanig
