====== Attention Mechanisms ======

===== Key Papers =====
  * [[https://arxiv.org/pdf/1606.01933.pdf|Parikh et al 2016 - A Decomposable Attention Model for Natural Language Inference]] Uses intra-attention from [[https://arxiv.org/pdf/1601.06733.pdf|Cheng 2016]] and, according to the Transformer paper, inspired the Transformer's self-attention.
  * **[[https://arxiv.org/pdf/1703.03130.pdf|Lin et al 2017 - A Structured Self-Attentive Sentence Embedding]]** Introduces the term "self-attention," which they say is slightly different from Cheng et al's intra-attention.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]] Introduced multi-head attention and the Transformer architecture.
  * **LSH Attention**
    * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses locality-sensitive hashing (LSH) to speed up attention.
  * **Linearized Attention**
    * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] Misleading name: this paper linearizes the softmax in the attention layers, which makes attention O(n) to compute.
    * [[https://arxiv.org/pdf/2006.04768.pdf|Wang et al 2020 - Linformer: Self-Attention with Linear Complexity]]
  * **Random Feature Attention**
    * [[https://openreview.net/forum?id=QtTKTdVrFBB|Peng et al 2021 - Random Feature Attention]] Uses random features to approximate the softmax, making attention linear in sequence length (constant cost per decoding step). A drop-in replacement for standard attention; shown to work in the Transformer and to be about twice as fast.
  * **[[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]**
    * Early related work: [[https://arxiv.org/pdf/2112.05682|Rabe & Staats 2021 - Self-attention Does Not Need $O(n^2)$ Memory]]
  * Single-Headed Gated Attention (SHGA): [[https://arxiv.org/pdf/2209.10655.pdf|Ma et al 2022 - Mega: Moving Average Equipped Gated Attention]] Shows that single-headed gated attention can simulate multi-head attention and is more expressive (see Section 3.3 and Theorem 1).
  * **Sparse Attention**
    * Longformer
    * BigBird
    * Hierarchical Attention Transformers (HAT): [[https://arxiv.org/pdf/2210.05529|Chalkidis et al 2022 - An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification]]
  * **MoE Sparse Attention**
    * [[https://arxiv.org/pdf/2210.05144|Zhang et al 2022 - Mixture of Attention Heads: Selecting Attention Heads Per Token]]
    * [[https://arxiv.org/pdf/2312.07987|Csordas et al 2023 - SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention]]
    * [[https://arxiv.org/pdf/2410.11842|Jin et al 2024 - MoH: Multi-Head Attention as Mixture-of-Head Attention]]
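The linearization trick in Katharopoulos et al 2020 can be sketched in a few lines of NumPy. This is a minimal non-causal sketch, not the paper's implementation: it uses their elu(x)+1 feature map and re-associates the matrix products so cost grows linearly in sequence length.

```python
import numpy as np

def elu_feature(x):
    # elu(x) + 1: the positive feature map phi from Katharopoulos et al 2020.
    return np.where(x > 0.0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linearized attention.

    Replacing exp(q.k) with phi(q).phi(k) lets us regroup
    (phi(Q) phi(K)^T) V  as  phi(Q) (phi(K)^T V),
    so the cost is O(n d^2) instead of O(n^2 d) for sequence length n.
    """
    Qp, Kp = elu_feature(Q), elu_feature(K)
    KV = Kp.T @ V                # (d, d_v): one summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)      # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

By associativity this is exactly the row-normalized phi(Q) phi(K)^T attention without ever forming the n-by-n matrix; the causal case in the paper replaces the global sums with running prefix sums, which is what makes the Transformer behave like an RNN.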
  
===== Papers =====
  
===== Related Pages =====
  * [[ml:nn_architectures|Neural Network Architectures]]
  * [[Transformers]]
  
nlp/attention_mechanisms · Last modified: 2025/04/04 23:53 by jmflanig
