====== Attention Mechanisms ======

===== Key Papers =====

  * **[[https://arxiv.org/pdf/1703.03130.pdf|Lin et al 2017 - A Structured Self-Attentive Sentence Embedding]]** Introduces the term "self-attention," which they say is slightly different from Cheng et al's intra-attention (see the first sketch after this list).
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]] Introduced multi-head attention and the Transformer architecture (second sketch below).
  * **LSH Attention**
    * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses LSH to speed up attention (third sketch below).
  * **Linearized Attention**
    * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] Misleading name. This paper linearizes the softmax in the attention layers, which makes attention O(n) to compute (fourth sketch below).
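
First, a minimal NumPy sketch of Lin et al's structured self-attention. The weight names ''W_s1''/''W_s2'' follow the paper's notation, but the shapes in the demo call and the assumption that ''H'' comes from some upstream encoder are illustrative only.

<code python>
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structured_self_attention(H, W_s1, W_s2):
    # H: (n, d) per-token hidden states; W_s1: (d_a, d); W_s2: (r, d_a)
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)  # (r, n): r attention distributions over n tokens
    M = A @ H                                         # (r, d): the sentence embedding matrix
    return M, A

# Illustrative call with arbitrary sizes (n=10 tokens, d=64, d_a=32, r=4 hops):
rng = np.random.default_rng(0)
M, A = structured_self_attention(rng.standard_normal((10, 64)),
                                 rng.standard_normal((32, 64)),
                                 rng.standard_normal((4, 32)))
</code>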
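
Second, a sketch of Vaswani et al's multi-head scaled dot-product attention in the same spirit: a single sequence, self-attention only, no masking or batching; the parameter names are ours.

<code python>
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., n, d_head); returns softmax(Q K^T / sqrt(d_head)) V
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (n, d_model); each W_*: (d_model, d_model)
    n, d_model = X.shape
    d_head = d_model // n_heads

    def heads(W):  # project, then split into (n_heads, n, d_head)
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    out = scaled_dot_product_attention(heads(W_q), heads(W_k), heads(W_v))
    concat = out.transpose(1, 0, 2).reshape(n, d_model)  # re-join the heads
    return concat @ W_o
</code>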
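
Third, for the Reformer, the sketch below only illustrates the LSH bucketing idea: positions are hashed with angular LSH (random rotation, argmax over the ± projections), and attention is masked so a query only attends to keys in its own bucket. The real Reformer ties queries and keys, sorts positions by bucket, and attends within fixed-size chunks to reach O(n log n); this masked version is still O(n²) and is for illustration of the sparsity pattern only.

<code python>
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsh_attention(x, V, n_buckets=8, seed=0):
    # x: (n, d) shared query/key vectors (Reformer ties Q and K); V: (n, d_v)
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ R                                                 # random rotation
    buckets = np.argmax(np.concatenate([proj, -proj], axis=-1),  # angular LSH:
                        axis=-1)                                 # one bucket id per position
    scores = x @ x.T / np.sqrt(x.shape[-1])
    # Mask out all pairs that hashed to different buckets (the source of the speedup):
    scores = np.where(buckets[:, None] != buckets[None, :], -1e9, scores)
    return softmax(scores, axis=-1) @ V
</code>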
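
Finally, a sketch of Katharopoulos et al's linear attention: the softmax similarity exp(q·k) is replaced by the kernel φ(q)ᵀφ(k) with the paper's feature map φ(x) = elu(x) + 1, so the output can be computed by associativity without ever materializing the n×n attention matrix. Non-causal case only; the causal version maintains running sums of the same quantities instead.

<code python>
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 > 0: x + 1 for x > 0, exp(x) otherwise
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (n, d); V: (n, d_v)
    # softmax(Q K^T) V becomes phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    # which is O(n) in sequence length since no n x n matrix is formed.
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    KV = Kp.T @ V            # (d, d_v): sum of key-value outer products
    Z = Qp @ Kp.sum(axis=0)  # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]
</code>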