Attention Mechanisms
Overviews
Summary of Attention Mechanisms
Key Papers
- Graves 2013 - Generating Sequences With Recurrent Neural Networks Uses an alignment mechanism for handwriting generation that is similar to the attention mechanism. The Deep Learning Book (p. 415, end of Ch 10) says: “The idea of attention mechanisms for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with an attention mechanism that was constrained to move only forward in time through the sequence.”
- Bahdanau et al 2014 - Neural Machine Translation by Jointly Learning to Align and Translate The paper that started it all: introduced the attention mechanism for neural machine translation and kicked off the deep learning revolution in NLP. Essentially the paper that got neural machine translation to actually work.
- Luong et al 2015 - Effective Approaches to Attention-based Neural Machine Translation This paper introduced dot-product (multiplicative) attention; see the scoring-function sketch after this list, which contrasts it with Bahdanau's additive attention.
- Cheng et al 2016 - Long Short-Term Memory-Networks for Machine Reading This paper introduced self-attention, there called intra-attention (see section 3.2 on Long Short-Term Memory-Networks, LSTMNs). See Fig 1 for a picture.
- Parikh et al 2016 - A Decomposable Attention Model for Natural Language Inference This paper uses intra-attention from Cheng et al 2016 and, according to the Transformer paper, was the inspiration for self-attention in the Transformer.
- Lin et al 2017 - A Structured Self-Attentive Sentence Embedding Introduces the term “self-attention,” which the authors say is slightly different from Cheng et al's intra-attention (see the self-attention sketch after this list).
- Single-Headed Gated Attention (SHGA): Ma et al 2022 - Mega: Moving Average Equipped Gated Attention Shows that single-headed gated attention can simulate multi-head attention, and is more expressive (see section 3.3 and Theorem 1).
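
To make the two classic scoring functions concrete, here is a minimal NumPy sketch of Bahdanau-style additive attention and Luong-style dot-product attention for a single decoder query attending over encoder states. The shapes, parameter names (W_q, W_k, v), and the single-query framing are illustrative simplifications, not taken from either paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(query, keys, values, W_q, W_k, v):
    """Bahdanau-style (additive) attention, simplified.

    query:  (d_q,)    decoder state
    keys:   (T, d_k)  encoder states
    values: (T, d_v)  usually the same encoder states
    W_q: (d_a, d_q), W_k: (d_a, d_k), v: (d_a,)  learned parameters (names illustrative)
    """
    # score_t = v^T tanh(W_q q + W_k k_t)
    scores = np.tanh(W_q @ query + keys @ W_k.T) @ v   # (T,)
    weights = softmax(scores)                           # alignment weights over positions
    context = weights @ values                          # (d_v,) weighted sum of values
    return context, weights

def dot_product_attention(query, keys, values):
    """Luong-style (dot-product) attention: score_t = q . k_t."""
    scores = keys @ query        # (T,)
    weights = softmax(scores)
    context = weights @ values
    return context, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, d_a = 5, 8, 6
    q = rng.normal(size=d)
    K = rng.normal(size=(T, d))
    ctx, w = dot_product_attention(q, K, K)
    ctx2, w2 = additive_attention(q, K, K,
                                  rng.normal(size=(d_a, d)),
                                  rng.normal(size=(d_a, d)),
                                  rng.normal(size=d_a))
    print(w.round(3), w2.round(3))   # both weight vectors sum to 1
```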
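And a minimal sketch of self-attention (intra-attention), where queries, keys, and values are all projections of the same sequence, so every position attends over every position. The learned projections and the 1/sqrt(d_k) scaling follow the Transformer formulation rather than Cheng et al's LSTMN; names and shapes are again illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence.

    X: (T, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v).
    Returns the attended sequence (T, d_v) and the (T, T) weight matrix,
    where row i is token i's attention distribution over all positions.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot product, as in the Transformer
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_model, d_k = 4, 8, 8
    X = rng.normal(size=(T, d_model))
    out, w = self_attention(X,
                            rng.normal(size=(d_model, d_k)),
                            rng.normal(size=(d_model, d_k)),
                            rng.normal(size=(d_model, d_k)))
    print(out.shape, w.sum(axis=-1).round(3))  # each row of weights sums to 1
```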
Papers
Related Pages
