Transformers
Overview
- See Transformers in the ML Overview for introductory blog posts
- Original paper: Vaswani et al 2017 - Attention Is All You Need
- Textbook (SLP): Ch 9.7: Transformers
- A walkthrough of the Transformer architecture code. Contains a very good picture of the computation graph.
Surveys
Transformer Properties
Time and Space Complexity: The Transformer uses O(n^2) computation time and O(n^2) memory (Subramanian et al 2019) because of the n x n attention matrix. However, the experiments in Subramanian et al 2019 (fig 2) seem to show a linear increase in memory usage with sequence length, presumably because the attention matrix does not dominate the memory footprint at the lengths they test. A minimal sketch of the quadratic attention computation is below.
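A rough single-head sketch of scaled dot-product attention (my illustration, not code from the cited papers); the (n, n) score matrix is what makes both time and memory quadratic in the sequence length n:
<code python>
# Minimal scaled dot-product attention sketch (single head, no batching).
# The (n, n) `scores` matrix is the source of the O(n^2) time and memory cost.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (n, d) arrays for a sequence of length n
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (n, n) -- quadratic in n
    weights = softmax(scores, axis=-1)
    return weights @ V               # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(out.shape)  # (512, 64)
</code>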
Expressiveness and Representation Power:
See also the group FLaNN (Formal Languages and Neural Networks).
- Overviews
- Hahn 2019 - Theoretical Limitations of Self-Attention in Neural Sequence Models. Indicates Transformers can't even recognize some simple finite-state (regular) languages, such as PARITY.
- Henderson 2020 - The Unstoppable Rise of Computational Linguistics in Deep Learning Argues why the Transformer is so good at language
- Transformer Programs
- Follow up work:
- Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought. CoT increases the expressive power of Transformers: with a CoT whose length is linear in the input length they can recognize regular languages, and with a CoT whose length is polynomial in the input length (plus pre-norm) they recognize exactly the class of polynomial-time solvable problems.
Analysis and Interpretation
See also Mechanistic Interpretability and Transformer Circuits.
- nostalgebraist 2020 - The Logit Lens. Used in many places; see LogitLens4LLMs for some examples, and the sketch after this list.
- For decoders/LLMs
- Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs Finds the implicit vocabulary in a Transformer decoder model
- Transformer Programs
- Rank Collapse in Transformers
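A minimal sketch of the logit-lens idea, assuming a HuggingFace GPT-2 model; the model choice, prompt, and details here are illustrative, not from the original post. The point is to apply the final layer norm and the unembedding matrix to each intermediate hidden state and see which token the model would predict at that depth:
<code python>
# Logit-lens sketch: decode intermediate hidden states with the output unembedding.
# Model choice (gpt2) and prompt are assumptions for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):   # embeddings + one state per block
    h_last = model.transformer.ln_f(h[:, -1])   # final layer norm, last position
    logits = model.lm_head(h_last)              # unembed into vocabulary space
    top = logits.argmax(-1)
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
</code>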
Transformer Variants: Overviews
- Blog post: Lil'Log: The Transformer Family
- Narang et al 2021 - Do Transformer Modifications Transfer Across Implementations and Applications? Experimental comparison of Transformer model variants
Improvements
- So et al 2019 - The Evolved Transformer Neural architecture search for Transformer variants
- Nguyen & Salazar 2019 - Transformers without Tears: Improving the Normalization of Self-Attention Many of these changes are default in the popular Transformer codebases
- Shazeer 2019 - Fast Transformer Decoding: One Write-Head is All You Need (twitter). Multi-query attention: a single key/value head shared across all query heads, which shrinks the KV cache. Used in AlphaCode to speed up decoding; see the sketch after this list.
- Sukhbaatar et al 2019 - Adaptive Attention Span in Transformers Related to Milad's work.
- Dao et al 2023 - Flash-Decoding. Speeds up FlashAttention for decoding. (Essentially, fixes a problem in the way decoding was implemented initially, so it's much faster. The new way is the more natural way it should have been implemented.)
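A rough numpy sketch of the multi-query attention idea from Shazeer 2019: each head keeps its own queries, but a single key/value head is shared by all heads. Shapes and details here are illustrative assumptions; the decoding speedup comes from the much smaller KV cache, not from this forward pass itself.
<code python>
# Multi-query attention sketch: one shared K/V head, many query heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K, V):
    # Q: (h, n, d) -- one set of queries per head
    # K, V: (n, d)  -- a single shared key/value head (the "one write-head")
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (h, n, n)
    return softmax(scores, axis=-1) @ V  # (h, n, d)

h, n, d = 8, 16, 64
rng = np.random.default_rng(0)
out = multi_query_attention(rng.normal(size=(h, n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
print(out.shape)  # (8, 16, 64)
</code>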
Mixture-of-Experts (MoE) Transformers
Ablation Experiments on the Transformer
These are ablation experiments on the Transformer, such as ablating the multi-head attention, or comparing to an LSTM with multi-head attention.
- Chen et al 2018 - The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation Outperforms the Transformer with a stacked BiLSTM with multi-head attention and other tricks from the Transformer. Slightly slower per token, but converges faster.
- Domhan 2018 - How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures Tries combinations of Transformer, RNN, CNN decoder and encoder layers. Shows that “one can bring recurrent and convolutional models very close to the Transformer performance by borrowing concepts from the Transformer architecture, but not using self-attention.” In particular, they find that:
- “Source attention on lower encoder layers brings no additional benefit.”
- “Multiple source attention layers and residual feed-forward layers are key.”
- “Self-attention is more important for the source than for the target side.”
- Simple Self-Attention Network (SSAN): Ambartsoumian & Popowich 2018 - Self-Attention: A Better Building Block for Sentiment Analysis Neural Network Classifiers. Compares the Transformer to 1-layer and 2-layer single-headed self-attention networks.
- RNN with attention back to previous states. Has anyone compared this to the transformer? I can't remember.
- 2019 - Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder Shared weights for encoder and decoder. Very natural if you consider the seq2seq transformer as a conditional language model.
- Fixed (not learned) attention patterns in the encoder: Raganato et al 2020 - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
- Linearizing the softmax in the attention - O(n) to compute: Katharopoulos et al 2020
- No position embeddings (NoPE): 2023 - The Impact of Positional Encoding on Length Generalization in Transformers
Pruning Attention Heads
Training
- Initialization issues
- Optimizer issues
- Warm-up, see Warm-up
- Normalization issues
- RMSNorm. An improvement to layer normalization; shown by Narang et al 2021 to work well for Transformers. See the sketch after this list.
- Xiong et al 2020 - On Layer Normalization in the Transformer Architecture Says pre-norm transformers don't need warm-up and are often better
- Stabilization of Training
- Miscellaneous topics
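A minimal sketch of RMSNorm, assuming the standard root-mean-square formulation with a learned per-dimension gain; the epsilon value is an assumption:
<code python>
# RMSNorm sketch: normalize by the root-mean-square activation (no mean subtraction,
# unlike LayerNorm); `g` is a learned per-dimension gain.
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # x: (..., d) activations, g: (d,) learned gain
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

d = 8
x = np.random.default_rng(0).normal(size=(2, d))
print(rms_norm(x, np.ones(d)).shape)  # (2, 8)
</code>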
Efficient Transformers
Survey Papers
Papers
- Kitaev et al 2020 - Reformer: The Efficient Transformer Uses LSH to speed up attention
- Hofstätter et al 2020 - Local Self-Attention over Long Text for Efficient Document Retrieval Sliding window local attention mechanism
- Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. Misleading name. This paper linearizes the softmax in the attention layers, which makes attention O(n) to compute; see the sketch after this list.
- Choromanski et al 2020 - Rethinking Attention with Performers Fast attention via positive orthogonal random features.
- Xiong et al 2021 - Nystromformer: A Nystrom-based Algorithm for Approximating Self-Attention. Similar in spirit to a low-rank (SVD-like) approximation; approximately linearizes the attention by selecting landmark tokens before the softmax.
- Peng et al 2021 - Random Feature Attention. Uses random features to approximate the softmax, making attention linear in sequence length (constant memory per step in its recurrent form). Drop-in replacement for standard attention. Experiments with the Transformer.
- Hourglass Transformer: Hierarchical Transformers Are More Efficient Language Models - Has three blocks of layers: ones that downsample the tokens through pooling, ones that process, and ones that upsample.
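A sketch of the linear-attention trick from Katharopoulos et al 2020 (non-causal case), assuming the elu(x)+1 feature map from the paper; the other details are illustrative. Replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) lets the K-V summary be computed once, so the cost is O(n d^2) rather than O(n^2 d):
<code python>
# Linear attention sketch (non-causal): avoid forming the (n, n) attention matrix.
import numpy as np

def phi(x):
    # elu(x) + 1, a positive feature map (as in the paper)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (n, d), V: (n, d_v)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d_v), computed once
    Z = Qp @ Kp.sum(axis=0) + eps   # (n,) normalizer
    return (Qp @ KV) / Z[:, None]   # (n, d_v)

n, d = 1024, 64
rng = np.random.default_rng(0)
out = linear_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(out.shape)  # (1024, 64)
</code>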
Datasets and Benchmarks
- LRA (pronounced "ELRA"): Tay et al 2020 - Long Range Arena: A Benchmark for Efficient Transformers (not just NLP tasks). Code: https://github.com/google-research/long-range-arena
Long-Context Transformers
Survey Papers
Papers
- Han et al 2023 - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models. Expands the length limit to 200 million tokens with no additional training and runs in O(n) time; see the sketch below.
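As an illustration of the kind of attention pattern involved, here is a sketch of a Lambda-shaped causal mask (attend to the first few tokens plus a local window) in the spirit of LM-Infinite; the parameter names and values are assumptions, not the paper's settings. Because each query attends to a bounded number of keys, total cost grows linearly with sequence length:
<code python>
# Lambda-shaped causal attention mask sketch: global prefix + sliding local window.
import numpy as np

def lambda_mask(n, n_global=4, window=8):
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i
    keep = (j < n_global) | (i - j < window)
    return causal & keep        # (n, n) boolean mask; True = attend

print(lambda_mask(12, n_global=2, window=4).astype(int))
</code>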
Position Embeddings
- Learned position embeddings: Gehring et al 2017 - Convolutional Sequence to Sequence Learning
- 2022 - Randomized Positional Encodings Boost Length Generalization of Transformers (Submitted to ACL 2022, not accepted.) Has good related work in section 4, comparison to prior work.
- Press et al 2021 - Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ALiBi replaces position embeddings with a static, head-specific linear penalty on attention scores proportional to query-key distance; see the sketch after this list.
- No Position Embeddings Transformer (NoPE): 2023 - The Impact of Positional Encoding on Length Generalization in Transformers
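A sketch of the ALiBi bias computation, assuming the paper's geometric slope schedule for a power-of-two number of heads; the rest is illustrative. The bias is simply added to the pre-softmax attention scores:
<code python>
# ALiBi sketch: per-head linear distance penalty added to causal attention scores.
import numpy as np

def alibi_bias(n, n_heads):
    # Geometric slope schedule from the paper (power-of-two head counts): 2^(-8k/n_heads)
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)   # (h,)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = np.where(j <= i, i - j, 0)             # causal distances; upper triangle unused
    return -slopes[:, None, None] * dist          # (h, n, n), added to attention scores

print(alibi_bias(n=6, n_heads=4)[0])   # bias for the first (steepest-penalty) head
</code>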
Software
See also Software.
Related Pages