====== Transformers ======

===== Overview =====

  * See Transformers in the [[ml:ML Overview]] for introductory blog posts
  * Original paper: [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]]
  * [[https://nlp.seas.harvard.edu/2018/04/03/attention.html|The Annotated Transformer]]
  * Textbook (SLP): [[https://web.stanford.edu/~jurafsky/slp3/9.pdf#page=17|Ch 9.7: Transformers]]
  * [[https://github.com/markriedl/transformer-walkthrough|A walkthrough of transformer architecture code]] Contains a very good picture of the computation graph.

===== Surveys =====

  * [[https://arxiv.org/pdf/2106.04554.pdf|Lin et al 2021 - A Survey of Transformers]]

===== Transformer Properties =====

**Time and Space Complexity:** The Transformer uses O(n^2) computation time and O(n^2) memory ([[https://arxiv.org/pdf/2005.00581.pdf|Subramanian et al 2020]]) due to the attention matrix. However, the experiments in [[https://arxiv.org/pdf/2005.00581.pdf|Subramanian et al 2020]] (fig 2) seem to show a linear increase in memory usage with sequence length, presumably because the attention matrix does not dominate the memory footprint.

**Expressiveness and Representation Power:** See also the group [[https://flann.super.site/|FLaNN]] (Formal Languages and Neural Networks).

  * **Overviews**
    * [[https://arxiv.org/pdf/2311.00208|Strobl et al 2023 - What Formal Languages Can Transformers Express? A Survey]]
  * **[[https://arxiv.org/pdf/1906.06755.pdf|Hahn 2019 - Theoretical Limitations of Self-Attention in Neural Sequence Models]]** Indicates Transformers can't even represent finite state machines
    * [[https://arxiv.org/pdf/2202.12172.pdf|Chiang et al 2022 - Overcoming a Theoretical Limitation of Self-Attention]]
  * [[https://aclanthology.org/2020.acl-main.561.pdf|Henderson 2020 - The Unstoppable Rise of Computational Linguistics in Deep Learning]] Argues why the Transformer is so good at language
  * [[https://arxiv.org/pdf/2009.11264.pdf|Bhattamishra et al 2020 - On the Ability and Limitations of Transformers to Recognize Formal Languages]]
  * [[https://transformer-circuits.pub/2021/framework/index.html|Elhage et al 2021 - A Mathematical Framework for Transformer Circuits]]
  * [[https://aclanthology.org/2022.tacl-1.49.pdf|Merrill et al 2022 - Saturated Transformers are Constant-Depth Threshold Circuits]]
  * **Transformer Programs**
    * **[[https://arxiv.org/pdf/2106.06981.pdf|Weiss et al 2021 - Thinking Like Transformers]]**
    * [[https://arxiv.org/pdf/2301.05062.pdf|Lindner et al 2023 - Tracr: Compiled Transformers as a Laboratory for Interpretability]]
    * [[https://arxiv.org/pdf/2306.01128|Friedman et al 2023 - Learning Transformer Programs]]
  * [[https://arxiv.org/pdf/2207.02098.pdf|Delétang et al 2022 - Neural Networks and the Chomsky Hierarchy]]
  * [[https://arxiv.org/pdf/2210.02671.pdf|Merrill et al 2022 - Transformers Can Be Translated to First-Order Logic with Majority Quantifiers]]
  * [[https://arxiv.org/pdf/2210.10749.pdf|Liu et al 2022 - Transformers Learn Shortcuts to Automata]]
  * [[https://arxiv.org/pdf/2301.10743|Chiang et al 2023 - Tighter Bounds on the Expressivity of Transformer Encoders]]
  * Follow-up work:
    * [[https://arxiv.org/pdf/2210.02671|Merrill & Sabharwal 2023 - A Logic for Expressing Log-Precision Transformers]]
    * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] CoT increases the expressive power of Transformers: with CoT length linear in the input length they can recognize regular languages, and with CoT length polynomial in the input length (plus PreNorm) they recognize exactly the class of polynomial-time solvable problems
    * [[https://arxiv.org/pdf/2505.18948|Merrill & Sabharwal 2025 - Exact Expressive Power of Transformers with Padding]]
  * **[[https://arxiv.org/pdf/2505.23623|Li & Cotterell 2025 - Characterizing the Expressivity of Transformer Language Models]]**

===== Analysis and Interpretation =====

See also [[ml:Mechanistic Interpretability]] and [[https://transformer-circuits.pub/|Transformer Circuits]].

  * [[https://twitter.com/lvwerra/status/1485301457813487619?s=21|Visualization of position embeddings in BERT and GPT-2]] (from [[https://twitter.com/mark_riedl/status/1555188022534176768|here]])
  * [[https://arxiv.org/pdf/2008.02217.pdf|Ramsauer et al 2020 - Hopfield Networks is All You Need]]
  * [[https://arxiv.org/pdf/2012.14913.pdf|Geva et al 2020 - Transformer Feed-Forward Layers Are Key-Value Memories]]
  * [[https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens|2020 - The Logit Lens]] Used in many places; see [[https://arxiv.org/pdf/2503.11667|LogitLens4LLMs]] for some examples
    * [[https://arxiv.org/pdf/2310.03686|Langedijk et al 2023 - DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers]]
  * **For decoders/LLMs**
    * [[https://arxiv.org/pdf/2406.20086|Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs]] Finds the implicit vocabulary in a Transformer decoder model
  * **Transformer Programs**
    * RASP: [[https://arxiv.org/pdf/2106.06981|Weiss et al 2021 - Thinking like Transformers]]
    * [[https://arxiv.org/pdf/2301.05062|Lindner et al 2023 - Tracr: Compiled Transformers as a Laboratory for Interpretability]]
    * [[https://arxiv.org/pdf/2306.01128|Friedman et al 2023 - Learning Transformer Programs]]
  * **Rank Collapse in Transformers**
    * [[https://arxiv.org/pdf/2103.03404|Dong et al 2021 - Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth]]

===== Transformer Variants: Overviews =====

  * Blog post: [[https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html|Lil'Log: The Transformer Family]]
  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021 - Do Transformer Modifications Transfer Across Implementations and Applications?]] Experimental comparison of Transformer model variants

===== Improvements =====

  * [[https://arxiv.org/pdf/1803.02155.pdf|Shaw et al 2018 - Self-Attention with Relative Position Representations]]
  * [[https://arxiv.org/pdf/1901.02860.pdf|Dai et al 2019 - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context]]
  * [[https://arxiv.org/pdf/1901.11117.pdf|So et al 2019 - The Evolved Transformer]] [[ml:Neural architecture search]] for Transformer variants
  * **[[https://arxiv.org/pdf/1910.05895.pdf|Nguyen & Salazar 2019 - Transformers without Tears: Improving the Normalization of Self-Attention]]** Many of these changes are defaults in the popular Transformer codebases
  * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019 - Fast Transformer Decoding: One Write-Head is All You Need]] [[https://twitter.com/miles_brundage/status/1192299504524857344|twitter]] Used in [[https://arxiv.org/pdf/2203.07814.pdf|AlphaCode]] to speed up decoding.
  * [[https://arxiv.org/pdf/2008.07772.pdf|Liu et al 2020 - Very Deep Transformers for Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2005.00581.pdf|Subramanian et al 2020 - Multi-scale Transformer Language Models]]
  * [[https://arxiv.org/pdf/1905.07799.pdf|Sukhbaatar et al 2019 - Adaptive Attention Span in Transformers]] Related to Milad's work.
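The O(n^2) cost noted under Transformer Properties comes from materializing the full n-by-n attention score matrix. A minimal NumPy sketch of single-head scaled dot-product attention (illustrative only, not code from any paper above):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d). Returns an (n, d) array."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): the O(n^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)

rng = np.random.default_rng(0)
n, d = 6, 4
out = attention(rng.normal(size=(n, d)),
                rng.normal(size=(n, d)),
                rng.normal(size=(n, d)))
assert out.shape == (n, d)
```

FlashAttention and the linear-attention papers listed under Efficient Transformers avoid materializing this (n, n) matrix.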
  * [[https://arxiv.org/pdf/2105.04241.pdf|Zemlyanskiy et al 2021 - ReadTwice: Reading Very Large Documents with Memories]]
  * [[https://arxiv.org/pdf/2110.07732.pdf|Csordás et al 2021 - The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization]]
  * [[https://arxiv.org/pdf/2203.00555.pdf|Wang et al 2022 - DeepNet: Scaling Transformers to 1,000 Layers]]
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * [[https://princeton-nlp.github.io/flash-decoding/|Dao et al 2023 - Flash Decoding]] Speeds up FlashAttention for decoding. (Essentially, it fixes a problem in the way decoding was implemented initially, so it is much faster; the new way is the more natural way it should have been implemented.)

==== Mixture-of-Experts (MoE) Transformers ====

  * [[https://arxiv.org/pdf/2006.16668.pdf|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]]
  * [[https://arxiv.org/pdf/2110.01786.pdf|Zhang et al 2021 - MoEfication: Transformer Feed-forward Layers are Mixtures of Experts]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]

===== Ablation Experiments on the Transformer =====

These are ablation experiments on the Transformer, such as [[https://www.aclweb.org/anthology/W18-6219.pdf|ablating the multi-head attention]] or [[https://www.aclweb.org/anthology/P18-1167.pdf|comparing to an LSTM with multi-head attention]].

  * [[https://arxiv.org/pdf/1804.09849.pdf|Chen et al 2018 - The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation]] Outperforms the Transformer with a stacked BiLSTM with multi-head attention and other tricks from the Transformer. Slightly slower per token, but converges faster.
  * **[[https://www.aclweb.org/anthology/P18-1167.pdf|Domhan 2018 - How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures]]** Tries combinations of Transformer, RNN, and CNN decoder and encoder layers. Shows that "one can bring recurrent and convolutional models very close to the Transformer performance by borrowing concepts from the Transformer architecture, but not using self-attention." In particular, they find that:
    * "Source attention on lower encoder layers brings no additional benefit."
    * "Multiple source attention layers and residual feed-forward layers are key."
    * "Self-attention is more important for the source than for the target side."
  * Simple Self-Attention Network (SSAN): [[https://www.aclweb.org/anthology/W18-6219.pdf|Ambartsoumian & Popowich 2018 - Self-Attention: A Better Building Block for Sentiment Analysis Neural Network Classifiers]] Compares the Transformer to a 1-layer and 2-layer single-headed transformer layer.
  * RNN with attention back to previous states. Has anyone compared this to the Transformer? I can't remember.
  * [[https://ojs.aaai.org//index.php/AAAI/article/view/4487|2019 - Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder]] Shared weights for encoder and decoder. Very natural if you consider the seq2seq Transformer as a conditional language model.
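The MoE papers above replace each feed-forward layer with a set of expert FFNs and route each token to a small subset of them. A minimal top-1 (Switch-style) routing sketch; all names, sizes, and the ReLU experts are illustrative, not taken from any of the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 5
# Each expert is a small two-layer FFN (W1, W2).
experts = [(rng.normal(size=(d, 2 * d)), rng.normal(size=(2 * d, d)))
           for _ in range(n_experts)]
W_router = rng.normal(size=(d, n_experts))

def moe_layer(x):
    """x: (n_tokens, d). Each token is processed by exactly one expert."""
    logits = x @ W_router                       # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)       # softmax over experts
    choice = probs.argmax(-1)                   # top-1 expert per token
    out = np.empty_like(x)
    for i, e in enumerate(choice):
        W1, W2 = experts[e]
        h = np.maximum(x[i] @ W1, 0.0)          # ReLU FFN expert
        out[i] = probs[i, e] * (h @ W2)         # scale by gate probability
    return out

y = moe_layer(rng.normal(size=(n_tokens, d)))
assert y.shape == (n_tokens, d)
```

Only one expert's parameters are used per token, which is why MoE layers add capacity without a proportional increase in per-token compute.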
  * Fixed (not learned) attention patterns in the encoder: [[https://arxiv.org/pdf/2002.10260.pdf|Raganato et al 2020 - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation]]
  * Linearizing the softmax in the attention, making it O(n) to compute: [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020]]
  * No position embeddings (NoPE): [[https://arxiv.org/pdf/2305.19466.pdf|2023 - The Impact of Positional Encoding on Length Generalization in Transformers]]
    * [[https://arxiv.org/pdf/2404.12224|Wang et al 2024 - Length Generalization of Causal Transformers without Position Encoding]]

==== Pruning Attention Heads ====

  * [[https://arxiv.org/pdf/1905.10650.pdf|Michel et al 2019 - Are Sixteen Heads Really Better than One?]]
  * [[https://www.aclweb.org/anthology/2020.emnlp-main.211.pdf|Behnke & Heafield 2020 - Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation]]

===== Training =====

  * **[[https://arxiv.org/pdf/1804.00247.pdf|Popel & Bojar 2018 - Training Tips for the Transformer Model]]**
  * [[ml:NN Initialization|Initialization]] issues
    * [[https://arxiv.org/pdf/1908.11365.pdf|Zhang et al 2019 - Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention]]
    * [[https://arxiv.org/pdf/2004.08249.pdf|Liu et al 2020 - Understanding the Difficulty of Training Transformers]]
    * [[https://arxiv.org/pdf/2008.07772.pdf|Liu et al 2020 - Very Deep Transformers for Neural Machine Translation]]
  * [[ml:Optimizers|Optimizer]] issues
    * [[https://arxiv.org/pdf/1912.03194.pdf|Zhang et al 2019 - Why are Adaptive Methods Good for Attention Models?]]
    * Warm-up, see [[ml:Learning Rate#Warm-up]]
  * [[ml:Normalization]] issues
    * [[https://arxiv.org/pdf/1910.05895.pdf|Nguyen & Salazar 2019 - Transformers without Tears: Improving the Normalization of Self-Attention]]
    * [[https://arxiv.org/pdf/1910.07467.pdf|RMSNorm]]. Improvement to layer normalization. Shown by [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]] to work well for Transformers.
    * [[https://arxiv.org/pdf/2002.04745.pdf|Xiong et al 2020 - On Layer Normalization in the Transformer Architecture]] Says pre-norm Transformers don't need warm-up and are often better
    * [[https://arxiv.org/pdf/2004.08249.pdf|Liu et al 2020 - Understanding the Difficulty of Training Transformers]]
    * [[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020 - PowerNorm: Rethinking Batch Normalization in Transformers]]
    * [[https://arxiv.org/pdf/2203.00555.pdf|Wang et al 2022 - DeepNet: Scaling Transformers to 1,000 Layers]]
  * Stabilization of training
    * [[https://arxiv.org/pdf/2303.06296|Zhai et al 2023 - Stabilizing Transformer Training by Preventing Attention Entropy Collapse]]
  * Miscellaneous topics
    * [[https://arxiv.org/pdf/2010.09697.pdf|Merrill et al 2020 - Parameter Norm Growth During Training of Transformers]]

===== Efficient Transformers =====

==== Survey Papers ====

  * [[https://arxiv.org/pdf/2009.06732.pdf|Tay et al 2020 - Efficient Transformers: A Survey]]
  * [[https://arxiv.org/pdf/2011.04006.pdf|Tay et al 2020 - Long Range Arena: A Benchmark for Efficient Transformers]]

==== Papers ====

  * [[https://arxiv.org/pdf/1902.09113.pdf|Guo et al 2019 - Star-Transformer]]
  * [[https://arxiv.org/pdf/1911.05507.pdf|Rae et al 2019 - Compressive Transformers for Long-Range Sequence Modelling]]
  * [[https://arxiv.org/pdf/2001.04451|Kitaev et al 2020 - Reformer: The Efficient Transformer]] Uses LSH to speed up attention
  * [[https://arxiv.org/pdf/2003.05997.pdf|Roy et al 2020 - Efficient Content-Based Sparse Attention with Routing Transformers]]
  * [[https://arxiv.org/pdf/2005.04908.pdf|Hofstätter et al 2020 - Local Self-Attention over Long Text for Efficient Document Retrieval]] Sliding-window local attention mechanism
  * [[https://arxiv.org/pdf/2006.04768.pdf|Wang et al 2020 - Linformer: Self-Attention with Linear Complexity]]
  * [[https://arxiv.org/pdf/2006.16236.pdf|Katharopoulos et al 2020 - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]] Misleading name. This paper linearizes the softmax in the attention layers, which makes it O(n) to compute
  * [[https://arxiv.org/pdf/2009.14794.pdf|Choromanski et al 2020 - Rethinking Attention with Performers]] Fast attention via positive orthogonal random features.
  * [[https://arxiv.org/pdf/2102.03902.pdf|Xiong et al 2021 - Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention]] Similar to SVD, but approximately linearizes the softmax by selecting landmarks before the softmax.
  * [[https://arxiv.org/pdf/2101.03961.pdf|Fedus et al 2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity]]
  * [[https://openreview.net/forum?id=QtTKTdVrFBB|Peng et al 2021 - Random Feature Attention]] Uses random features to approximate the softmax, making attention O(1) per decoding step. Drop-in replacement for standard attention. Experiments with the Transformer.
  * [[https://arxiv.org/pdf/2105.03824.pdf|Lee-Thorp et al 2021 - FNet: Mixing Tokens with Fourier Transforms]]
  * [[https://arxiv.org/pdf/2106.01540.pdf|Ma et al 2021 - Luna: Linear Unified Nested Attention]]
  * Hourglass Transformer: [[https://arxiv.org/pdf/2110.13711.pdf|Hierarchical Transformers Are More Efficient Language Models]] Has three blocks of layers: ones that downsample the tokens through pooling, ones that process them, and ones that upsample.
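The linearized softmax of Katharopoulos et al 2020 can be sketched as follows. With a positive feature map phi (the paper uses elu(x)+1), associativity lets us compute phi(K)^T V, a (d, d) matrix, once, so no (n, n) score matrix is ever formed (non-causal version shown; details beyond the paper's formula are illustrative):

```python
import numpy as np

def phi(x):
    """elu(x) + 1: a positive feature map, as in Katharopoulos et al 2020."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K, V: (n, d). O(n * d^2) time and O(d^2) extra memory."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (d, d): replaces the (n, n) score matrix
    Z = Qf @ Kf.sum(axis=0)          # (n,) normalizer, always positive
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
out = linear_attention(*(rng.normal(size=(n, d)) for _ in range(3)))
assert out.shape == (n, d)
```

The causal version keeps running sums of Kf[i] ⊗ V[i] and Kf[i], which is what makes the "Transformers are RNNs" framing of the title work.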
  * **FLASH: [[https://arxiv.org/pdf/2202.10447.pdf|Hua et al 2022 - Transformer Quality in Linear Time]]**
  * [[https://arxiv.org/pdf/2211.09761.pdf|Nawrot et al 2022 - Efficient Transformers with Dynamic Token Pooling]]

==== Datasets and Benchmarks ====

  * [[https://github.com/google-research/long-range-arena|LRA]] (pronounced "ELRA"): [[https://arxiv.org/pdf/2011.04006.pdf|Tay et al 2020 - Long Range Arena: A Benchmark for Efficient Transformers]] (Not just NLP tasks)

===== Long-Context Transformers =====

==== Survey Papers ====

  * [[https://arxiv.org/pdf/2302.14502|Dong et al 2023 - A Survey on Long Text Modeling with Transformers]]
  * [[https://arxiv.org/pdf/2311.12351|Huang et al 2023 - Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey]]

==== Papers ====

  * [[https://arxiv.org/pdf/2004.05150|Beltagy et al 2020 - Longformer: The Long-Document Transformer]]
  * [[https://arxiv.org/pdf/2007.14062|Zaheer et al 2020 - Big Bird: Transformers for Longer Sequences]]
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * **[[https://arxiv.org/pdf/2305.01625.pdf|Bertsch et al 2023 - Unlimiformer: Long-Range Transformers with Unlimited Length Input]]**
  * [[https://arxiv.org/pdf/2305.16300.pdf|Mohtashami et al 2023 - Landmark Attention: Random-Access Infinite Context Length for Transformers]]
  * [[https://arxiv.org/pdf/2306.14893.pdf|Guo et al 2023 - LongCoder: A Long-Range Pre-trained Language Model for Code Completion]]
  * **[[https://arxiv.org/pdf/2306.15595.pdf|Chen et al 2023 - Extending Context Window of Large Language Models via Positional Interpolation]]**
  * [[https://arxiv.org/pdf/2307.02486|Ding et al 2023 - LongNet: Scaling Transformers to 1,000,000,000 Tokens]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2311.04879.pdf|Yang 2023 - LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models]]
  * [[https://arxiv.org/pdf/2311.04939.pdf|Li et al 2023 - LooGLE: Can Long-Context Language Models Understand Long Contexts?]]
  * [[https://arxiv.org/pdf/2401.04658.pdf|Qin et al 2024 - Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models]]
  * **[[https://arxiv.org/pdf/2404.07143|Munkhdalai et al 2024 - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention]]**
  * [[https://arxiv.org/pdf/2308.16137|Han et al 2023 - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models]] Extends the length limit to 200 million tokens with no additional training and is O(n)
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2402.14848|Levy et al 2024 - Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models]]
  * [[https://arxiv.org/pdf/2406.14673|Lu et al 2024 - Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell]]

===== Position Embeddings =====

  * Learned position embeddings: [[https://arxiv.org/pdf/1705.03122.pdf|Gehring et al 2017 - Convolutional Sequence to Sequence Learning]]
  * [[https://openreview.net/pdf?id=nMYj4argap|2022 - Randomized Positional Encodings Boost Length Generalization of Transformers]] (Submitted to ACL 2022, not accepted.) Has good related work in section 4, with comparison to prior work.
  * [[https://arxiv.org/pdf/2104.09864.pdf|Su et al 2021 - RoFormer: Enhanced Transformer with Rotary Position Embedding]]
  * [[https://arxiv.org/pdf/2108.12409.pdf|Press et al 2021 - Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation]] ALiBi: adds a linearly decaying distance penalty to the attention scores (an exponential decay of attention weight after the softmax) to encode positional information
  * No Position Embeddings (NoPE): [[https://browse.arxiv.org/pdf/2305.19466.pdf|2023 - The Impact of Positional Encoding on Length Generalization in Transformers]]
  * [[https://arxiv.org/pdf/2306.15595.pdf|Chen et al 2023 - Extending Context Window of Large Language Models via Positional Interpolation]]

===== Software =====

See also [[ml:Software]].

  * Hugging Face Transformers [[https://www.aclweb.org/anthology/2020.emnlp-demos.6.pdf|Paper]]
  * Tutorials and example code
    * [[https://www.tensorflow.org/text/tutorials/transformer|TensorFlow's transformer tutorial]]
    * [[https://nlp.seas.harvard.edu/2018/04/03/attention.html|The Annotated Transformer]]
    * [[https://github.com/karpathy/nanoGPT|nanoGPT]] Small version by Andrej Karpathy. Very cool

===== Related Pages =====

  * [[Attention Mechanisms]]
  * [[BERT and Friends]]
  * [[Seq2seq]]
  * [[ml:State-Space Models]]