nlp:transformers

Last revision: 2025/10/17 20:09 by jmflanig
  
See also the group [[https://flann.super.site/|FLaNN]] (Formal Languages and Neural Networks).
  * **Overviews**
    * [[https://arxiv.org/pdf/2311.00208|Strobl et al 2023 - What Formal Languages Can Transformers Express? A Survey]]
  * **[[https://arxiv.org/pdf/1906.06755.pdf|Hahn 2019 - Theoretical Limitations of Self-Attention in Neural Sequence Models]]** Indicates Transformers can't even represent finite state machines
    * [[https://arxiv.org/pdf/2202.12172.pdf|Chiang et al 2022 - Overcoming a Theoretical Limitation of Self-Attention]]
  * [[https://arxiv.org/pdf/2210.02671.pdf|Merrill et al 2022 - Transformers Can Be Translated to First-Order Logic with Majority Quantifiers]]
  * [[https://arxiv.org/pdf/2210.10749.pdf|Liu et al 2022 - Transformers Learn Shortcuts to Automata]]
  * [[https://arxiv.org/pdf/2301.10743|Chiang et al 2023 - Tighter Bounds on the Expressivity of Transformer Encoders]]
    * Follow-up work: [[https://arxiv.org/pdf/2210.02671|Merrill & Sabharwal 2023 - A Logic for Expressing Log-Precision Transformers]]
  * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] CoT increases the expressive power of Transformers: with a number of CoT steps linear in the input length they can recognize regular languages, and with polynomially many steps (plus PreNorm) they recognize exactly the class of polynomial-time solvable problems
  * [[https://arxiv.org/pdf/2505.18948|Merrill & Sabharwal 2025 - Exact Expressive Power of Transformers with Padding]]
  * **[[https://arxiv.org/pdf/2505.23623|Li & Cotterell 2025 - Characterizing the Expressivity of Transformer Language Models]]**
  
===== Analysis and Interpretation =====
  * [[https://arxiv.org/pdf/2008.02217.pdf|Ramsauer et al 2020 - Hopfield Networks is All You Need]]
  * [[https://arxiv.org/pdf/2012.14913.pdf|Geva et al 2020 - Transformer Feed-Forward Layers Are Key-Value Memories]]
  * [[https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens|2020 - The Logit Lens]] Used in many places; see [[https://arxiv.org/pdf/2503.11667|LogitLens4LLMs]] for some examples
  * [[https://arxiv.org/pdf/2310.03686|Langedijk et al 2023 - DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers]]
  * **For decoders/LLMs**
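The logit-lens idea above admits a very small sketch: take an intermediate residual-stream vector, apply the model's final normalization, and project it through the unembedding matrix to read off a token distribution at that layer. Everything here (the RMS-style norm, toy dimensions, random weights) is illustrative, not any particular model's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(hidden_state, ln_gain, unembed):
    """Project an intermediate hidden state to a vocabulary distribution.

    hidden_state: (d_model,) residual-stream vector at some layer
    ln_gain:      (d_model,) final-norm gain (toy RMS-norm stand-in)
    unembed:      (d_model, vocab) unembedding matrix
    """
    # Apply a simplified final norm, as the real model would before
    # its output projection.
    normed = hidden_state / np.sqrt((hidden_state ** 2).mean() + 1e-5)
    logits = (normed * ln_gain) @ unembed
    return softmax(logits)

rng = np.random.default_rng(0)
d_model, vocab = 16, 50             # toy sizes, purely illustrative
h = rng.normal(size=d_model)        # stand-in for a mid-layer residual state
probs = logit_lens(h, np.ones(d_model), rng.normal(size=(d_model, vocab)))
print(probs.argmax())               # "what the layer-7 state already predicts"
```

Running the lens at every layer and watching the top token stabilize is the usual diagnostic.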
  * [[https://arxiv.org/pdf/2006.16668.pdf|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]]
  * [[https://arxiv.org/pdf/2110.01786.pdf|Zhang et al 2021 - MoEfication: Transformer Feed-forward Layers are Mixtures of Experts]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
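A minimal sketch of the top-k routing these mixture-of-experts papers build on: a learned gate scores the experts, only the k best are run, and their outputs are mixed by the softmaxed gate scores. Shapes, sizes, and the ReLU experts below are all toy assumptions, not any paper's exact architecture:

```python
import numpy as np

def moe_ffn(x, gate_w, experts, k=2):
    """Top-k mixture-of-experts feed-forward layer (toy sketch).

    x:        (d,) token representation
    gate_w:   (d, n_experts) router weights
    experts:  list of (W1, W2) pairs, each a two-layer ReLU FFN
    """
    scores = x @ gate_w                        # router logits, one per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the selected k
    out = np.zeros_like(x)
    for w, i in zip(weights, top):             # only k experts do any work
        W1, W2 = experts[i]
        out += w * (np.maximum(x @ W1, 0.0) @ W2)
    return out

rng = np.random.default_rng(1)
d, n_experts, d_ff = 8, 4, 16                  # toy dimensions
experts = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
           for _ in range(n_experts)]
y = moe_ffn(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts)
```

The point of the construction is that parameter count grows with `n_experts` while per-token compute grows only with `k`.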
  
===== Ablation Experiments on the Transformer =====
    * [[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020 - PowerNorm: Rethinking Batch Normalization in Transformers]]
    * [[https://arxiv.org/pdf/2203.00555.pdf|Wang et al 2022 - DeepNet: Scaling Transformers to 1,000 Layers]]
  * Stabilization of Training
    * [[https://arxiv.org/pdf/2303.06296|Zhai et al 2023 - Stabilizing Transformer Training by Preventing Attention Entropy Collapse]]
  * Miscellaneous topics
    * [[https://arxiv.org/pdf/2010.09697.pdf|Merrill et al 2020 - Parameter Norm Growth During Training of Transformers]]
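The entropy-collapse diagnostic in Zhai et al 2023 (above) can be sketched as computing the mean entropy of each head's attention distribution; sharply peaked, low-entropy attention is the collapse signal they monitor during training. Shapes and logit scales below are illustrative only:

```python
import numpy as np

def attention_entropy(scores):
    """Per-head mean entropy of attention distributions (toy sketch).

    scores: (heads, seq, seq) pre-softmax attention logits
    """
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)   # (heads, seq)
    return ent.mean(axis=-1)                              # average per head

rng = np.random.default_rng(0)
sharp = attention_entropy(rng.normal(size=(2, 8, 8)) * 50)   # near one-hot
broad = attention_entropy(rng.normal(size=(2, 8, 8)) * 0.1)  # near uniform
# Low mean entropy (left) is the collapse signature; a healthy head
# stays closer to the uniform bound log(seq_len).
```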
Line 143: Line 154:
 ==== Survey Papers ==== ==== Survey Papers ====
   * [[https://arxiv.org/pdf/2302.14502|Dong et al 2023 - A Survey on Long Text Modeling with Transformers]]   * [[https://arxiv.org/pdf/2302.14502|Dong et al 2023 - A Survey on Long Text Modeling with Transformers]]
  * [[https://arxiv.org/pdf/2311.12351|Huang et al 2023 - Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey]]
  
==== Papers ====
  * [[https://arxiv.org/pdf/2306.14893.pdf|Guo et al 2023 - LongCoder: A Long-Range Pre-trained Language Model for Code Completion]]
  * **[[https://arxiv.org/pdf/2306.15595.pdf|Chen et al 2023 - Extending Context Window of Large Language Models via Positional Interpolation]]**
  * [[https://arxiv.org/pdf/2307.02486|Ding et al 2023 - LongNet: Scaling Transformers to 1,000,000,000 Tokens]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2311.04879.pdf|Yang 2023 - LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models]]
  * **[[https://arxiv.org/pdf/2404.07143|Munkhdalai et al 2024 - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention]]**
  * [[https://arxiv.org/pdf/2308.16137|Han et al 2023 - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models]] Extends the length limit to 200 million tokens with no additional training, at O(n) cost
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2402.14848|Levy et al 2024 - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models]]
  * [[https://arxiv.org/pdf/2406.14673|Lu et al 2024 - Insights into LLM Long-Context Failures: When Transformers Know but Don’t Tell]]
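Positional interpolation (Chen et al 2023, above) can be sketched with rotary embeddings: rather than extrapolating to positions never seen in training, every position is scaled by L_train / L_new so all positions land inside the trained range. The RoPE implementation and context lengths below are a toy illustration, not the paper's code:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at (possibly fractional)
    position pos. x: (d,) with d even; pairs (x[2i], x[2i+1]) are rotated."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Positional interpolation: serve context length L_new with a model trained
# at L_train by compressing positions instead of extrapolating them
# (hypothetical toy lengths).
L_train, L_new = 2048, 8192
scale = L_train / L_new                 # 0.25: squeeze 8192 slots into 2048
x = np.ones(8)
pos = 5000                              # beyond the trained range
interpolated = rope(x, pos * scale)     # same as encoding position 1250
```

Fractional positions are well defined because RoPE is a continuous rotation, which is what makes this rescaling trick possible at all.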
nlp/transformers.1744001347.txt.gz · Last modified: 2025/04/07 04:49 by jmflanig
