nlp:transformers

Last revision: 2025/10/17 20:09 by jmflanig
  
See also the group [[https://flann.super.site/|FLaNN]] (Formal Languages and Neural Networks).
  * **Overviews**
    * [[https://arxiv.org/pdf/2311.00208|Strobl et al 2023 - What Formal Languages Can Transformers Express? A Survey]]
  * **[[https://arxiv.org/pdf/1906.06755.pdf|Hahn 2019 - Theoretical Limitations of Self-Attention in Neural Sequence Models]]** Indicates Transformers can't even represent finite state machines
    * [[https://arxiv.org/pdf/2202.12172.pdf|Chiang et al 2022 - Overcoming a Theoretical Limitation of Self-Attention]]
  * [[https://arxiv.org/pdf/2210.02671.pdf|Merrill et al 2022 - Transformers Can Be Translated to First-Order Logic with Majority Quantifiers]]
  * [[https://arxiv.org/pdf/2210.10749.pdf|Liu et al 2022 - Transformers Learn Shortcuts to Automata]]
  * [[https://arxiv.org/pdf/2301.10743|Chiang et al 2023 - Tighter Bounds on the Expressivity of Transformer Encoders]]
    * Follow-up work: [[https://arxiv.org/pdf/2210.02671|Merrill & Sabharwal 2023 - A Logic for Expressing Log-Precision Transformers]]
  * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] CoT increases the expressive power of Transformers: with a number of CoT steps linear in the input length they can recognize regular languages, and with polynomially many steps (plus PreNorm) they recognize exactly the class of polynomial-time solvable problems
  * [[https://arxiv.org/pdf/2505.18948|Merrill & Sabharwal 2025 - Exact Expressive Power of Transformers with Padding]]
  * **[[https://arxiv.org/pdf/2505.23623|Li & Cotterell 2025 - Characterizing the Expressivity of Transformer Language Models]]**
  
===== Analysis and Interpretation =====
  * [[https://arxiv.org/pdf/2008.02217.pdf|Ramsauer et al 2020 - Hopfield Networks is All You Need]]
  * [[https://arxiv.org/pdf/2012.14913.pdf|Geva et al 2020 - Transformer Feed-Forward Layers Are Key-Value Memories]]
  * [[https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens|2020 - The Logit Lens]] Used in many places; see [[https://arxiv.org/pdf/2503.11667|LogitLens4LLMs]] for some examples
  * [[https://arxiv.org/pdf/2310.03686|Langedijk et al 2023 - DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers]]
  * **For decoders/LLMs**
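The logit-lens idea above admits a very small sketch: take an intermediate residual-stream vector, apply the model's final normalization, and project it through the unembedding matrix to read off a token distribution at that layer. Everything here (the RMS-style norm, toy dimensions, random weights) is illustrative, not any particular model's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(hidden_state, ln_gain, unembed):
    """Project an intermediate hidden state to a vocabulary distribution.

    hidden_state: (d_model,) residual-stream vector at some layer
    ln_gain:      (d_model,) final-norm gain (toy RMS-norm stand-in)
    unembed:      (d_model, vocab) unembedding matrix
    """
    # Apply a simplified final norm, as the real model would before
    # its output projection.
    normed = hidden_state / np.sqrt((hidden_state ** 2).mean() + 1e-5)
    logits = (normed * ln_gain) @ unembed
    return softmax(logits)

rng = np.random.default_rng(0)
d_model, vocab = 16, 50             # toy sizes, purely illustrative
h = rng.normal(size=d_model)        # stand-in for a mid-layer residual state
probs = logit_lens(h, np.ones(d_model), rng.normal(size=(d_model, vocab)))
print(probs.argmax())               # "what the layer-7 state already predicts"
```

Running the lens at every layer and watching the top token stabilize is the usual diagnostic.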
  * [[https://arxiv.org/pdf/2006.16668.pdf|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]]
  * [[https://arxiv.org/pdf/2110.01786.pdf|Zhang et al 2021 - MoEfication: Transformer Feed-forward Layers are Mixtures of Experts]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
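A minimal sketch of the top-k routing these mixture-of-experts papers build on: a learned gate scores the experts, only the k best are run, and their outputs are mixed by the softmaxed gate scores. Shapes, sizes, and the ReLU experts below are all toy assumptions, not any paper's exact architecture:

```python
import numpy as np

def moe_ffn(x, gate_w, experts, k=2):
    """Top-k mixture-of-experts feed-forward layer (toy sketch).

    x:        (d,) token representation
    gate_w:   (d, n_experts) router weights
    experts:  list of (W1, W2) pairs, each a two-layer ReLU FFN
    """
    scores = x @ gate_w                        # router logits, one per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the selected k
    out = np.zeros_like(x)
    for w, i in zip(weights, top):             # only k experts do any work
        W1, W2 = experts[i]
        out += w * (np.maximum(x @ W1, 0.0) @ W2)
    return out

rng = np.random.default_rng(1)
d, n_experts, d_ff = 8, 4, 16                  # toy dimensions
experts = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
           for _ in range(n_experts)]
y = moe_ffn(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts)
```

The point of the construction is that parameter count grows with `n_experts` while per-token compute grows only with `k`.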
  
===== Ablation Experiments on the Transformer =====
    * [[https://arxiv.org/pdf/2003.07845.pdf|Shen et al 2020 - PowerNorm: Rethinking Batch Normalization in Transformers]]
    * [[https://arxiv.org/pdf/2203.00555.pdf|Wang et al 2022 - DeepNet: Scaling Transformers to 1,000 Layers]]
  * Stabilization of Training
    * [[https://arxiv.org/pdf/2303.06296|Zhai et al 2023 - Stabilizing Transformer Training by Preventing Attention Entropy Collapse]]
  * Miscellaneous topics
    * [[https://arxiv.org/pdf/2010.09697.pdf|Merrill et al 2020 - Parameter Norm Growth During Training of Transformers]]
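The entropy-collapse diagnostic in Zhai et al 2023 (above) can be sketched as computing the mean entropy of each head's attention distribution; sharply peaked, low-entropy attention is the collapse signal they monitor during training. Shapes and logit scales below are illustrative only:

```python
import numpy as np

def attention_entropy(scores):
    """Per-head mean entropy of attention distributions (toy sketch).

    scores: (heads, seq, seq) pre-softmax attention logits
    """
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)   # (heads, seq)
    return ent.mean(axis=-1)                              # average per head

rng = np.random.default_rng(0)
sharp = attention_entropy(rng.normal(size=(2, 8, 8)) * 50)   # near one-hot
broad = attention_entropy(rng.normal(size=(2, 8, 8)) * 0.1)  # near uniform
# Low mean entropy (left) is the collapse signature; a healthy head
# stays closer to the uniform bound log(seq_len).
```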
Line 143: Line 154:
 ==== Survey Papers ==== ==== Survey Papers ====
   * [[https://arxiv.org/pdf/2302.14502|Dong et al 2023 - A Survey on Long Text Modeling with Transformers]]   * [[https://arxiv.org/pdf/2302.14502|Dong et al 2023 - A Survey on Long Text Modeling with Transformers]]
  * [[https://arxiv.org/pdf/2311.12351|Huang et al 2023 - Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey]]
  
==== Papers ====
  * [[https://arxiv.org/pdf/2306.14893.pdf|Guo et al 2023 - LongCoder: A Long-Range Pre-trained Language Model for Code Completion]]
  * **[[https://arxiv.org/pdf/2306.15595.pdf|Chen et al 2023 - Extending Context Window of Large Language Models via Positional Interpolation]]**
  * [[https://arxiv.org/pdf/2307.02486|Ding et al 2023 - LongNet: Scaling Transformers to 1,000,000,000 Tokens]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2311.04879.pdf|Yang 2023 - LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models]]
  * **[[https://arxiv.org/pdf/2404.07143|Munkhdalai et al 2024 - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention]]**
  * [[https://arxiv.org/pdf/2308.16137|Han et al 2023 - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models]] Extends the length limit to 200 million tokens with no additional training, at O(n) cost
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2402.14848|Levy et al 2024 - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models]]
  * [[https://arxiv.org/pdf/2406.14673|Lu et al 2024 - Insights into LLM Long-Context Failures: When Transformers Know but Don’t Tell]]
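Positional interpolation (Chen et al 2023, above) can be sketched with rotary embeddings: rather than extrapolating to positions never seen in training, every position is scaled by L_train / L_new so all positions land inside the trained range. The RoPE implementation and context lengths below are a toy illustration, not the paper's code:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at (possibly fractional)
    position pos. x: (d,) with d even; pairs (x[2i], x[2i+1]) are rotated."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Positional interpolation: serve context length L_new with a model trained
# at L_train by compressing positions instead of extrapolating them
# (hypothetical toy lengths).
L_train, L_new = 2048, 8192
scale = L_train / L_new                 # 0.25: squeeze 8192 slots into 2048
x = np.ones(8)
pos = 5000                              # beyond the trained range
interpolated = rope(x, pos * scale)     # same as encoding position 1250
```

Fractional positions are well defined because RoPE is a continuous rotation, which is what makes this rescaling trick possible at all.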
nlp/transformers.1744001347.txt.gz · Last modified: 2025/04/07 04:49 by jmflanig
