====== nlp:transformers ======

  * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] CoT increases the expressive power of Transformers: with CoT length linear in the input length they can recognize regular languages, and with CoT length polynomial in the input length (plus PreNorm) they recognize exactly the class of polynomial-time solvable problems (restated formally after this list)
  * [[https://arxiv.org/pdf/2505.18948|Merrill & Sabharwal 2025 - Exact Expressive Power of Transformers with Padding]]
  * **[[https://arxiv.org/pdf/2505.23623|Li & Cotterell 2025 - Characterizing the Expressivity of Transformer Language Models]]**
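
The Merrill & Sabharwal 2024 result above, restated in complexity-class notation. This is a paraphrase of the summary in this list, not the paper's exact theorem statement; the notation ''CoT[t(n)]'' is introduced here only for illustration:

<code latex>
% CoT[t(n)] = languages recognizable by a transformer decoder that may
% emit t(n) chain-of-thought tokens on inputs of length n.
% Linear-length CoT suffices for the regular languages:
\mathrm{Regular} \subseteq \mathrm{CoT}[O(n)]
% Polynomial-length CoT (with PreNorm) captures exactly polynomial time:
\mathrm{CoT}[\mathrm{poly}(n)] = \mathrm{P}
</code>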
  
===== Analysis and Interpretation =====
  * [[https://arxiv.org/pdf/2008.02217.pdf|Ramsauer et al 2020 - Hopfield Networks is All You Need]]
  * [[https://arxiv.org/pdf/2012.14913.pdf|Geva et al 2020 - Transformer Feed-Forward Layers Are Key-Value Memories]]
  * [[https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens|nostalgebraist 2020 - The Logit Lens]] Widely used; see [[https://arxiv.org/pdf/2503.11667|LogitLens4LLMs]] for examples (a minimal code sketch follows this list)
  * [[https://arxiv.org/pdf/2310.03686|Langedijk et al 2023 - DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers]]
  * **For decoders/LLMs**
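
A minimal logit-lens sketch, assuming the HuggingFace ''transformers'' GPT-2 model (the prompt and variable names are illustrative): each layer's hidden state is pushed through the model's final layer norm and unembedding matrix to see what the model would predict at that depth.

<code python>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states = (embedding output, layer 1, ..., layer 12) for GPT-2 small
for layer, h in enumerate(out.hidden_states):
    # Logit lens: decode an intermediate hidden state as if it were the final one
    logits = model.lm_head(model.transformer.ln_f(h))   # (1, seq_len, vocab)
    top = int(logits[0, -1].argmax())
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
</code>

Intermediate predictions typically sharpen toward the final answer in the last few layers, which is what makes the lens useful for spotting where a prediction forms.
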
  * [[https://arxiv.org/pdf/2006.16668.pdf|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] (see the routing sketch after this list)
  * [[https://arxiv.org/pdf/2110.01786.pdf|Zhang et al 2021 - MoEfication: Transformer Feed-forward Layers are Mixtures of Experts]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
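
A minimal top-k expert-routing sketch in PyTorch. All names here (''TopKMoE'', ''n_experts'', etc.) are illustrative assumptions, not taken from the papers above: a learned gate scores all experts per token, the top k experts run, and their outputs are mixed by renormalized gate weights.

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feed-forward expert (an ordinary transformer FFN block)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class TopKMoE(nn.Module):
    """Route each token to its top-k experts; mix outputs by gate weight."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                       # x: (n_tokens, d_model)
        top_w, top_i = self.gate(x).topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e      # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 64)).shape)     # torch.Size([10, 64])
</code>

Because only k of the experts run per token, parameter count grows with ''n_experts'' while per-token compute stays roughly constant, which is the core appeal of conditional computation in GShard-style models.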
  
===== Ablation Experiments on the Transformer =====