====== nlp:transformers ======

  * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] CoT increases the expressive power of Transformers: with CoT length linear in the input length they can recognize regular languages, and with CoT length polynomial in the input length (plus PreNorm) they recognize exactly the class of polynomial-time solvable problems (restated formally after this list)
  * [[https://arxiv.org/pdf/2505.18948|Merrill & Sabharwal 2025 - Exact Expressive Power of Transformers with Padding]]
  * **[[https://arxiv.org/pdf/2505.23623|Li & Cotterell 2025 - Characterizing the Expressivity of Transformer Language Models]]**
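
The Merrill & Sabharwal 2024 result above, restated in complexity-class notation. This is a paraphrase of the summary in this list, not the paper's exact theorem statement; the notation ''CoT[t(n)]'' is introduced here only for illustration:

<code latex>
% CoT[t(n)] = languages recognizable by a transformer decoder that may
% emit t(n) chain-of-thought tokens on inputs of length n.
% Linear-length CoT suffices for the regular languages:
\mathrm{Regular} \subseteq \mathrm{CoT}[O(n)]
% Polynomial-length CoT (with PreNorm) captures exactly polynomial time:
\mathrm{CoT}[\mathrm{poly}(n)] = \mathrm{P}
</code>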
  
===== Analysis and Interpretation =====
  * [[https://arxiv.org/pdf/2008.02217.pdf|Ramsauer et al 2020 - Hopfield Networks is All You Need]]
  * [[https://arxiv.org/pdf/2012.14913.pdf|Geva et al 2020 - Transformer Feed-Forward Layers Are Key-Value Memories]]
  * [[https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens|nostalgebraist 2020 - The Logit Lens]] Widely used; see [[https://arxiv.org/pdf/2503.11667|LogitLens4LLMs]] for examples (a minimal code sketch follows this list)
  * [[https://arxiv.org/pdf/2310.03686|Langedijk et al 2023 - DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers]]
  * **For decoders/LLMs**
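
A minimal logit-lens sketch, assuming the HuggingFace ''transformers'' GPT-2 model (the prompt and variable names are illustrative): each layer's hidden state is pushed through the model's final layer norm and unembedding matrix to see what the model would predict at that depth.

<code python>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states = (embedding output, layer 1, ..., layer 12) for GPT-2 small
for layer, h in enumerate(out.hidden_states):
    # Logit lens: decode an intermediate hidden state as if it were the final one
    logits = model.lm_head(model.transformer.ln_f(h))   # (1, seq_len, vocab)
    top = int(logits[0, -1].argmax())
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
</code>

Intermediate predictions typically sharpen toward the final answer in the last few layers, which is what makes the lens useful for spotting where a prediction forms.
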
  * [[https://arxiv.org/pdf/2006.16668.pdf|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] (see the routing sketch after this list)
  * [[https://arxiv.org/pdf/2110.01786.pdf|Zhang et al 2021 - MoEfication: Transformer Feed-forward Layers are Mixtures of Experts]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
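
A minimal top-k expert-routing sketch in PyTorch. All names here (''TopKMoE'', ''n_experts'', etc.) are illustrative assumptions, not taken from the papers above: a learned gate scores all experts per token, the top k experts run, and their outputs are mixed by renormalized gate weights.

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feed-forward expert (an ordinary transformer FFN block)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class TopKMoE(nn.Module):
    """Route each token to its top-k experts; mix outputs by gate weight."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                       # x: (n_tokens, d_model)
        top_w, top_i = self.gate(x).topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e      # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 64)).shape)     # torch.Size([10, 64])
</code>

Because only k of the experts run per token, parameter count grows with ''n_experts'' while per-token compute stays roughly constant, which is the core appeal of conditional computation in GShard-style models.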
  
===== Ablation Experiments on the Transformer =====