====== Mechanistic Interpretability ======

Mechanistic interpretability research was being done in NLP under other names before the term was coined. See [[https://arxiv.org/pdf/2410.09087|Mechanistic?]] for important historical context.

===== Overviews =====

  * [[https://arxiv.org/pdf/2404.14082|Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety - A Review]]
  * [[https://arxiv.org/pdf/2407.02646|Rai et al 2024 - A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models]] Paper list: [[https://github.com/Dakingrai/awesome-mechanistic-interpretability-lm-papers|github]]
  * [[https://arxiv.org/pdf/2501.16496|Sharkey et al 2025 - Open Problems in Mechanistic Interpretability]]
  * [[https://arxiv.org/pdf/2502.17516|Lin et al 2025 - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models]]

===== Papers =====

See also the papers in [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]], [[Neural Network Psychology]], [[nlp:Probing Experiments]], and [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]].
  * [[https://arxiv.org/pdf/2211.00593|Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small]]
  * **[[https://arxiv.org/pdf/2304.14997|Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability]]**
  * [[https://arxiv.org/pdf/2304.14767|Geva et al 2023 - Dissecting Recall of Factual Associations in Auto-Regressive Language Models]]
  * [[https://arxiv.org/pdf/2305.00586|Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model]]
  * [[https://arxiv.org/pdf/2404.14349|Rajaram et al 2024 - Automatic Discovery of Visual Circuits]]
  * [[https://openreview.net/pdf?id=PRZ89QElAv|2025 - The role of Mechanistic Interpretability in ’unveiling’ the emergent representations of Large Language Models]] (This paper was rejected from TMLR; see the [[https://openreview.net/forum?id=PRZ89QElAv|reviews]].)
  * [[https://arxiv.org/pdf/2405.13868|Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs]]
  * [[https://arxiv.org/pdf/2505.16538|Nie et al 2025 - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models]]
  * **Induction Heads**
    * [[https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html|Olsson et al 2022 - In-context Learning and Induction Heads]] See the first figure for a description of an induction head.
    * [[https://arxiv.org/pdf/2404.07129|Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation]] See figure 1a for an induction head example.
    * [[https://arxiv.org/pdf/2504.00132|Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B]]
    * [[https://arxiv.org/pdf/2505.16694|Minegishi et al 2025 - Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence]]
    * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]
  * **Explaining Neurons**
    * [[https://transluce.org/neuron-descriptions|Choi et al 2024 - Scaling Automatic Neuron Description]] They released a [[https://github.com/TransluceAI/observatory|database]] of descriptions of every neuron inside Llama-3.1-8B-Instruct.

===== Sparse Autoencoders =====

This section may eventually be moved to its own page. See also [[nlp:language_model#steering|Language Model - Steering]].

  * [[https://arxiv.org/pdf/2309.08600|Cunningham et al 2023 - Sparse Autoencoders Find Highly Interpretable Features in Language Models]]
  * [[https://transformer-circuits.pub/2023/monosemantic-features|2023 - Towards Monosemanticity: Decomposing Language Models With Dictionary Learning]]
  * [[https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html|2024 - Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet]]

===== Resources =====

  * **Research Threads and Blogs**
    * [[https://transformer-circuits.pub/|Transformer Circuits Thread]]
    * [[https://transluce.org/our-work|Transluce's Research Thread]]
  * **Companies**
    * [[https://www.anthropic.com/|Anthropic]]
    * [[https://transluce.org/|Transluce]]

===== Related Pages =====

  * [[nlp:Alignment]]
  * [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]]
  * [[nlp:Explainability]]
  * [[nlp:LLM Safety]]
  * [[Neural Network Psychology]]
  * [[nlp:Probing Experiments]]
  * [[Trustworthy AI]]
  * [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]]
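As a quick orientation to the papers listed under Sparse Autoencoders above: they share one core recipe — encode a model activation into an overcomplete, mostly-zero feature vector, decode it back linearly, and train with a reconstruction loss plus an L1 sparsity penalty. Below is a minimal sketch of that forward pass and loss in plain Python. The toy dimensions and hand-picked weights are illustrative assumptions, not from any of the cited papers; real SAEs are trained on cached model activations at scale.

```python
# Minimal sparse-autoencoder (SAE) sketch: activation x -> sparse
# features f -> reconstruction x_hat, with an L1 sparsity penalty.
# All weights below are toy values chosen by hand for illustration.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative, overcomplete features f; decode linearly."""
    f = relu(vadd(matvec(W_enc, x), b_enc))
    x_hat = vadd(matvec(W_dec, f), b_dec)
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=0.01):
    """Squared reconstruction error plus an L1 term that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(v) for v in f)

# Toy example: 2-d activation, 4 dictionary features (2x overcomplete).
x = [1.0, -0.5]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(f)      # only a few features fire: [1.0, 0.0, 0.0, 0.5]
print(x_hat)  # linear reconstruction of x: [1.0, -0.5]
```

The interpretability payoff is that each row of ''W_dec'' is a dictionary direction in activation space, and the few nonzero entries of ''f'' say which directions are active on a given input — the "features" that the Towards Monosemanticity and Scaling Monosemanticity papers then label and study.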