====== Mechanistic Interpretability ======

Mechanistic interpretability research was being done in NLP under other names before the term was coined. See [[https://arxiv.org/pdf/2410.09087|Mechanistic?]] for important historical context.

===== Overviews =====

  * [[https://arxiv.org/pdf/2404.14082|Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety - A Review]]
  * [[https://arxiv.org/pdf/2407.02646|Rai et al 2024 - A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models]] Paper list: [[https://github.com/Dakingrai/awesome-mechanistic-interpretability-lm-papers|github]]
  * [[https://arxiv.org/pdf/2501.16496|Sharkey et al 2025 - Open Problems in Mechanistic Interpretability]]
  * [[https://arxiv.org/pdf/2502.17516|Lin et al 2025 - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models]]

===== Papers =====

See also the papers in [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]], [[Neural Network Psychology]], [[nlp:Probing Experiments]], and [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]].
  * [[https://arxiv.org/pdf/2211.00593|Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small]]
  * **[[https://arxiv.org/pdf/2304.14997|Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability]]**
  * [[https://arxiv.org/pdf/2304.14767|Geva et al 2023 - Dissecting Recall of Factual Associations in Auto-Regressive Language Models]]
  * [[https://arxiv.org/pdf/2305.00586|Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model]]
  * [[https://arxiv.org/pdf/2404.14349|Rajaram et al 2024 - Automatic Discovery of Visual Circuits]]
  * [[https://openreview.net/pdf?id=PRZ89QElAv|2025 - The role of Mechanistic Interpretability in ’unveiling’ the emergent representations of Large Language Models]] (This paper was rejected from TMLR; see the [[https://openreview.net/forum?id=PRZ89QElAv|reviews]].)
  * [[https://arxiv.org/pdf/2405.13868|Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs]]
  * [[https://arxiv.org/pdf/2505.16538|Nie et al 2025 - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models]]
  * **Induction Heads**
    * [[https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html|Olsson et al 2022 - In-context Learning and Induction Heads]] See the first figure for a description of an induction head.
    * [[https://arxiv.org/pdf/2404.07129|Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation]] See figure 1a for an induction head example.
    * [[https://arxiv.org/pdf/2504.00132|Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B]]
    * [[https://arxiv.org/pdf/2505.16694|Minegishi et al 2025 - Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence]]
    * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]
  * **Explaining Neurons**
    * [[https://transluce.org/neuron-descriptions|Choi et al 2024 - Scaling Automatic Neuron Description]] They released a [[https://github.com/TransluceAI/observatory|database]] of descriptions of every neuron inside Llama-3.1-8B-Instruct.

===== Sparse Autoencoders =====

This section may eventually be moved to its own page. See also [[nlp:language_model#steering|Language Model - Steering]].

  * [[https://arxiv.org/pdf/2309.08600|Cunningham et al 2023 - Sparse Autoencoders Find Highly Interpretable Features in Language Models]]
  * [[https://transformer-circuits.pub/2023/monosemantic-features|2023 - Towards Monosemanticity: Decomposing Language Models With Dictionary Learning]]
  * [[https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html|2024 - Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet]]

===== Resources =====

  * **Research Threads and Blogs**
    * [[https://transformer-circuits.pub/|Transformer Circuits Thread]]
    * [[https://transluce.org/our-work|Transluce's Research Thread]]
  * **Companies**
    * [[https://www.anthropic.com/|Anthropic]]
    * [[https://transluce.org/|Transluce]]

===== Related Pages =====

  * [[nlp:Alignment]]
  * [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]]
  * [[nlp:Explainability]]
  * [[nlp:LLM Safety]]
  * [[Neural Network Psychology]]
  * [[nlp:Probing Experiments]]
  * [[Trustworthy AI]]
  * [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]]
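As a quick orientation to the papers listed under Sparse Autoencoders above: they share one core recipe — encode a model activation into an overcomplete, mostly-zero feature vector, decode it back linearly, and train with a reconstruction loss plus an L1 sparsity penalty. Below is a minimal sketch of that forward pass and loss in plain Python. The toy dimensions and hand-picked weights are illustrative assumptions, not from any of the cited papers; real SAEs are trained on cached model activations at scale.

```python
# Minimal sparse-autoencoder (SAE) sketch: activation x -> sparse
# features f -> reconstruction x_hat, with an L1 sparsity penalty.
# All weights below are toy values chosen by hand for illustration.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative, overcomplete features f; decode linearly."""
    f = relu(vadd(matvec(W_enc, x), b_enc))
    x_hat = vadd(matvec(W_dec, f), b_dec)
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=0.01):
    """Squared reconstruction error plus an L1 term that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(v) for v in f)

# Toy example: 2-d activation, 4 dictionary features (2x overcomplete).
x = [1.0, -0.5]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(f)      # only a few features fire: [1.0, 0.0, 0.0, 0.5]
print(x_hat)  # linear reconstruction of x: [1.0, -0.5]
```

The interpretability payoff is that each row of ''W_dec'' is a dictionary direction in activation space, and the few nonzero entries of ''f'' say which directions are active on a given input — the "features" that the Towards Monosemanticity and Scaling Monosemanticity papers then label and study.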