Mechanistic Interpretability

Mechanistic interpretability research was carried out in NLP under other names before the term was coined. See Mechanistic? for important historical context.

Overviews

Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety - A Review
Rai et al 2024 - A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (paper list: github)
Sharkey et al 2025 - Open Problems in Mechanistic Interpretability
Lin et al 2025 - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Papers

See also the papers in BERTology, Neural Network Psychology, Probing Experiments, Transformers - Analysis and Interpretation.

Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability
Geva et al 2023 - Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Rajaram et al 2024 - Automatic Discovery of Visual Circuits
2025 - The role of Mechanistic Interpretability in ’unveiling’ the emergent representations of Large Language Models (This paper was rejected by TMLR; see reviews)
Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Nie et al 2025 - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Induction Heads
- Olsson et al 2022 - In-context Learning and Induction Heads (See the first figure for a description of an induction head.)
- Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (See figure 1a for an induction head example.)
- Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
- Minegishi et al 2025 - Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence
- Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?
Explaining Neurons
- Choi et al 2024 - Scaling Automatic Neuron Description (They released a database of descriptions of every neuron in Llama-3.1-8B-Instruct.)
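
For readers new to the Induction Heads entries above: the behavior usually attributed to an induction head is completing the pattern [A][B] ... [A] with [B], that is, finding an earlier occurrence of the current token and copying the token that followed it. The snippet below is only a toy, rule-based sketch of that input-output behavior and not code from any of the listed papers; the function name `induction_prediction` is hypothetical, and a real induction head implements this softly through attention rather than by exact string matching.

```python
from typing import Optional, Sequence


def induction_prediction(tokens: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the behavior of an induction head (hypothetical helper).

    Looks for the most recent earlier occurrence of the last token and
    returns the token that followed it, or None if there is no such match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from the most recent back to the start.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy the token that came right after the earlier match.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # [A][B] ... [A] -> the sketch predicts [B]
    print(induction_prediction(["A", "B", "C", "A"]))
```

Sequences with no repeated token return `None`, which mirrors the idea that the pattern only applies when the current token has appeared earlier in the context.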

Sparse Autoencoders

This section should maybe be moved into its own page.

Resources

Research Threads and Blogs
- Transformer Circuits Thread
- Transluce's Research Thread
Companies
- Anthropic
- Transluce
