Mechanistic Interpretability
Mechanistic interpretability research was done in NLP under other names before the term was coined. See Mechanistic? for important historical context.
Overviews
Papers
See also the papers in BERTology, Neural Network Psychology, Probing Experiments, Transformers - Analysis and Interpretation.
- 2025 - The role of Mechanistic Interpretability in ’unveiling’ the emergent representations of Large Language Models (This paper was rejected from TMLR, see reviews)
- Induction Heads
- Olsson et al 2022 - In-context Learning and Induction Heads See the first figure for a description of an induction head.
- Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation See figure 1a for an induction head example.
- Explaining Neurons
- Choi et al 2024 - Scaling Automatic Neuron Description Releases a database of descriptions for every neuron in Llama-3.1-8B-Instruct.
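The induction-head papers above describe a simple behavioral pattern: having seen "... A B ... A", the head predicts "B" by attending to the token that followed the previous occurrence of the current token. A minimal sketch of that lookup behavior (purely illustrative, not a transformer implementation; function name and example tokens are invented here):

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent previous
    occurrence of the last token and copying what followed it.
    Returns None if the last token has not appeared before."""
    last = tokens[-1]
    # Scan earlier positions right-to-left for a previous match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_predict(["the", "cat", "sat", "the"]))  # -> cat
print(induction_predict(["a", "b", "c"]))               # -> None
```

This is the "prefix matching + copying" behavior Olsson et al. measure; an actual induction head implements it softly via attention over a previous-token head's output.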
Sparse Autoencoders
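As a rough orientation: a sparse autoencoder (SAE) in this context is an overcomplete dictionary trained to reconstruct model activations from sparsely active features. A minimal forward-pass sketch, assuming the common ReLU-encoder + linear-decoder + L1-penalty setup (dimensions and hyperparameters here are illustrative, not from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 8, 32  # overcomplete: d_features > d_model

W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU encoder -> sparse codes
    x_hat = f @ W_dec + b_dec               # linear decoder
    return f, x_hat

x = rng.normal(size=d_model)                # stand-in for a model activation
f, x_hat = sae_forward(x)
# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(f))
print(f.shape, x_hat.shape)
```

After training, individual feature directions (columns of `W_enc` / rows of `W_dec`) are inspected as candidate interpretable units.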
Resources
- Research Threads and Blogs
- Companies
Related Pages
ml/mechanistic_interpretability.txt · Last modified: 2025/06/02 11:23 by jmflanig