Mechanistic Interpretability

Research of this kind was carried out in NLP under other names before the term "mechanistic interpretability" was coined. See Mechanistic? for important historical context.

Overviews

Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety - A Review
Rai et al 2024 - A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (paper list: github)
Sharkey et al 2025 - Open Problems in Mechanistic Interpretability
Lin et al 2025 - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Papers

See also the papers in BERTology, Neural Network Psychology, Probing Experiments, Transformers - Analysis and Interpretation.

Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability
Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Rajaram et al 2024 - Automatic Discovery of Visual Circuits
2025 - The role of Mechanistic Interpretability in 'unveiling' the emergent representations of Large Language Models (this paper was rejected from TMLR; see reviews)
Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Induction Heads
- Olsson et al 2022 - In-context Learning and Induction Heads. See the first figure for a description of an induction head.
- Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. See figure 1a for an example of an induction head.
Explaining Neurons
- Choi et al 2024 - Scaling Automatic Neuron Description

Sparse Autoencoders

This section should maybe be moved into its own page.
