Mechanistic Interpretability
Mechanistic interpretability research was done in NLP under other names before the term was coined. See Mechanistic? for important historical context.
Overviews
Papers
See also the papers in BERTology, Neural Network Psychology, Probing Experiments, Transformers - Analysis and Interpretation.
- 2025 - The role of Mechanistic Interpretability in ’unveiling’ the emergent representations of Large Language Models (This paper was rejected from TMLR, see reviews)
- Induction Heads
- Olsson et al 2022 - In-context Learning and Induction Heads See the first figure for a description of an induction head.
- Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation See figure 1a for an induction head example.
- Explaining Neurons
- Choi et al 2024 - Scaling Automatic Neuron Description Releases a database of descriptions for every neuron in Llama-3.1-8B-Instruct.
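The induction-head papers above describe a simple behavioral pattern: having seen "... A B ... A", the head predicts "B" by attending to the token that followed the previous occurrence of the current token. A minimal sketch of that lookup behavior (purely illustrative, not a transformer implementation; function name and example tokens are invented here):

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent previous
    occurrence of the last token and copying what followed it.
    Returns None if the last token has not appeared before."""
    last = tokens[-1]
    # Scan earlier positions right-to-left for a previous match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_predict(["the", "cat", "sat", "the"]))  # -> cat
print(induction_predict(["a", "b", "c"]))               # -> None
```

This is the "prefix matching + copying" behavior Olsson et al. measure; an actual induction head implements it softly via attention over a previous-token head's output.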
Sparse Autoencoders
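As a rough orientation: a sparse autoencoder (SAE) in this context is an overcomplete dictionary trained to reconstruct model activations from sparsely active features. A minimal forward-pass sketch, assuming the common ReLU-encoder + linear-decoder + L1-penalty setup (dimensions and hyperparameters here are illustrative, not from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 8, 32  # overcomplete: d_features > d_model

W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU encoder -> sparse codes
    x_hat = f @ W_dec + b_dec               # linear decoder
    return f, x_hat

x = rng.normal(size=d_model)                # stand-in for a model activation
f, x_hat = sae_forward(x)
# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(f))
print(f.shape, x_hat.shape)
```

After training, individual feature directions (columns of `W_enc` / rows of `W_dec`) are inspected as candidate interpretable units.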
Resources
- Research Threads and Blogs
- Companies
Related Pages
ml/mechanistic_interpretability.txt · Last modified: 2025/06/02 11:23 by jmflanig