Mechanistic Interpretability

Mechanistic interpretability research was carried out in NLP under other names before the term itself was coined. See Mechanistic? for important historical context.

Overviews

Papers

See also the papers in BERTology, Neural Network Psychology, Probing Experiments, Transformers - Analysis and Interpretation.

Sparse Autoencoders

This section may eventually be moved to its own page.

See also Language Model - Steering.

Resources