====== Mechanistic Interpretability ======

===== Papers =====
  * [[https://arxiv.org/pdf/2211.00593|Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small]]
  * **[[https://arxiv.org/pdf/2304.14997|Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability]]**
  * [[https://arxiv.org/pdf/2304.14767|Geva et al 2023 - Dissecting Recall of Factual Associations in Auto-Regressive Language Models]]
  * [[https://arxiv.org/pdf/2305.00586|Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model]]
  * [[https://arxiv.org/pdf/2404.14349|Rajaram et al 2024 - Automatic Discovery of Visual Circuits]]
  * [[https://openreview.net/pdf?id=PRZ89QElAv|2025 - The role of Mechanistic Interpretability in 'unveiling' the emergent representations of Large Language Models]] (This paper was rejected from TMLR; see the [[https://openreview.net/forum?id=PRZ89QElAv|reviews]].)
  * [[https://arxiv.org/pdf/2405.13868|Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs]]
  * [[https://arxiv.org/pdf/2505.16538|Nie et al 2025 - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models]]
  * **Induction Heads**
    * [[https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html|Olsson et al 2022 - In-context Learning and Induction Heads]] See the first figure for the description of an induction head.
    * [[https://arxiv.org/pdf/2404.07129|Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation]] See Figure 1a for an induction head example.
    * [[https://arxiv.org/pdf/2504.00132|Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B]]
    * [[https://arxiv.org/pdf/2505.16694|Minegishi et al 2025 - Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence]]
    * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]
  * **Explaining Neurons**
    * [[https://transluce.org/neuron-descriptions|Choi et al 2024 - Scaling Automatic Neuron Description]] They released a [[https://github.com/TransluceAI/observatory|database]] of descriptions of every neuron inside Llama-3.1-8B-Instruct.
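The induction-head papers above all describe the same pattern-completion behavior: on seeing a token, the head attends to an earlier occurrence of that token and predicts the token that followed it. A toy sketch of that behavior in plain Python (not actual transformer internals; the function name is my own):

```python
# Toy sketch of the behavior an induction head implements (Olsson et al 2022):
# on seeing token A, find an earlier occurrence of A in the context and
# predict the token that followed it ("[A][B] ... [A] -> predict [B]").
def induction_predict(tokens):
    """Prediction a perfect induction head would make for the next token,
    or None if the current token has no earlier occurrence."""
    current = tokens[-1]
    # scan backwards for the most recent earlier occurrence of the current token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the successor of the earlier occurrence
    return None

print(induction_predict(["The", "cat", "sat", "on", "the", "mat", "The", "cat"]))  # prints: sat
```

Real induction heads approximate this lookup with a prefix-matching attention pattern plus a copying OV circuit; the papers above study when and how that circuit forms during training.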
  
===== Sparse Autoencoders =====
This section should maybe be moved into its own page.

See also [[nlp:language_model#steering|Language Model - Steering]]
  
  * [[https://arxiv.org/pdf/2309.08600|Cunningham et al 2023 - Sparse Autoencoders Find Highly Interpretable Features in Language Models]]
  * [[https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html|2024 - Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet]]
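The papers above train sparse autoencoders on model activations: an overcomplete ReLU encoder, a linear decoder, and a loss trading reconstruction error against an L1 sparsity penalty. A minimal NumPy sketch of that objective; the dimensions, penalty weight, and random data are illustrative, not taken from any paper:

```python
import numpy as np

# Minimal sparse-autoencoder sketch: overcomplete ReLU encoder, linear
# decoder, loss = reconstruction MSE + lambda * L1(codes).
# All sizes and the penalty weight are illustrative choices.
rng = np.random.default_rng(0)
d_model, d_dict = 16, 64                 # dictionary is overcomplete (64 > 16)
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU -> nonnegative, sparse codes
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, x_hat

def sae_loss(x, lam=1e-3):
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)        # reconstruction term
    sparsity = np.mean(np.abs(f))            # L1 penalty pushes codes toward zero
    return recon + lam * sparsity

x = rng.normal(size=(8, d_model))            # stand-in for residual-stream activations
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape, (f == 0).mean())  # many codes are exactly zero
```

Interpretability work then inspects the dictionary directions (rows of the decoder) as candidate monosemantic features; the sketch omits training, which the papers do with gradient descent on this loss.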
  
===== Resources =====
  * **Research Threads and Blogs**
    * [[https://transformer-circuits.pub/|Transformer Circuits Thread]]
    * [[https://transluce.org/our-work|Transluce's Research Thread]]
  * **Companies**
    * [[https://www.anthropic.com/|Anthropic]]
    * [[https://transluce.org/|Transluce]]
  
===== Related Pages =====
  * [[nlp:Alignment]]
  * [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]]
  * [[nlp:Explainability]]
  * [[nlp:LLM Safety]]
  * [[Neural Network Psychology]]
  * [[nlp:Probing Experiments]]
  * [[Trustworthy AI]]
  * [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]]
ml/mechanistic_interpretability.txt · Last modified: 2025/06/02 11:23 by jmflanig
