====== Mechanistic Interpretability ======

===== Papers =====
  * [[https://arxiv.org/pdf/2211.00593|Wang et al 2022 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small]]
  * **[[https://arxiv.org/pdf/2304.14997|Conmy et al 2023 - Towards Automated Circuit Discovery for Mechanistic Interpretability]]**
  * [[https://arxiv.org/pdf/2304.14767|Geva et al 2023 - Dissecting Recall of Factual Associations in Auto-Regressive Language Models]]
  * [[https://arxiv.org/pdf/2305.00586|Hanna et al 2023 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model]]
  * [[https://arxiv.org/pdf/2404.14349|Rajaram et al 2024 - Automatic Discovery of Visual Circuits]]
  * [[https://openreview.net/pdf?id=PRZ89QElAv|2025 - The role of Mechanistic Interpretability in 'unveiling' the emergent representations of Large Language Models]] (This paper was rejected from TMLR; see the [[https://openreview.net/forum?id=PRZ89QElAv|reviews]].)
  * [[https://arxiv.org/pdf/2405.13868|Ge et al 2024 - Automatically Identifying Local and Global Circuits with Linear Computation Graphs]]
  * [[https://arxiv.org/pdf/2505.16538|Nie et al 2025 - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models]]
  * **Induction Heads**
    * [[https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html|Olsson et al 2022 - In-context Learning and Induction Heads]] See the first figure for the description of an induction head.
    * [[https://arxiv.org/pdf/2404.07129|Singh et al 2024 - What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation]] See Figure 1a for an induction head example.
    * [[https://arxiv.org/pdf/2504.00132|Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B]]
    * [[https://arxiv.org/pdf/2505.16694|Minegishi et al 2025 - Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence]]
    * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]
  * **Explaining Neurons**
    * [[https://transluce.org/neuron-descriptions|Choi et al 2024 - Scaling Automatic Neuron Description]] They released a [[https://github.com/TransluceAI/observatory|database]] of descriptions of every neuron inside Llama-3.1-8B-Instruct.
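The induction-head papers above all describe the same pattern-completion behavior: on seeing a token, the head attends to an earlier occurrence of that token and predicts the token that followed it. A toy sketch of that behavior in plain Python (not actual transformer internals; the function name is my own):

```python
# Toy sketch of the behavior an induction head implements (Olsson et al 2022):
# on seeing token A, find an earlier occurrence of A in the context and
# predict the token that followed it ("[A][B] ... [A] -> predict [B]").
def induction_predict(tokens):
    """Prediction a perfect induction head would make for the next token,
    or None if the current token has no earlier occurrence."""
    current = tokens[-1]
    # scan backwards for the most recent earlier occurrence of the current token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the successor of the earlier occurrence
    return None

print(induction_predict(["The", "cat", "sat", "on", "the", "mat", "The", "cat"]))  # prints: sat
```

Real induction heads approximate this lookup with a prefix-matching attention pattern plus a copying OV circuit; the papers above study when and how that circuit forms during training.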
  
===== Sparse Autoencoders =====
This section should maybe be moved into its own page.

See also [[nlp:language_model#steering|Language Model - Steering]]
  
  * [[https://arxiv.org/pdf/2309.08600|Cunningham et al 2023 - Sparse Autoencoders Find Highly Interpretable Features in Language Models]]
  * [[https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html|2024 - Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet]]
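The papers above train sparse autoencoders on model activations: an overcomplete ReLU encoder, a linear decoder, and a loss trading reconstruction error against an L1 sparsity penalty. A minimal NumPy sketch of that objective; the dimensions, penalty weight, and random data are illustrative, not taken from any paper:

```python
import numpy as np

# Minimal sparse-autoencoder sketch: overcomplete ReLU encoder, linear
# decoder, loss = reconstruction MSE + lambda * L1(codes).
# All sizes and the penalty weight are illustrative choices.
rng = np.random.default_rng(0)
d_model, d_dict = 16, 64                 # dictionary is overcomplete (64 > 16)
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU -> nonnegative, sparse codes
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, x_hat

def sae_loss(x, lam=1e-3):
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)        # reconstruction term
    sparsity = np.mean(np.abs(f))            # L1 penalty pushes codes toward zero
    return recon + lam * sparsity

x = rng.normal(size=(8, d_model))            # stand-in for residual-stream activations
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape, (f == 0).mean())  # many codes are exactly zero
```

Interpretability work then inspects the dictionary directions (rows of the decoder) as candidate monosemantic features; the sketch omits training, which the papers do with gradient descent on this loss.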
  
===== Resources =====
  * **Research Threads and Blogs**
    * [[https://transformer-circuits.pub/|Transformer Circuits Thread]]
    * [[https://transluce.org/our-work|Transluce's Research Thread]]
  * **Companies**
    * [[https://www.anthropic.com/|Anthropic]]
    * [[https://transluce.org/|Transluce]]
  
===== Related Pages =====
  * [[nlp:Alignment]]
  * [[nlp:bert_and_friends#interpretation_and_properties_bertology|BERTology]]
  * [[nlp:Explainability]]
  * [[nlp:LLM Safety]]
  * [[Neural Network Psychology]]
  * [[nlp:Probing Experiments]]
  * [[Trustworthy AI]]
  * [[nlp:transformers#Analysis and Interpretation|Transformers - Analysis and Interpretation]]
ml/mechanistic_interpretability.txt · Last modified: 2025/06/02 11:23 by jmflanig
