====== Explainability ======

Explainability can be crucial for the adoption of automatic methods. For example, without an explanation for a diagnosis, doctors are highly unlikely to use an automatic diagnosis system. Explainability is an open problem for machine learning and NLP (see [[open problems]]).

See also [[https://en.wikipedia.org/wiki/Explainable_artificial_intelligence|Wikipedia - Explainable AI]].

===== Explainability in Neural Networks =====

==== Surveys ====

  * [[https://arxiv.org/pdf/2012.14261.pdf|Zhang et al 2020 - A Survey on Neural Network Interpretability]]
  * [[https://arxiv.org/pdf/2010.00389.pdf|Thayaparan et al 2020 - A Survey on Explainability in Machine Reading Comprehension]]
  * [[https://arxiv.org/pdf/2207.13243|Rauker et al 2022 - Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks]]

==== Papers ====

  * LIME: [[https://arxiv.org/pdf/1602.04938.pdf|Ribeiro et al 2016 - "Why Should I Trust You?": Explaining the Predictions of Any Classifier]] A very common method that works well in practice; a good baseline.
  * [[https://arxiv.org/pdf/1612.08220.pdf|Li et al 2016 - Understanding Neural Networks through Representation Erasure]]
  * [[https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/fong17interpretable.pdf|Fong 2017 - Interpretable Explanations of Black Boxes by Meaningful Perturbation]]
  * [[https://arxiv.org/pdf/1703.04730.pdf|Koh & Liang 2017 - Understanding Black-box Predictions via Influence Functions]]
  * [[https://arxiv.org/pdf/1902.10186.pdf|Jain & Wallace 2019 - Attention is not Explanation]] and [[https://arxiv.org/pdf/1908.04626.pdf|Wiegreffe & Pinter 2019 - Attention is not not Explanation]]
  * [[https://arxiv.org/pdf/1904.00045.pdf|Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing]] Reframes black-box model interpretability as a multiple hypothesis testing problem: "The task is to discover “important” features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals."
  * [[https://arxiv.org/pdf/2005.06676.pdf|Han et al 2020 - Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions]]
  * Pruning can be used for interpretability; see the use of SparseVD in [[https://arxiv.org/pdf/2110.03192.pdf|Wang et al 2021 - GNN is a Counter? Revisiting GNN for Question Answering]]
  * [[https://arxiv.org/pdf/2302.02169.pdf|Yang et al 2023 - How Many and Which Training Points Would Need to be Removed to Flip this Prediction?]]

Jeff's opinion: I have reservations about the gradient-based methods, because a small response to an infinitesimal change doesn't necessarily mean a feature is unimportant - an important feature can saturate the activation function and land on a flat spot in the gradient. I prefer methods like [[https://arxiv.org/pdf/1612.08220.pdf|Li et al 2016 - Understanding Neural Networks through Representation Erasure]] and [[https://arxiv.org/pdf/1904.00045.pdf|Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing]].
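The erasure idea from Li et al 2016 is easy to sketch: score each feature by how much the model's output changes when that feature is replaced with an uninformative baseline. This is a minimal sketch; the toy linear model and its weights are hypothetical stand-ins for a real trained network:

```python
# Erasure-based feature importance, in the spirit of Li et al 2016:
# importance of feature i = |f(x) - f(x with feature i erased)|.
# `model` is a hypothetical stand-in for a trained network.

def model(x):
    # Toy linear scorer; in practice this would call the actual network.
    w = [2.0, -1.0, 0.0, 4.0]
    return sum(wi * xi for wi, xi in zip(w, x))

def erasure_importance(f, x, baseline=0.0):
    """Score each feature by the output change when it is erased."""
    base = f(x)
    scores = []
    for i in range(len(x)):
        erased = list(x)
        erased[i] = baseline  # replace with an uninformative value
        scores.append(abs(base - f(erased)))
    return scores

# For this linear model with all inputs at 1.0, the scores recover |w_i|:
print(erasure_importance(model, [1.0, 1.0, 1.0, 1.0]))  # → [2.0, 1.0, 0.0, 4.0]
```

Unlike an input gradient, this probes a finite change, so it still registers features whose contribution saturates the activation into a flat gradient region.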
==== Converting Neural Networks to Decision Trees ====

Overview blog post: [[https://bair.berkeley.edu/blog/2020/04/23/decisions/|2020 - Making Decision Trees Accurate Again: Explaining What Explainable AI Did Not]]

  * [[https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.6788&rep=rep1&type=pdf|Boz 2000 - Converting A Trained Neural Network To A Decision Tree: DecText - Decision Tree Extractor]]
  * [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.6011&rep=rep1&type=pdf|Boz 2002 - Extracting Decision Trees From Trained Neural Networks]]
  * [[https://arxiv.org/pdf/1711.09784.pdf|Frosst & Hinton 2017 - Distilling a Neural Network Into a Soft Decision Tree]]
  * [[https://arxiv.org/pdf/2004.00221.pdf|Wan et al 2020 - NBDT: Neural-Backed Decision Trees]]

===== Explainable NLP =====

  * [[https://arxiv.org/pdf/2009.06354.pdf|Lamm et al 2020 - QED: A Framework and Dataset for Explanations in Question Answering]]

===== Interpretability and Explainability in LLMs =====

  * **Overviews**
    * [[https://arxiv.org/pdf/2401.12874|Luo & Specia 2024 - From Understanding to Utilization: A Survey on Explainability for Large Language Models]]
    * [[https://arxiv.org/pdf/2402.10688|Zhao et al 2024 - Towards Uncovering How Large Language Model Works: An Explainability Perspective]] An OK paper, but it cites almost no work from before 2021 or from outside the mechanistic interpretability literature.
  * **Resources**
    * [[https://burnycoder.github.io/Landing/Contents/Exobrain/Topics/Mechanistic%20interpretability/|Paper list]]
  * **Papers**
    * [[https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html|Bills et al 2023 - Language models can explain neurons in language models]]
    * [[https://arxiv.org/pdf/2305.08809|Wu et al 2023 - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca]]
    * [[https://arxiv.org/pdf/2305.19911|Foote et al 2023 - Neuron to Graph: Interpreting Language Model Neurons at Scale]]

===== Natural Language Explanations =====

  * For NLI, see [[nlp:entailment#Entailment - Natural Language Explanations]]
  * [[https://arxiv.org/pdf/2112.08674.pdf|Wiegreffe et al 2021 - Reframing Human-AI Collaboration for Generating Free-Text Explanations]]
  * **On out-of-domain data**
    * [[https://aclanthology.org/2021.insights-1.17.pdf|Zhou and Tan 2021 - Investigating the Effect of Natural Language Explanations on Out-of-Distribution Generalization in Few-shot NLI]]
    * [[https://aclanthology.org/2022.acl-long.477.pdf|Chrysostomou & Aletras 2022 - An Empirical Study on Explanations in Out-of-Domain Settings]] Studies explanations for text classification.
  * **Making it more robust**
    * [[https://arxiv.org/pdf/2305.04990.pdf|Ludan et al 2023 - Explanation-based Finetuning Makes Models More Robust to Spurious Cues]]

===== Evaluating Explanations =====

  * Overview: [[https://arxiv.org/pdf/2004.03685.pdf|Jacovi & Goldberg 2020 - Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?]]
  * [[https://arxiv.org/pdf/2012.00893.pdf|Pruthi et al 2020 - Evaluating Explanations: How much do explanations from the teacher aid students?]]

===== Conferences, Workshops, and Shared Tasks =====

  * [[https://blackboxnlp.github.io/|BlackBoxNLP Workshop]]
    * 2020: [[https://blackboxnlp.github.io/2020/|website]] [[https://virtual.2020.emnlp.org/workshop_WS-25.html|papers]]
    * 2019: [[https://blackboxnlp.github.io/2019/|website]] [[https://www.aclweb.org/anthology/volumes/W19-48/|papers]]
    * 2018: [[https://blackboxnlp.github.io/2018/|website]] [[https://www.aclweb.org/anthology/volumes/W18-54/|papers]]

===== Related Pages =====

  * [[ml:Mechanistic Interpretability]]
  * [[ml:Neural Network Psychology]]
  * [[Probing Experiments]]
  * [[Reasoning#Reasoning Chains|Reasoning - Reasoning Chains]]
  * [[ml:Trustworthy AI]]
  * [[ml:Visualizing Neural Networks]]
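The surrogate-model idea behind the "Converting Neural Networks to Decision Trees" papers above can be sketched in a few lines: query the black-box model on sampled inputs, then fit an interpretable model (here, a one-split decision stump found by exhaustive search) to mimic its predictions. Everything below - the black-box function, the sampled data - is a hypothetical stand-in, not any paper's actual method:

```python
# Surrogate-model sketch of decision-tree extraction: label sampled inputs
# with a black-box model's predictions, then fit a one-split decision stump
# to those labels and report how faithfully it mimics the black box.
import random

def black_box(x):
    # Pretend trained network with a mildly nonlinear decision boundary.
    return 1 if 0.8 * x[0] - 0.5 * x[1] + 0.1 * x[0] * x[1] > 0 else 0

def fit_stump(inputs, labels):
    """Exhaustively pick (feature, threshold, flip) minimizing disagreement."""
    best = None  # (errors, feature, threshold, flip)
    for f in range(len(inputs[0])):
        for t in {x[f] for x in inputs}:  # candidate thresholds = seen values
            err = sum((1 if x[f] > t else 0) != y
                      for x, y in zip(inputs, labels))
            for flip, e in ((False, err), (True, len(labels) - err)):
                if best is None or e < best[0]:
                    best = (e, f, t, flip)
    return best[1:]  # feature, threshold, flip

def stump_predict(x, feat, thresh, flip):
    p = 1 if x[feat] > thresh else 0
    return 1 - p if flip else p

random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [black_box(x) for x in data]  # distillation targets from the model
feat, thresh, flip = fit_stump(data, labels)
fidelity = sum(stump_predict(x, feat, thresh, flip) == y
               for x, y in zip(data, labels)) / len(labels)
print(f"surrogate rule: x[{feat}] > {thresh:.2f}, fidelity = {fidelity:.2f}")
```

The stump is scored by fidelity to the black box's predictions, not accuracy on any true labels; deeper surrogate trees (as in Frosst & Hinton 2017) trade interpretability for higher fidelity.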