Explainability
Explainability can be crucial for the adoption of automatic methods. For example, doctors are highly unlikely to use an automatic diagnosis system that cannot explain its diagnoses. Explainability remains an open problem for machine learning and NLP (see open problems). See also Wikipedia - Explainable AI.
Explainability in Neural Networks
Surveys
Papers
- LIME: Ribeiro et al 2016 - "Why Should I Trust You?": Explaining the Predictions of Any Classifier. A very common method that works pretty well; a good baseline.
- Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing. Reframes “black box model interpretability as a multiple hypothesis testing problem. The task is to discover ‘important’ features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals.”
- Pruning can be used for interpretability, see the use of SparseVD here: Wang et al 2021 - GNN is a Counter? Revisiting GNN for Question Answering
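The LIME idea from the first bullet above can be sketched in a few lines: perturb the input by randomly masking tokens, query the black box on each perturbation, and fit an interpretable linear surrogate to the local predictions. This is a minimal sketch under assumed simplifications (a hypothetical toy classifier, unweighted least squares instead of LIME's proximity-kernel weighting):

```python
import random
import numpy as np

def toy_classifier(tokens):
    # Hypothetical black box standing in for any text classifier:
    # sentiment probability driven by the word "great".
    score = sum(t == "great" for t in tokens)
    return 1.0 / (1.0 + np.exp(-(2.0 * score - 1.0)))

def lime_style_weights(tokens, predict, n_samples=500, seed=0):
    """Fit a local linear surrogate on random token-masking perturbations."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in tokens]  # which tokens to keep
        X.append([float(m) for m in mask])
        y.append(predict([t for t, m in zip(tokens, mask) if m]))
    A = np.hstack([np.array(X), np.ones((n_samples, 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, np.array(y), rcond=None)
    return dict(zip(tokens, coef[:-1]))  # per-token surrogate weights

weights = lime_style_weights("the movie was great".split(), toy_classifier)
```

The surrogate's coefficient for "great" dominates because masking it is the only perturbation that moves the toy model's output; the real LIME additionally weights samples by their proximity to the original input.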
Jeff's opinion: I have reservations about the gradient-based methods because a small effect from an infinitesimal change doesn't necessarily mean a feature isn't important - it could be important but saturate the activation function, producing a flat spot in the gradient. I prefer methods like Li et al 2016 - Understanding Neural Networks through Representation Erasure and Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing.
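Erasure-based importance in the style of Li et al 2016 can be sketched as follows: measure the finite change in the prediction when each token is removed, which sidesteps the saturated-gradient issue noted above. The classifier here is a hypothetical toy model assumed for illustration:

```python
import numpy as np

def toy_classifier(tokens):
    # Hypothetical black box (assumed for illustration): a sigmoid
    # score that reacts only to occurrences of the word "great".
    score = sum(t == "great" for t in tokens)
    return 1.0 / (1.0 + np.exp(-(2.0 * score - 1.0)))

def erasure_importance(tokens, predict):
    """Importance of each position = prediction drop when that token is erased."""
    base = predict(tokens)
    return {
        (i, t): base - predict(tokens[:i] + tokens[i + 1:])
        for i, t in enumerate(tokens)
    }

imp = erasure_importance("the movie was great".split(), toy_classifier)
```

Unlike a gradient, the erasure score compares two actual forward passes, so a feature whose effect is hidden behind a flat spot in the activation still registers a nonzero drop.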
Converting Neural Networks to Decision Trees
Overview blog post: 2020 - Making Decision Trees Accurate Again: Explaining What Explainable AI Did Not
Explainable NLP
Interpretability and Explainability In LLMs
- Overviews
- Zhao et al 2024 - Towards Uncovering How Large Language Model Works: An Explainability Perspective This is an ok paper, but it cites almost none of the work before 2021 or work outside of the mechanistic interpretability literature.
- Resources
- Papers
Natural Language Explanations
- For NLI, see Entailment - Natural Language Explanations
- On out-of-domain data
- Chrysostomou & Aletras 2022 - An Empirical Study on Explanations in Out-of-Domain Settings. Studies explanations for text classification.
- Making it more Robust
Evaluating Explanations
Overview: Jacovi 2020