Explainability
Explainability can be crucial for the adoption of automatic methods. For example, doctors are highly unlikely to use an automatic diagnosis system that cannot explain its diagnoses. Explainability remains an open problem for machine learning and NLP (see open problems). See also Wikipedia - Explainable AI.
Explainability in Neural Networks
Surveys
Papers
- LIME: Ribeiro et al 2016 - "Why Should I Trust You?": Explaining the Predictions of Any Classifier. A very common method that works pretty well; a good baseline.
- Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing. Reframes “black box model interpretability as a multiple hypothesis testing problem. The task is to discover ‘important’ features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals.”
- Pruning can be used for interpretability, see the use of SparseVD here: Wang et al 2021 - GNN is a Counter? Revisiting GNN for Question Answering
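The LIME idea from the first bullet above can be sketched in a few lines: perturb the input by randomly masking tokens, query the black box on each perturbation, and fit an interpretable linear surrogate to the local predictions. This is a minimal sketch under assumed simplifications (a hypothetical toy classifier, unweighted least squares instead of LIME's proximity-kernel weighting):

```python
import random
import numpy as np

def toy_classifier(tokens):
    # Hypothetical black box standing in for any text classifier:
    # sentiment probability driven by the word "great".
    score = sum(t == "great" for t in tokens)
    return 1.0 / (1.0 + np.exp(-(2.0 * score - 1.0)))

def lime_style_weights(tokens, predict, n_samples=500, seed=0):
    """Fit a local linear surrogate on random token-masking perturbations."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in tokens]  # which tokens to keep
        X.append([float(m) for m in mask])
        y.append(predict([t for t, m in zip(tokens, mask) if m]))
    A = np.hstack([np.array(X), np.ones((n_samples, 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, np.array(y), rcond=None)
    return dict(zip(tokens, coef[:-1]))  # per-token surrogate weights

weights = lime_style_weights("the movie was great".split(), toy_classifier)
```

The surrogate's coefficient for "great" dominates because masking it is the only perturbation that moves the toy model's output; the real LIME additionally weights samples by their proximity to the original input.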
Jeff's opinion: I have reservations about the gradient-based methods because a small effect from an infinitesimal change doesn't necessarily mean a feature isn't important - it could be important but saturate the activation function, producing a flat spot in the gradient. I prefer methods like Li et al 2016 - Understanding Neural Networks through Representation Erasure and Burns et al 2019 - Interpreting Black Box Models via Hypothesis Testing.
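Erasure-based importance in the style of Li et al 2016 can be sketched as follows: measure the finite change in the prediction when each token is removed, which sidesteps the saturated-gradient issue noted above. The classifier here is a hypothetical toy model assumed for illustration:

```python
import numpy as np

def toy_classifier(tokens):
    # Hypothetical black box (assumed for illustration): a sigmoid
    # score that reacts only to occurrences of the word "great".
    score = sum(t == "great" for t in tokens)
    return 1.0 / (1.0 + np.exp(-(2.0 * score - 1.0)))

def erasure_importance(tokens, predict):
    """Importance of each position = prediction drop when that token is erased."""
    base = predict(tokens)
    return {
        (i, t): base - predict(tokens[:i] + tokens[i + 1:])
        for i, t in enumerate(tokens)
    }

imp = erasure_importance("the movie was great".split(), toy_classifier)
```

Unlike a gradient, the erasure score compares two actual forward passes, so a feature whose effect is hidden behind a flat spot in the activation still registers a nonzero drop.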
Converting Neural Networks to Decision Trees
Overview blog post: 2020 - Making Decision Trees Accurate Again: Explaining What Explainable AI Did Not
Explainable NLP
Interpretability and Explainability In LLMs
- Overviews
- Zhao et al 2024 - Towards Uncovering How Large Language Model Works: An Explainability Perspective This is an ok paper, but it cites almost none of the work before 2021 or work outside of the mechanistic interpretability literature.
- Resources
- Papers
Natural Language Explanations
- For NLI, see Entailment - Natural Language Explanations
- On out-of-domain data
- Chrysostomou & Aletras 2022 - An Empirical Study on Explanations in Out-of-Domain Settings. Studies explanations for text classification.
- Making it more Robust
Evaluating Explanations
Overview: Jacovi 2020