====== Confidence ======

===== Evaluation Measures =====

TODO: literature review for evaluation measures of confidence scores.

===== In NLP =====

(search [[https://www.aclweb.org/anthology/|ACL Anthology]] for "confidence scores")

  * [[https://www.aclweb.org/anthology/N04-4028.pdf|Culotta & McCallum 2004 - Confidence Estimation for Information Extraction]] Uses three metrics to evaluate confidence scores:
    * "Pearson’s r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled."
    * "average precision, used in the Information Retrieval community... the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document"
    * "accuracy-coverage graph. Better confidence estimates push the curve to the upper-right" Analogous to a precision-recall curve; see fig. 1.
  * [[https://www.aclweb.org/anthology/C04-1046.pdf|2004 - Confidence Estimation for Machine Translation]]
  * [[https://www.aclweb.org/anthology/P08-2055.pdf|2008 - Computing Confidence Scores for All Sub Parse Trees]]
  * [[https://www.aclweb.org/anthology/P11-1022.pdf|Nguyen Bach 2011 - Goodness: A Method for Measuring Machine Translation Confidence]] Has a good explanation of MT confidence.
  * [[https://www.aclweb.org/anthology/N12-1068.pdf|2012 - Are You Sure? Confidence in Prediction of Dependency Tree Edges]]
  * [[https://www.aclweb.org/anthology/P18-1069.pdf|2018 - Confidence Modeling for Neural Semantic Parsing]] Measures "the relationship between confidence scores and F1 using Spearman’s ρ correlation coefficient which varies between −1 and 1 (0 implies there is no correlation)."
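The three Culotta & McCallum measures above can be sketched in plain Python. This is a minimal, stdlib-only illustration using toy data, not code or numbers from the paper: confidence scores are correlated with 0/1 correctness (Pearson's r), fields are ranked by confidence for average precision, and an accuracy-coverage curve is traced by keeping only the most confident predictions.

```python
import math

def pearson_r(conf, correct):
    """Pearson correlation between confidence scores and 0/1 correctness."""
    n = len(conf)
    mx, my = sum(conf) / n, sum(correct) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(conf, correct))
    sx = math.sqrt(sum((x - mx) ** 2 for x in conf))
    sy = math.sqrt(sum((y - my) ** 2 for y in correct))
    return cov / (sx * sy)

def average_precision(conf, correct):
    """Rank by confidence; average the precision at each correct item,
    treating a correctly labeled field like a relevant document in IR."""
    ranked = [c for _, c in sorted(zip(conf, correct), key=lambda p: -p[0])]
    precisions, hits = [], 0
    for i, c in enumerate(ranked, start=1):
        if c:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions)

def accuracy_coverage(conf, correct):
    """(coverage, accuracy) points as lower-confidence predictions are
    dropped; better confidence estimates push this curve up and right."""
    ranked = [c for _, c in sorted(zip(conf, correct), key=lambda p: -p[0])]
    n, hits, curve = len(ranked), 0, []
    for i, c in enumerate(ranked, start=1):
        hits += int(c)
        curve.append((i / n, hits / i))
    return curve

# Toy example: five predictions with confidences and correctness labels.
conf = [0.9, 0.8, 0.7, 0.4, 0.2]
correct = [1, 1, 0, 1, 0]
print(pearson_r(conf, correct))
print(average_precision(conf, correct))
print(accuracy_coverage(conf, correct))
```

At full coverage the curve's accuracy equals overall accuracy (3/5 here); restricting to the two most confident predictions yields accuracy 1.0, which is what a useful confidence score buys.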
  * [[https://www.aclweb.org/anthology/W19-8671.pdf|2019 - Modeling Confidence in Sequence-to-Sequence Models]]
  * [[https://www.aclweb.org/anthology/2020.acl-main.188.pdf|2020 - Calibrating Structured Output Predictors for Natural Language Processing]]
  * [[https://arxiv.org/pdf/2006.09462.pdf|2020 - Selective Question Answering under Domain Shift]]
  * [[https://arxiv.org/pdf/2102.08501.pdf|Lahlou et al 2021 - DEUP: Direct Epistemic Uncertainty Prediction]]
  * [[https://arxiv.org/pdf/2204.06546.pdf|Zerva et al 2022 - Better Uncertainty Quantification for Machine Translation Evaluation]] See the related work.

===== Other Areas =====

  * [[https://arxiv.org/pdf/1706.02690.pdf|Liang et al 2017 - Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks]]