Confidence

Evaluation Measures

TODO: literature review for evaluation measures of confidence scores.

In NLP

(search ACL Anthology for “confidence scores”)

Culotta & McCallum 2003 - Confidence Estimation for Information Extraction Uses three evaluation metrics of confidence scores:
- “Pearson’s r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled.”
- “average precision, used in the Information Retrieval community… the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document”
- “accuracy-coverage graph. Better confidence estimates push the curve to the upper-right” Precision-recall curve. See fig 1.
2004 - Confidence Estimation for Machine Translation
2008 - Computing Confidence Scores for All Sub Parse Trees
Nguyen Bach 2011 - Goodness: A Method for Measuring Machine Translation Confidence Has a good explanation of MT confidence
2012 - Are You Sure? Confidence in Prediction of Dependency Tree Edges
2018 - Confidence Modeling for Neural Semantic Parsing Measures “the relationship between confidence scores and F1 using Spearman’s ρ correlation coefficient which varies between −1 and 1 (0 implies there is no correlation).”
2019 - Modeling Confidence in Sequence-to-Sequence Models
2020 - Calibrating Structured Output Predictors for Natural Language Processing
2020 - Selective Question Answering under Domain Shift
Lahlou et al 2021 - DEUP: Direct Epistemic Uncertainty Prediction
Zerva et al 2022 - Better Uncertainty Quantification for Machine Translation Evaluation See the related work
Xu et al 2025 - Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Khanmohammadi et al 2025 - Calibrating LLM Confidence by Probing Perturbed Representation Stability

Other Areas

Liang et al 2017 - Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

NLP Wiki

Table of Contents

Confidence

Evaluation Measures

In NLP

Other Areas