====== Evaluation ======

===== Natural Language Output =====

To evaluate natural language output, researchers often use automatic metrics such as BLEU (originally designed for machine translation) together with human evaluation; for summarization, ROUGE is the usual automatic metric. A minimal sketch of computing both metrics is given in the code sketches at the end of this page.

See also Generation - [[Generation#Evaluation]], Machine Translation - [[Machine Translation#Evaluation]], and Dialog - [[Dialog#Evaluation]].

===== Papers =====

  * [[https://aclanthology.org/2021.acl-long.346.pdf|Rodriguez et al 2021 - Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?]]

===== Evaluation with Large Language Models =====

  * **Overviews**
    * [[https://arxiv.org/pdf/2411.15594|Gu et al 2024 - A Survey on LLM-as-a-Judge]]
    * Blog: [[https://eugeneyan.com/writing/llm-evaluators/|2024 - Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]]
  * [[https://arxiv.org/pdf/2305.17926|Wang et al 2023 - Large Language Models are not Fair Evaluators]] (a position-swap mitigation for the position bias reported here is sketched at the end of this page)
  * [[https://arxiv.org/pdf/2305.01937.pdf|Chiang & Lee 2023 - Can Large Language Models Be an Alternative to Human Evaluation?]]
  * **[[https://arxiv.org/pdf/2306.05685.pdf|Zheng et al 2023 - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena]]**
  * **[[https://arxiv.org/pdf/2404.13076|Panickssery et al 2024 - LLM Evaluators Recognize and Favor Their Own Generations]]**
  * [[https://arxiv.org/pdf/2505.20738|Yuan et al 2025 - Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator]]

===== Robust Evaluation =====

  * **[[https://arxiv.org/pdf/2005.04118.pdf|Ribeiro et al 2020 - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList]]** A very good paper; it won the Best Paper Award at ACL 2020. (an invariance test in the spirit of CheckList is sketched at the end of this page)

===== Related Pages =====

  * [[Experimental Method|Experimental Method and Reproducibility]]
  * Natural Language Output
    * Generation - [[Generation#Evaluation]]
    * Machine Translation - [[Machine Translation#Evaluation]]
    * Dialog - [[Dialog#Evaluation]]
    * Question Answering - [[Question Answering#Evaluation]]
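
===== Code Sketches =====

A minimal sketch of the automatic metrics mentioned under Natural Language Output. It assumes the third-party packages ''sacrebleu'' and ''rouge-score'' (''pip install sacrebleu rouge-score''); these are common implementations of BLEU and ROUGE, not the only ones, and the toy data is illustrative:

<code python>
import sacrebleu
from rouge_score import rouge_scorer

# Toy data: one system output with one reference per segment.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # outer list = reference sets

# Corpus-level BLEU; sacrebleu expects detokenized strings.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-scores, the usual summarization metrics.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0][0], hypotheses[0])  # (target, prediction)
print(f"ROUGE-1 F: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F: {rouge['rougeL'].fmeasure:.3f}")
</code>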
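
Wang et al 2023 report that pairwise LLM judges are sensitive to the order in which the two candidate answers are presented. A common mitigation, also used by Zheng et al 2023, is to query the judge in both orders and only accept a consistent verdict. Below is a minimal sketch of that loop; the OpenAI client, model name, and prompt wording are illustrative assumptions, not prescribed by the papers:

<code python>
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two candidate "
    "answers, reply with exactly 'A', 'B', or 'TIE'.\n\n"
    "Question: {question}\n\nAnswer A: {a}\n\nAnswer B: {b}"
)

def judge_once(question: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any strong judge model works
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
    )
    return resp.choices[0].message.content.strip().upper()

def judge_pair(question: str, ans1: str, ans2: str) -> str:
    """Query the judge in both orders; only a consistent verdict counts."""
    first = judge_once(question, ans1, ans2)   # ans1 shown in position A
    second = judge_once(question, ans2, ans1)  # ans1 shown in position B
    if first == "A" and second == "B":
        return "answer 1 wins"
    if first == "B" and second == "A":
        return "answer 2 wins"
    return "tie / position-dependent"  # disagreement signals position bias
</code>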
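
Ribeiro et al 2020 release their own ''checklist'' package; the hand-rolled sketch below only illustrates one of their test types, the invariance (INV) test: a label-preserving perturbation such as a typo should not change a model's prediction. The perturbation choice and helper names here are illustrative assumptions, not the authors' API:

<code python>
import random

def perturb_with_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters (simple label-preserving noise)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def invariance_test(predict, inputs, n_perturbations=5, seed=0):
    """Report the rate at which predictions flip under perturbation."""
    rng = random.Random(seed)
    failures = 0
    for text in inputs:
        original = predict(text)
        for _ in range(n_perturbations):
            if predict(perturb_with_typo(text, rng)) != original:
                failures += 1
                break  # one flip is enough to fail this example
    return failures / len(inputs)

# Usage, with any classifier exposed as a text -> label function:
#   fail_rate = invariance_test(my_sentiment_model, test_sentences)
#   print(f"{fail_rate:.1%} of examples change label under typos")
</code>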