Evaluation

Natural Language Output

To evaluate natural language output, researchers typically combine automatic metrics with human evaluation. BLEU, which scores n-gram overlap between a generated text and one or more references, is the standard automatic metric for machine translation; for summarization, the recall-oriented ROUGE family is most common.
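Both metric families reduce to counting shared n-grams. Below is a minimal illustrative sketch of a sentence-level BLEU (clipped n-gram precision with a brevity penalty) and ROUGE-1 recall, written in plain Python; it is a simplification for intuition, not the official implementations (use libraries such as sacrebleu or rouge-score in practice):

```python
# Minimal sketches of BLEU-style n-gram precision and ROUGE-1 recall.
# Illustrative simplifications only -- real evaluations should use
# standard implementations (e.g. sacrebleu, rouge-score).
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions, scaled by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())          # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams the candidate recovers."""
    overlap = sum((ngrams(candidate, 1) & ngrams(reference, 1)).values())
    return overlap / max(len(reference), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(bleu(candidate, reference), 3))          # BLEU-2
print(round(rouge1_recall(candidate, reference), 3))  # ROUGE-1 recall
```

Note the asymmetry in the example: BLEU is precision-oriented (how much of the candidate appears in the reference), while ROUGE is recall-oriented (how much of the reference the candidate recovers), which is why ROUGE suits summarization, where covering the source content matters most.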

See also Generation - Evaluation, Machine Translation - Evaluation, and Dialog - Evaluation.

Papers

Evaluation with Large Language Models

Robust Evaluation