Evaluation

Natural Language Output

To evaluate natural language output, researchers commonly rely on automatic metrics such as BLEU, or on human evaluation. For summarization, ROUGE is the usual automatic metric. Both metrics can be computed as sketched below.
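As a concrete illustration, here is a minimal sketch of computing both metrics in Python. The sacrebleu and rouge-score packages are an assumed implementation choice (pip install sacrebleu rouge-score); the page itself does not prescribe any particular toolkit.

# Minimal sketch: corpus-level BLEU plus per-example ROUGE.
# Package choice (sacrebleu, rouge-score) is illustrative, not prescribed here.
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the cat sat on the mat"]      # system outputs
references = ["there is a cat on the mat"]   # one reference per output

# Corpus-level BLEU: sacrebleu expects a list of reference streams,
# so a single reference set is wrapped in an outer list.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE (common for summarization), scored per example here.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for hyp, ref in zip(hypotheses, references):
    scores = scorer.score(ref, hyp)   # signature: score(target, prediction)
    print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}, "
          f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")

In practice, BLEU is reported at the corpus level over a full test set, while ROUGE scores are typically averaged over per-example scores.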

See also Generation - Evaluation, Machine Translation - Evaluation, and Dialog - Evaluation.

Papers

Evaluation with Large Language Models

Robust Evaluation
