Evaluation

Natural Language Output

To evaluate natural language output, researchers typically combine automatic metrics with human evaluation. BLEU, which scores n-gram overlap between a generated text and one or more references, is the standard automatic metric for machine translation; for summarization, the recall-oriented ROUGE family is most common.
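Both metric families reduce to counting shared n-grams. Below is a minimal illustrative sketch of a sentence-level BLEU (clipped n-gram precision with a brevity penalty) and ROUGE-1 recall, written in plain Python; it is a simplification for intuition, not the official implementations (use libraries such as sacrebleu or rouge-score in practice):

```python
# Minimal sketches of BLEU-style n-gram precision and ROUGE-1 recall.
# Illustrative simplifications only -- real evaluations should use
# standard implementations (e.g. sacrebleu, rouge-score).
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions, scaled by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())          # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams the candidate recovers."""
    overlap = sum((ngrams(candidate, 1) & ngrams(reference, 1)).values())
    return overlap / max(len(reference), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(bleu(candidate, reference), 3))          # BLEU-2
print(round(rouge1_recall(candidate, reference), 3))  # ROUGE-1 recall
```

Note the asymmetry in the example: BLEU is precision-oriented (how much of the candidate appears in the reference), while ROUGE is recall-oriented (how much of the reference the candidate recovers), which is why ROUGE suits summarization, where covering the source content matters most.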

See also Generation - Evaluation, Machine Translation - Evaluation, and Dialog - Evaluation.

Papers

Evaluation with Large Language Models

Robust Evaluation