To evaluate natural language output, researchers commonly use human evaluation or automatic overlap metrics such as BLEU, which scores n-gram precision against reference texts. For summarization, the recall-oriented ROUGE family is the standard automatic metric.
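As a minimal sketch of how these metrics are computed in practice, the snippet below scores one hypothesis against one reference using `nltk` for BLEU and Google's `rouge-score` package for ROUGE. The whitespace tokenization and the example sentences are illustrative choices, not prescribed by either metric.

```python
# Minimal sketch: scoring one hypothesis against one reference with
# BLEU (via nltk) and ROUGE (via the rouge-score package).
# Assumes `pip install nltk rouge-score`; whitespace tokenization is
# a simplification chosen to keep the example short.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat"

# BLEU takes token lists; references is a list of lists because
# multiple references per hypothesis are allowed.
bleu = sentence_bleu(
    [reference.split()],
    hypothesis.split(),
    # Smoothing avoids zero scores when a higher-order n-gram is absent,
    # which is common for short texts.
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu:.3f}")

# ROUGE compares raw strings; ROUGE-1 counts unigram overlap and
# ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```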
See also Generation - Evaluation, Machine Translation - Evaluation, and Dialog - Evaluation.