====== Evaluation ======

===== Natural Language Output =====
To evaluate natural language output, researchers often use BLEU or human evaluation. For summarization, they often use ROUGE.
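
For a quick sense of how these metrics are computed in practice, here is a minimal sketch assuming the third-party ''sacrebleu'' and ''rouge_score'' Python packages (one common choice, not something this page prescribes):

<code python>
# Corpus-level BLEU via sacrebleu and ROUGE via Google's rouge_score package.
# The example sentences are made up.
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the cat sat on the mat"]          # system outputs
references = ["the cat was sitting on the mat"]  # one reference per output

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], hypotheses[0])
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
</code>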
  
See also Generation - [[Generation#Evaluation]], Machine Translation - [[Machine Translation#Evaluation]], and Dialog - [[Dialog#Evaluation]].

===== Papers =====
  * [[https://aclanthology.org/2021.acl-long.346.pdf|Rodriguez et al 2021 - Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?]]

===== Evaluation with Large Language Models =====
  * **Overviews**
    * [[https://arxiv.org/pdf/2411.15594|Gu et al 2024 - A Survey on LLM-as-a-Judge]]
    * Blog: [[https://eugeneyan.com/writing/llm-evaluators/|2024 - Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]]
  * [[https://arxiv.org/pdf/2305.17926|Wang et al 2023 - Large Language Models are not Fair Evaluators]]
  * [[https://arxiv.org/pdf/2305.01937.pdf|Chiang & Lee 2023 - Can Large Language Models Be an Alternative to Human Evaluation?]]
  * **[[https://arxiv.org/pdf/2306.05685.pdf|Zheng et al 2023 - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena]]**
  * **[[https://arxiv.org/pdf/2404.13076|Panickssery et al 2024 - LLM Evaluators Recognize and Favor Their Own Generations]]**
  * [[https://arxiv.org/pdf/2505.20738|Yuan et al 2025 - Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator]]
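
The pattern behind most of this LLM-as-a-judge work is a short prompting loop. Below is a rough sketch of pairwise judging with the answer order swapped to reduce position bias (the failure mode studied by Wang et al 2023); the ''openai'' client, model name, and prompt are illustrative assumptions, not taken from the papers above.

<code python>
# Minimal LLM-as-a-judge sketch: a pairwise comparison, asked twice with the
# answer order swapped so position bias alone cannot decide the outcome.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are judging two answers to the same question.\n"
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Reply with exactly one letter: A or B."
)

def judge_once(question, a, b, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
    )
    return resp.choices[0].message.content.strip().upper()[:1]

def judge_pair(question, answer_1, answer_2):
    first = judge_once(question, answer_1, answer_2)   # answer_1 shown as A
    second = judge_once(question, answer_2, answer_1)  # answer_1 shown as B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # orderings disagree -> treat as a tie
</code>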

===== Robust Evaluation =====
  * **[[https://arxiv.org/pdf/2005.04118.pdf|Ribeiro et al 2020 - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList]]** Very good paper; it won the best paper award at ACL 2020.
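
As a loose illustration of CheckList-style behavioral testing (not the actual CheckList library), one can wrap any prediction function in small minimum-functionality and invariance tests; ''predict_sentiment'' here is a hypothetical stand-in for the model being evaluated.

<code python>
# Toy behavioral tests in the spirit of CheckList (Ribeiro et al 2020):
# each function returns the test cases that failed.
def predict_sentiment(text: str) -> str:
    raise NotImplementedError  # plug your model in here

def mft_negation():
    # Minimum functionality test: simple negation should read as negative.
    cases = {"The food was not good.": "negative",
             "I do not like this movie.": "negative"}
    return [(text, expected) for text, expected in cases.items()
            if predict_sentiment(text) != expected]

def inv_name_change():
    # Invariance test: swapping a person's name should not change the label.
    pairs = [("Maria loved the service.", "Amir loved the service.")]
    return [(a, b) for a, b in pairs
            if predict_sentiment(a) != predict_sentiment(b)]
</code>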

See also Generation - [[Generation#Evaluation]], Machine Translation - [[Machine Translation#Evaluation]], and Dialog - [[Dialog#Evaluation]].

===== Related Pages =====
  * [[Experimental Method|Experimental Method and Reproducibility]]
  * Natural Language Output
    * Generation - [[Generation#Evaluation]]
    * Machine Translation - [[Machine Translation#Evaluation]]
    * Dialog - [[Dialog#Evaluation]]
  * Question Answering - [[Question Answering#Evaluation]]
  
  