====== Evaluation ======
  
===== Natural Language Output =====
  
See also Generation - [[Generation#Evaluation]], Machine Translation - [[Machine Translation#Evaluation]], and Dialog - [[Dialog#Evaluation]].

===== Papers =====
  * [[https://aclanthology.org/2021.acl-long.346.pdf|Rodriguez et al 2021 - Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?]]

===== Evaluation with Large Language Models =====
  * **Overviews**
    * [[https://arxiv.org/pdf/2411.15594|Gu et al 2024 - A Survey on LLM-as-a-Judge]]
    * Blog: [[https://eugeneyan.com/writing/llm-evaluators/|2024 - Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]]
  * [[https://arxiv.org/pdf/2305.17926|Wang et al 2023 - Large Language Models are not Fair Evaluators]]
  * [[https://arxiv.org/pdf/2305.01937.pdf|Chiang & Lee 2023 - Can Large Language Models Be an Alternative to Human Evaluation?]]
  * **[[https://arxiv.org/pdf/2306.05685.pdf|Zheng et al 2023 - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena]]**
  * **[[https://arxiv.org/pdf/2404.13076|Panickssery et al 2024 - LLM Evaluators Recognize and Favor Their Own Generations]]**
  * [[https://arxiv.org/pdf/2505.20738|Yuan et al 2025 - Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator]]

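A minimal sketch of the pairwise LLM-as-a-judge setup studied in the papers above, assuming a hypothetical ''call_llm(prompt)'' helper that returns the judge model's text response (any chat API would do). Judging each pair in both presentation orders is one simple way to reduce the position bias these papers report.

<code python>
# Minimal pairwise LLM-as-a-judge sketch (illustrative only).
# Assumes a hypothetical call_llm(prompt) -> str helper.

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" if Answer A is better, "B" if Answer B is better,
or "TIE" if they are equally good.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Verdict:"""

def judge_once(call_llm, question, answer_a, answer_b):
    """Ask the judge model for a single A/B/TIE verdict."""
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"

def judge_pair(call_llm, question, ans_1, ans_2):
    """Judge in both presentation orders to reduce position bias
    (LLM judges tend to favor one slot over the other)."""
    first = judge_once(call_llm, question, ans_1, ans_2)   # ans_1 shown as A
    second = judge_once(call_llm, question, ans_2, ans_1)  # ans_1 shown as B
    second = {"A": "B", "B": "A", "TIE": "TIE"}[second]    # map back to original labels
    return first if first == second else "TIE"             # disagreement across orders -> tie
</code>

Note the further caveat from Panickssery et al 2024: if the judge model also produced one of the candidate answers, it tends to favor its own output, so self-judging comparisons need extra care.
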
===== Robust Evaluation =====
  * **[[https://arxiv.org/pdf/2005.04118.pdf|Ribeiro et al 2020 - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList]]** Very good paper, best paper award at ACL 2020.

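As an illustration of the behavioral-testing idea (not the CheckList library's actual API), a minimal invariance test: perturbations that should not change the prediction, such as swapping in different person names, applied to a hypothetical ''predict_sentiment(texts)'' function.

<code python>
# Minimal behavioral (invariance) test in the spirit of CheckList (illustrative only).
# Assumes a hypothetical predict_sentiment(texts) -> list of labels;
# the real CheckList library provides much richer templates and test types.

TEMPLATE = "{name} booked a table at the restaurant and the food was wonderful."
NAMES = ["John", "Maria", "Wei", "Aisha", "Carlos"]

def invariance_test(predict_sentiment):
    """Changing only the person name should not change the predicted sentiment."""
    texts = [TEMPLATE.format(name=name) for name in NAMES]
    preds = predict_sentiment(texts)
    passed = len(set(preds)) == 1  # all perturbations received the same label
    return passed, list(zip(texts, preds))
</code>
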
===== Related Pages =====
  * [[Experimental Method|Experimental Method and Reproducibility]]
  * Natural Language Output
    * Generation - [[Generation#Evaluation]]
    * Machine Translation - [[Machine Translation#Evaluation]]
    * Dialog - [[Dialog#Evaluation]]
  * Question Answering - [[Question Answering#Evaluation]]
  
  