nlp:evaluation

  
===== Evaluation with Large Language Models =====
  * **Overviews**
    * [[https://arxiv.org/pdf/2411.15594|Gu et al 2024 - A Survey on LLM-as-a-Judge]]
    * Blog: [[https://eugeneyan.com/writing/llm-evaluators/|2024 - Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]]
  * [[https://arxiv.org/pdf/2305.17926|Wang et al 2023 - Large Language Models are not Fair Evaluators]]
  * [[https://arxiv.org/pdf/2305.01937.pdf|Chiang & Lee 2023 - Can Large Language Models Be an Alternative to Human Evaluation?]]
  * **[[https://arxiv.org/pdf/2306.05685.pdf|Zheng et al 2023 - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena]]**
  * **[[https://arxiv.org/pdf/2404.13076|Panickssery et al 2024 - LLM Evaluators Recognize and Favor Their Own Generations]]**
  * [[https://arxiv.org/pdf/2505.20738|Yuan et al 2025 - Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator]]
  
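Several of the papers above (notably Wang et al 2023) examine position bias in pairwise LLM-as-judge setups: the judge's verdict can flip depending on which answer is shown first. A common mitigation is to query the judge twice with the answer order swapped and only accept a winner when both orderings agree. The sketch below is a minimal, hypothetical illustration of that idea; ''call_judge'' stands in for a real LLM API call and the prompt wording is invented, not taken from any of the cited papers.

```python
# Hypothetical sketch: position-debiased pairwise LLM-as-judge comparison.
# `call_judge` is a placeholder for a real LLM call that returns "A", "B", or "tie".

def build_prompt(question, answer_a, answer_b):
    # Invented prompt format for illustration only.
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with 'A', 'B', or 'tie'."
    )

def debiased_verdict(question, ans1, ans2, call_judge):
    """Return 1 if ans1 wins, 2 if ans2 wins, 0 for tie/inconsistency."""
    # Query twice, swapping which answer appears first, to cancel position bias.
    v1 = call_judge(build_prompt(question, ans1, ans2))  # ans1 shown as "A"
    v2 = call_judge(build_prompt(question, ans2, ans1))  # ans2 shown as "A"
    # Map each verdict back to the underlying answer's identity.
    first = {"A": 1, "B": 2}.get(v1)
    second = {"A": 2, "B": 1}.get(v2)
    if first is not None and first == second:
        return first  # consistent winner across both orderings
    return 0          # tie, or the orderings disagree -> treat as a tie
```

A judge that always answers "A" regardless of content (pure position bias) yields a tie here, since its two verdicts map to different answers once the order is swapped.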
===== Robust Evaluation =====
nlp/evaluation.1722569980.txt.gz · Last modified: 2024/08/02 03:39 by jmflanig
