nlp:data_augmentation
  * [[https://arxiv.org/pdf/2011.09039.pdf|Guo et al 2020 - Sequence-Level Mixed Sample Data Augmentation]]
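Mixed-sample augmentation in the mixup family (the idea behind the Guo et al 2020 paper above) builds new training points as convex combinations of pairs of real examples. A minimal illustrative sketch on padded sequence embeddings — the shapes, the Beta parameter, and the function names here are my own choices, not from any of the cited papers:

```python
import numpy as np

def mixup_pair(emb_a, emb_b, label_a, label_b, alpha=0.2, rng=None):
    """Interpolate two padded sequence-embedding tensors and their
    soft/one-hot label vectors with one lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label, lam

# Toy usage: two sequences of length 4 with 8-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
ya, yb = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
emb, label, lam = mixup_pair(a, b, ya, yb, rng=rng)
```

Sequence-level variants differ mainly in *where* they mix (embedding space, hidden states, or spans of discrete tokens) and in how they align sequences of different lengths; the interpolation itself stays this simple.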
  
===== LLM / Prompt-Based Data Augmentation or Generation =====
Aka **synthetic data generation**. For evaluation, see [[Evaluation#Evaluation with Large Language Models]].
  * **Overviews**
    * [[https://arxiv.org/pdf/2402.13446|Tan et al 2024 - Large Language Models for Data Annotation and Synthesis: A Survey]]
    * **[[https://aclanthology.org/2024.findings-acl.658.pdf|Long et al 2024 - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey]]** Really great survey with lots of practical advice
    * [[https://arxiv.org/pdf/2411.04637|2024 - Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop]] ([[https://docs.google.com/presentation/d/1vum7or5PqLCE6MbbH2KnJrwJ2uzLMEpc19M6Nzuy_0I/edit?slide=id.p#slide=id.p|slides]] from [[https://toloka.ai/events/toloka-ai-coling-2025-human-w-llm-tutorial|the tutorial page]])
  * [[https://arxiv.org/pdf/2108.13487.pdf|Wang et al 2021 - Want To Reduce Labeling Cost? GPT-3 Can Help]]
  * [[https://arxiv.org/pdf/2202.12499.pdf|Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks]]
  * [[https://arxiv.org/pdf/2212.10450.pdf|Ding et al 2022 - Is GPT-3 a Good Data Annotator?]]
  * [[https://arxiv.org/pdf/2211.03044.pdf|Meng et al 2022 - Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning]]
  * [[https://arxiv.org/pdf/2311.09807|Guo et al 2023 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text]]
  * [[https://arxiv.org/pdf/2502.07164|Zhang & Pavlick 2024 - Does Training on Synthetic Data Make Models Less Robust?]]
  * [[https://arxiv.org/pdf/2502.14678|Patel et al 2025 - How to Get Your LLM to Generate Challenging Problems for Evaluation]]
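The prompt-based approaches above mostly share a simple skeleton: pack a handful of labeled seed examples into a few-shot prompt, ask an LLM to emit a new example for a target label, and parse the completion back into a (text, label) pair. A minimal sketch of that loop — the prompt format is one arbitrary choice among many, and `call_llm` stands in for whatever model API you actually use (it is not defined here):

```python
def build_prompt(seed_examples, target_label):
    """Build a few-shot prompt asking for one new example of target_label.

    seed_examples: list of (text, label) pairs used as demonstrations.
    """
    lines = ["Each line below is a labeled example. Generate one new example.", ""]
    for text, label in seed_examples:
        lines.append(f"Label: {label}\tText: {text}")
    lines.append(f"Label: {target_label}\tText:")
    return "\n".join(lines)

def parse_completion(completion, target_label):
    """Map a raw model completion back to a (text, label) pair.

    Keeps only the first line and drops empty generations, since models
    often continue past the requested example.
    """
    stripped = completion.strip()
    text = stripped.splitlines()[0].strip() if stripped else ""
    return (text, target_label) if text else None

seeds = [("the movie was wonderful", "positive"),
         ("a dull, lifeless film", "negative")]
prompt = build_prompt(seeds, "negative")
# completion = call_llm(prompt)  # hypothetical model call
new_example = parse_completion("tedious from start to finish\n", "negative")
```

In practice the surveys above stress the step this sketch omits: curation — deduplicating, filtering degenerate or off-label generations, and checking diversity — since raw LLM output skews repetitive (see Guo et al 2023 above).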
  
===== Related Pages =====
  * [[Crowdsourcing]]
  * [[Dataset Creation]]
  * [[ml:Data Augmentation|ML - Data Augmentation]]
  * [[nlp:semantic_parsing#data_augmentation|Semantic Parsing - Data Augmentation]]
  
nlp/data_augmentation.txt · Last modified: 2025/05/21 19:53 by jmflanig
