nlp:data_augmentation
  * [[https://arxiv.org/pdf/2011.09039.pdf|Guo et al 2020 - Sequence-Level Mixed Sample Data Augmentation]]
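Mixed-sample augmentation in the mixup family (the idea behind the Guo et al 2020 paper above) builds new training points as convex combinations of pairs of real examples. A minimal illustrative sketch on padded sequence embeddings — the shapes, the Beta parameter, and the function names here are my own choices, not from any of the cited papers:

```python
import numpy as np

def mixup_pair(emb_a, emb_b, label_a, label_b, alpha=0.2, rng=None):
    """Interpolate two padded sequence-embedding tensors and their
    soft/one-hot label vectors with one lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label, lam

# Toy usage: two sequences of length 4 with 8-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
ya, yb = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
emb, label, lam = mixup_pair(a, b, ya, yb, rng=rng)
```

Sequence-level variants differ mainly in *where* they mix (embedding space, hidden states, or spans of discrete tokens) and in how they align sequences of different lengths; the interpolation itself stays this simple.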
  
===== LLM / Prompt-Based Data Augmentation or Generation =====
Aka **synthetic data generation**. For evaluation, see [[Evaluation#Evaluation with Large Language Models]].
  * **Overviews**
    * [[https://arxiv.org/pdf/2402.13446|Tan et al 2024 - Large Language Models for Data Annotation and Synthesis: A Survey]]
    * **[[https://aclanthology.org/2024.findings-acl.658.pdf|Long et al 2024 - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey]]** Really great survey with lots of practical advice
    * [[https://arxiv.org/pdf/2411.04637|2024 - Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop]] ([[https://docs.google.com/presentation/d/1vum7or5PqLCE6MbbH2KnJrwJ2uzLMEpc19M6Nzuy_0I/edit?slide=id.p#slide=id.p|slides]] from [[https://toloka.ai/events/toloka-ai-coling-2025-human-w-llm-tutorial|the tutorial page]])
  * [[https://arxiv.org/pdf/2108.13487.pdf|Wang et al 2021 - Want To Reduce Labeling Cost? GPT-3 Can Help]]
  * [[https://arxiv.org/pdf/2202.12499.pdf|Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks]]
  * [[https://arxiv.org/pdf/2212.10450.pdf|Ding et al 2022 - Is GPT-3 a Good Data Annotator?]]
  * [[https://arxiv.org/pdf/2211.03044.pdf|Meng et al 2022 - Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning]]
  * [[https://arxiv.org/pdf/2311.09807|Guo et al 2023 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text]]
  * [[https://arxiv.org/pdf/2502.07164|Zhang & Pavlick 2024 - Does Training on Synthetic Data Make Models Less Robust?]]
  * [[https://arxiv.org/pdf/2502.14678|Patel et al 2025 - How to Get Your LLM to Generate Challenging Problems for Evaluation]]
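The prompt-based approaches above mostly share a simple skeleton: pack a handful of labeled seed examples into a few-shot prompt, ask an LLM to emit a new example for a target label, and parse the completion back into a (text, label) pair. A minimal sketch of that loop — the prompt format is one arbitrary choice among many, and `call_llm` stands in for whatever model API you actually use (it is not defined here):

```python
def build_prompt(seed_examples, target_label):
    """Build a few-shot prompt asking for one new example of target_label.

    seed_examples: list of (text, label) pairs used as demonstrations.
    """
    lines = ["Each line below is a labeled example. Generate one new example.", ""]
    for text, label in seed_examples:
        lines.append(f"Label: {label}\tText: {text}")
    lines.append(f"Label: {target_label}\tText:")
    return "\n".join(lines)

def parse_completion(completion, target_label):
    """Map a raw model completion back to a (text, label) pair.

    Keeps only the first line and drops empty generations, since models
    often continue past the requested example.
    """
    stripped = completion.strip()
    text = stripped.splitlines()[0].strip() if stripped else ""
    return (text, target_label) if text else None

seeds = [("the movie was wonderful", "positive"),
         ("a dull, lifeless film", "negative")]
prompt = build_prompt(seeds, "negative")
# completion = call_llm(prompt)  # hypothetical model call
new_example = parse_completion("tedious from start to finish\n", "negative")
```

In practice the surveys above stress the step this sketch omits: curation — deduplicating, filtering degenerate or off-label generations, and checking diversity — since raw LLM output skews repetitive (see Guo et al 2023 above).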
  
===== Related Pages =====
  * [[Crowdsourcing]]
  * [[Dataset Creation]]
  * [[ml:Data Augmentation|ML - Data Augmentation]]
  * [[nlp:semantic_parsing#data_augmentation|Semantic Parsing - Data Augmentation]]
  
nlp/data_augmentation.txt · Last modified: 2025/05/21 19:53 by jmflanig
