====== Data Augmentation in NLP ======

===== Overviews =====

  * [[https://arxiv.org/pdf/2105.03075.pdf|Feng et al 2021 - A Survey of Data Augmentation Approaches for NLP]]

===== Papers =====

  * [[https://arxiv.org/pdf/2009.13818.pdf|Shen et al 2020 - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation]]

===== Synthetic Data Augmentation or Generation =====

  * [[https://arxiv.org/pdf/1606.03622.pdf|Jia & Liang 2016 - Data Recombination for Neural Semantic Parsing]]
  * [[https://arxiv.org/pdf/1904.09545.pdf|Andreas 2019 - Good-Enough Compositional Data Augmentation]] (the GECA method)
  * [[https://arxiv.org/pdf/2011.09039.pdf|Guo et al 2020 - Sequence-Level Mixed Sample Data Augmentation]]

===== LLM / Prompt-Based Data Augmentation or Generation =====

Aka **synthetic data generation**. For evaluation, see [[Evaluation#Evaluation with Large Language Models]].

  * **Overviews**
    * [[https://arxiv.org/pdf/2402.13446|Tan et al 2024 - Large Language Models for Data Annotation and Synthesis: A Survey]]
    * **[[https://aclanthology.org/2024.findings-acl.658.pdf|Long et al 2024 - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey]]** Really great survey, lots of practical advice
    * [[https://arxiv.org/pdf/2411.04637|2024 - Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop]] ([[https://docs.google.com/presentation/d/1vum7or5PqLCE6MbbH2KnJrwJ2uzLMEpc19M6Nzuy_0I/edit?slide=id.p#slide=id.p|slides]] from [[https://toloka.ai/events/toloka-ai-coling-2025-human-w-llm-tutorial|here]])
  * [[https://arxiv.org/pdf/2108.13487.pdf|Wang et al 2021 - Want To Reduce Labeling Cost? GPT-3 Can Help]]
  * [[https://arxiv.org/pdf/2202.12499.pdf|Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks]]
  * [[https://arxiv.org/pdf/2212.10450.pdf|Ding et al 2022 - Is GPT-3 a Good Data Annotator?]]
  * [[https://arxiv.org/pdf/2211.03044.pdf|Meng et al 2022 - Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning]]
  * [[https://arxiv.org/pdf/2311.09807|Guo et al 2023 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text]]
  * [[https://arxiv.org/pdf/2502.07164|Zhang & Pavlick 2024 - Does Training on Synthetic Data Make Models Less Robust?]]
  * [[https://arxiv.org/pdf/2502.14678|Patel et al 2025 - How to Get Your LLM to Generate Challenging Problems for Evaluation]]

===== Related Pages =====

  * [[Crowdsourcing]]
  * [[Dataset Creation]]
  * [[ml:Data Augmentation|ML - Data Augmentation]]
  * [[nlp:semantic_parsing#data_augmentation|Semantic Parsing - Data Augmentation]]