Data Augmentation in NLP

Overviews

Feng et al 2021 - A Survey of Data Augmentation Approaches for NLP

Papers

Shen et al 2020 - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

Synthetic Data Augmentation or Generation

LLM / Prompt-Based Data Augmentation or Generation

Aka synthetic data generation. For evaluation, see Evaluation with Large Language Models.

Overviews
- Tan et al 2024 - Large Language Models for Data Annotation and Synthesis: A Survey
- Long et al 2024 - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. Really great survey with lots of practical advice.
- 2024 - Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop (slides)
Wang et al 2021 - Want To Reduce Labeling Cost? GPT-3 Can Help
Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks
Ding et al 2022 - Is GPT-3 a Good Data Annotator?
Meng et al 2022 - Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning
Guo et al 2023 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Zhang & Pavlick 2024 - Does Training on Synthetic Data Make Models Less Robust?
Patel et al 2025 - How to Get Your LLM to Generate Challenging Problems for Evaluation

Related Pages