Table of Contents
Data Augmentation in NLP
Overviews
Papers
Synthetic Data Augmentation or Generation
LLM / Prompt-Based Data Augmentation or Generation
Related Pages
Data Augmentation in NLP
Overviews
Feng et al 2021 - A Survey of Data Augmentation Approaches for NLP
Papers
Shen et al 2020 - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation
Synthetic Data Augmentation or Generation
Jia & Liang 2016 - Data Recombination for Neural Semantic Parsing
Andreas 2019 - Good-Enough Compositional Data Augmentation
GECA method
Guo et al 2020 - Sequence-Level Mixed Sample Data Augmentation
LLM / Prompt-Based Data Augmentation or Generation
Also known as synthetic data generation. For evaluation, see Evaluation with Large Language Models.
Overviews
Tan et al 2024 - Large Language Models for Data Annotation and Synthesis: A Survey
Long et al 2024 - On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Really great survey with lots of practical advice
2024 - Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop (slides)
Wang et al 2021 - Want To Reduce Labeling Cost? GPT-3 Can Help
Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks
Ding et al 2022 - Is GPT-3 a Good Data Annotator?
Meng et al 2022 - Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning
Guo et al 2023 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Zhang & Pavlick 2024 - Does Training on Synthetic Data Make Models Less Robust?
Patel et al 2025 - How to Get Your LLM to Generate Challenging Problems for Evaluation
Related Pages
Crowdsourcing
Dataset Creation
ML - Data Augmentation
Semantic Parsing - Data Augmentation