Data Preparation
Creating a Train/Dev/Test Split
Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so you can compare against previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments. If you do need to create your own train/dev/test split, here are some things to consider:
- Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing the documents that go into the train/dev/test split, not the sentences.
- Do not use n-fold cross-validation across sentences. NLP data is highly non-IID because sentences in context are strongly related to each other. Random splitting or sentence-level n-fold cross-validation will over-estimate the performance of the method.
- Sometimes it's a good idea to split by date, so you have train, dev, test data chronologically ordered. This setup is the most realistic setting for a deployed system.
- Geva et al 2019 argue that test set annotators should be disjoint from training set annotators.
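The document-level randomization above can be sketched as follows. This is a minimal illustration, not a standard library function: the function name and the 80/10/10 fractions are my own choices, and you would adapt them to your dataset.

```python
import random

def document_level_split(sentences, doc_ids, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split sentences into train/dev/test by whole documents, so that
    no document contributes sentences to more than one partition."""
    # Shuffle the *documents*, not the sentences (illustrative seed for reproducibility).
    docs = sorted(set(doc_ids))
    random.Random(seed).shuffle(docs)

    # Reserve whole documents for test and dev; the rest go to train.
    n_test = max(1, int(len(docs) * test_frac))
    n_dev = max(1, int(len(docs) * dev_frac))
    test_docs = set(docs[:n_test])
    dev_docs = set(docs[n_test:n_test + n_dev])

    split = {"train": [], "dev": [], "test": []}
    for sent, doc in zip(sentences, doc_ids):
        if doc in test_docs:
            split["test"].append(sent)
        elif doc in dev_docs:
            split["dev"].append(sent)
        else:
            split["train"].append(sent)
    return split
```

For a chronological split, you would sort the documents by date instead of shuffling them, then take the most recent documents as dev and test.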
Papers
- Gorman & Bedrick 2019 - We Need to Talk About Standard Splits Bad paper, DO NOT USE. See Søgaard et al 2020 below.
- Søgaard et al 2020 - We Need to Talk About Random Splits Finds that both random and standard splits reward overfitting the training data.
- Geva et al 2019 - Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets Argues that test set annotators should be disjoint from training set annotators.
Tokenization
See Tokenization.
Related Pages
nlp/data_preparation.txt · Last modified: 2023/06/15 07:36 by 127.0.0.1