Data Preparation
Creating a Train/Dev/Test Split
Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so you can compare against previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments. If you do need to create your own train/dev/test split, here are some things to consider:
- Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing which documents go into each part of the train/dev/test split, not which sentences.
- Do not use n-fold cross-validation across sentences. NLP data is highly non-IID because sentences in context are highly related to each other, so random splitting or n-fold cross-validation at the sentence level will over-estimate the performance of the method.
- Sometimes it's a good idea to split by date, so that the train, dev, and test data are chronologically ordered. This setup is the most realistic setting for a deployed system, which must predict on future data.
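The document-level splitting advice above can be sketched as follows. This is a minimal illustration, not code from this wiki: the function name, the 80/10/10 ratios, and the `{doc_id: sentences}` input format are all assumptions for the example.

```python
import random

def split_by_document(docs, train_frac=0.8, dev_frac=0.1, seed=13):
    """Split a {doc_id: [sentences]} mapping at the document level.

    Shuffles document IDs (not sentences), so every sentence from a
    given document lands in exactly one of train/dev/test.
    """
    ids = sorted(docs)                 # deterministic base order
    random.Random(seed).shuffle(ids)   # seeded shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    chosen = {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }
    # Flatten each split back to a sentence list.
    return {name: [s for d in doc_ids for s in docs[d]]
            for name, doc_ids in chosen.items()}
```

Splitting sentence IDs instead of document IDs at the shuffle step is exactly the mistake the bullet points warn against.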
Papers
- Gorman & Bedrick 2019 - We need to talk about standard splits. Bad paper, DO NOT USE. Random splits are bad for NLP since the data is highly dependent and not IID; random splits reward overfitting the data. See Søgaard 2020 below.
Tokenization
See Tokenization.
Related Pages