Data Preparation

Creating a Train/Dev/Test Split

Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so you can compare against previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments. If you need to create your own train/dev/test split, here are some things to consider:

Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing the documents that go into the train/dev/test split, not the sentences.
Do not use n-fold cross-validation across sentences. NLP data is highly non-i.i.d. because sentences from the same context are strongly related to each other. Random sentence-level splitting or n-fold cross-validation will overestimate the performance of the model.
Sometimes it's a good idea to split by date, so that the train, dev, and test data are chronologically ordered (train earliest, test latest). This is the most realistic setting for a deployed system, which is trained on past data and applied to later data.
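
The considerations above can be sketched in a few lines of Python. This is a minimal illustration, not a standard recipe: the function names split_by_document and split_by_date, the (doc_id, sentence) and (date, document) input formats, and the default split fractions are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_document(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (doc_id, sentence) pairs so each document lands in exactly one split.

    The input format and the default fractions are assumptions for this sketch.
    """
    # Group sentences by the document they came from.
    by_doc = defaultdict(list)
    for doc_id, sentence in sentences:
        by_doc[doc_id].append(sentence)

    # Shuffle the document ids, not the individual sentences.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_dev = int(len(doc_ids) * dev_frac)
    n_test = int(len(doc_ids) * test_frac)
    dev_ids = doc_ids[:n_dev]
    test_ids = doc_ids[n_dev:n_dev + n_test]
    train_ids = doc_ids[n_dev + n_test:]

    def gather(ids):
        # Collect all sentences belonging to the given documents.
        return [(d, s) for d in ids for s in by_doc[d]]

    return gather(train_ids), gather(dev_ids), gather(test_ids)


def split_by_date(documents, dev_start, test_start):
    """Split (date, document) pairs chronologically.

    Dates only need to be comparable (e.g. ISO "YYYY-MM-DD" strings or
    datetime.date objects). Train is before dev_start, dev is in
    [dev_start, test_start), and test is on or after test_start.
    """
    train = [d for d in documents if d[0] < dev_start]
    dev = [d for d in documents if dev_start <= d[0] < test_start]
    test = [d for d in documents if d[0] >= test_start]
    return train, dev, test
```

For example, split_by_document(pairs, dev_frac=0.2, test_frac=0.2) returns three lists of (doc_id, sentence) pairs in which no doc_id appears in more than one list. Fixing the seed keeps the split reproducible, which matters if others will compare against your numbers.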