
Data Preparation

Creating a Train/Dev/Test Split

Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so that your results can be compared to previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments. If you need to create your own train/dev/test split, here are some things to consider:

  • Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing the documents that go into the train/dev/test split, not the sentences.
  • Do not use n-fold cross-validation across sentences. NLP data is highly non-iid because sentences from the same context are strongly related to each other. Random splitting or n-fold cross-validation at the sentence level will overestimate the performance of the method.
  • Sometimes it's a good idea to split by date, so that the train, dev, and test data are chronologically ordered (oldest data in train, newest in test). This setup is the most realistic setting for a deployed system, which is trained on past data and applied to later data.
  • Geva 2019 argues that test set annotators should be disjoint from training set annotators.
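
The document-level and chronological splits described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a standard recipe: the function names `split_by_document` and `split_by_date`, the `(doc_id, sentence)` and `(date, doc_id)` input layouts, and the 10% dev/test fractions are all assumptions made for the example.

```python
import random


def split_by_document(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (doc_id, sentence) pairs so that each document lands in one split.

    The documents are shuffled, not the sentences, so sentences from the
    same document never end up on both sides of a split boundary.
    """
    doc_ids = sorted({doc_id for doc_id, _ in examples})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(doc_ids)

    # Hypothetical sizing rule: at least one document in dev and in test.
    n_test = max(1, round(len(doc_ids) * test_frac))
    n_dev = max(1, round(len(doc_ids) * dev_frac))
    test_docs = set(doc_ids[:n_test])
    dev_docs = set(doc_ids[n_test:n_test + n_dev])

    splits = {"train": [], "dev": [], "test": []}
    for doc_id, sentence in examples:
        if doc_id in test_docs:
            splits["test"].append((doc_id, sentence))
        elif doc_id in dev_docs:
            splits["dev"].append((doc_id, sentence))
        else:
            splits["train"].append((doc_id, sentence))
    return splits


def split_by_date(dated_docs, dev_frac=0.1, test_frac=0.1):
    """Split (date, doc_id) pairs chronologically: oldest to train, newest to test.

    Dates are assumed to be sortable values, e.g. ISO strings "YYYY-MM-DD".
    """
    ordered = sorted(dated_docs, key=lambda pair: pair[0])
    n = len(ordered)
    n_test = max(1, round(n * test_frac))
    n_dev = max(1, round(n * dev_frac))
    return {
        "train": ordered[: n - n_dev - n_test],
        "dev": ordered[n - n_dev - n_test : n - n_test],
        "test": ordered[n - n_test :],
    }
```

The same grouping idea could be applied to annotators: passing an annotator ID in place of `doc_id` to `split_by_document` would keep each annotator's examples within a single split.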

Papers

Tokenization

nlp/data_preparation.txt · Last modified: 2023/06/15 07:36 by 127.0.0.1
