Data Preparation
Creating a Train/Dev/Test Split
Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so you can compare against previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments. If you do need to create your own train/dev/test split, here are some things to consider:
- Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing which documents go into each part of the train/dev/test split, not which sentences.
- Do not use n-fold cross-validation across sentences. NLP data is highly non-IID because sentences in context are highly related to each other, so random splitting or n-fold cross-validation at the sentence level will over-estimate the performance of the method.
- Sometimes it's a good idea to split by date, so that the train, dev, and test data are chronologically ordered. This setup is the most realistic setting for a deployed system, which must predict on future data.
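The document-level splitting advice above can be sketched as follows. This is a minimal illustration, not code from this wiki: the function name, the 80/10/10 ratios, and the `{doc_id: sentences}` input format are all assumptions for the example.

```python
import random

def split_by_document(docs, train_frac=0.8, dev_frac=0.1, seed=13):
    """Split a {doc_id: [sentences]} mapping at the document level.

    Shuffles document IDs (not sentences), so every sentence from a
    given document lands in exactly one of train/dev/test.
    """
    ids = sorted(docs)                 # deterministic base order
    random.Random(seed).shuffle(ids)   # seeded shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    chosen = {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }
    # Flatten each split back to a sentence list.
    return {name: [s for d in doc_ids for s in docs[d]]
            for name, doc_ids in chosen.items()}
```

Splitting sentence IDs instead of document IDs at the shuffle step is exactly the mistake the bullet points warn against.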
Papers
- Gorman & Bedrick 2019 - We need to talk about standard splits. Bad paper, DO NOT USE. Random splits are bad for NLP since the data is highly dependent and not IID; random splits reward overfitting the data. See Søgaard 2020 below.
Tokenization
See Tokenization.
Related Pages