====== Data Preparation ======

===== Creating a Train/Dev/Test Split =====

Generally, you'll want to use an existing train/dev/test split if one exists for the dataset, so you can compare to previous methods. If the dataset doesn't have a split, it may not be a standard NLP dataset, and it may be better to use a different dataset that is more widely used in experiments.

If you need to create your own train/dev/test split, here are some things to consider:

  * Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing the //documents// that go into the train/dev/test split, not the sentences.
  * Do not use n-fold cross-validation across sentences. NLP data is highly non-iid because sentences in context are highly related to each other. Random splitting or n-fold cross-validation across sentences will over-estimate the performance of the method.
  * Sometimes it's a good idea to split by date, so the train, dev, and test data are chronologically ordered. This setup is the most realistic setting for a deployed system.
  * [[https://arxiv.org/pdf/1908.07898.pdf|Geva 2019]] argues that test set annotators should be disjoint from training set annotators.

==== Papers ====

  * [[https://www.aclweb.org/anthology/P19-1267.pdf|Gorman & Bedrick 2019 - We need to talk about standard splits]] Bad paper, DO NOT USE. See [[https://www.aclweb.org/anthology/2021.eacl-main.156.pdf|Søgaard et al 2021]] below.
  * [[https://www.aclweb.org/anthology/2021.eacl-main.156.pdf|Søgaard et al 2021 - We Need to Talk About Random Splits]] Finds that both random and standard splits reward overfitting the training data.
  * [[https://arxiv.org/pdf/1908.07898.pdf|Geva et al 2019 - Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets]] Argues that test set annotators should be disjoint from training set annotators.

===== Tokenization =====

See [[Tokenization]].
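The document-level splitting advice above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the ''document_split'' function and the dict-of-sentences structure are hypothetical, assuming your corpus is already grouped by document id. The key point is that we shuffle and partition //document ids//, so all sentences from one document land in exactly one split.

```python
import random

def document_split(docs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split at the document level so no document spans train/dev/test.

    `docs` maps a document id to its list of sentences
    (hypothetical structure, for illustration only).
    """
    doc_ids = sorted(docs)                # deterministic order before shuffling
    random.Random(seed).shuffle(doc_ids)  # shuffle documents, not sentences
    n = len(doc_ids)
    n_test = max(1, int(n * test_frac))
    n_dev = max(1, int(n * dev_frac))
    test_ids = doc_ids[:n_test]
    dev_ids = doc_ids[n_test:n_test + n_dev]
    train_ids = doc_ids[n_test + n_dev:]

    def gather(ids):
        # Flatten the chosen documents back into a list of sentences.
        return [sent for doc_id in ids for sent in docs[doc_id]]

    return gather(train_ids), gather(dev_ids), gather(test_ids)
```

For a chronological split, you would sort ''doc_ids'' by each document's date instead of shuffling, then take the earliest documents as train and the latest as test.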
===== Related Pages =====

  * [[ml:Data Cleaning and Validation]]
  * [[Dataset Creation]]
  * [[Language Identification]]
  * [[Tokenization]]