====== nlp:data_preparation ======

  * Do not use n-fold cross-validation across sentences.  NLP data is highly non-IID because sentences in context are highly related to each other.  Random splitting or n-fold cross-validation will overestimate the performance of the method.
  * Sometimes it's a good idea to split by date, so that the train, dev, and test data are chronologically ordered.  This setup is the most realistic setting for a deployed system.
  * [[https://arxiv.org/pdf/1908.07898.pdf|Geva 2019]] argues that test set annotators should be disjoint from training set annotators.
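The split strategies above can be sketched in plain Python. This is a minimal illustration, not a library recipe; the `doc_id` and `date` field names are assumptions for the example. Splitting on document boundaries keeps related sentences together, and sorting by date puts train before dev before test chronologically:

```python
from datetime import date

# Toy corpus: each sentence carries its source document and a date
# (field names here are assumptions for illustration).
corpus = [
    {"text": "Sentence A1.", "doc_id": "doc_a", "date": date(2020, 1, 5)},
    {"text": "Sentence A2.", "doc_id": "doc_a", "date": date(2020, 1, 5)},
    {"text": "Sentence B1.", "doc_id": "doc_b", "date": date(2020, 6, 1)},
    {"text": "Sentence C1.", "doc_id": "doc_c", "date": date(2021, 2, 9)},
    {"text": "Sentence C2.", "doc_id": "doc_c", "date": date(2021, 2, 9)},
]

def chronological_split(sentences, train_frac=0.6, dev_frac=0.2):
    """Split on document boundaries, ordered by date, so that
    train < dev < test chronologically and no document's sentences
    are scattered across splits."""
    ordered_docs, seen = [], set()
    for s in sorted(sentences, key=lambda s: s["date"]):
        if s["doc_id"] not in seen:
            seen.add(s["doc_id"])
            ordered_docs.append(s["doc_id"])
    n = len(ordered_docs)
    train_docs = set(ordered_docs[: int(n * train_frac)])
    dev_docs = set(ordered_docs[int(n * train_frac): int(n * (train_frac + dev_frac))])
    train = [s for s in sentences if s["doc_id"] in train_docs]
    dev = [s for s in sentences if s["doc_id"] in dev_docs]
    test = [s for s in sentences if s["doc_id"] not in train_docs | dev_docs]
    return train, dev, test

train, dev, test = chronological_split(corpus)
```

The same document-boundary idea applies even without dates: shuffling whole documents (rather than individual sentences) into folds already avoids the worst of the sentence-level leakage described above.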
  
==== Papers ====
  * [[https://www.aclweb.org/anthology/P19-1267.pdf|Gorman & Bedrick 2019 - We need to talk about standard splits]] Bad paper, DO NOT USE.  See [[https://www.aclweb.org/anthology/2021.eacl-main.156.pdf|Søgaard 2020]] below.
  * [[https://www.aclweb.org/anthology/2021.eacl-main.156.pdf|Søgaard et al 2020 - We Need to Talk About Random Splits]] Finds that both random and standard splits reward overfitting the training data.
  * [[https://arxiv.org/pdf/1908.07898.pdf|Geva et al 2019 - Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets]] Argues that test set annotators should be disjoint from training set annotators.
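Geva et al.'s recommendation can be sketched as a group split on annotator IDs: hold out entire annotators, so no test-set annotator ever appears in training. This is a minimal sketch; the `annotator` field name is an assumption for the example:

```python
import random

# Toy annotated dataset; the "annotator" field name is hypothetical.
examples = [{"id": i, "annotator": f"ann_{i % 5}"} for i in range(50)]

def annotator_disjoint_split(data, test_frac=0.2, seed=0):
    """Hold out whole annotators so that the set of test-set
    annotators is disjoint from the training-set annotators."""
    annotators = sorted({ex["annotator"] for ex in data})
    rng = random.Random(seed)
    rng.shuffle(annotators)
    n_test = max(1, int(len(annotators) * test_frac))
    test_annotators = set(annotators[:n_test])
    train = [ex for ex in data if ex["annotator"] not in test_annotators]
    test = [ex for ex in data if ex["annotator"] in test_annotators]
    return train, test

train, test = annotator_disjoint_split(examples)
```

Note that split sizes become uneven when annotators contributed different numbers of examples; that imbalance is the price of measuring generalization across annotators rather than memorization of annotator-specific artifacts.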
  
  
  
===== Related Pages =====
  * [[ml:Data Cleaning and Validation]]
  * [[Dataset Creation]]
  * [[Language Identification]]
nlp/data_preparation · Last modified: 2023/06/15 07:36
