nlp:data_preparation
| Next revision | Previous revision | ||
| nlp:data_preparation [2021/03/01 22:15] – created jmflanig | nlp:data_preparation [2023/06/15 07:36] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| ===== Creating a Train/ | ===== Creating a Train/ | ||
| Generally, you'll want to use an existing train/ | Generally, you'll want to use an existing train/ | ||
| - | * Avoid putting the same document in more than one of the training, dev, and test sets. This means you probably want to randomize | + | * Avoid putting the same document in more than one of the training, dev, and test sets. This means randomizing |
| - | * Do not use n-fold cross-validation across sentences. | + | * Do not use n-fold cross-validation across sentences. |
| * Sometimes it's a good idea to split by date, so you have train, dev, test data chronologically ordered. | * Sometimes it's a good idea to split by date, so you have train, dev, test data chronologically ordered. | ||
| + | * [[https:// | ||
| + | ==== Papers ==== | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | |||
| + | ===== Tokenization ===== | ||
| + | See [[Tokenization]]. | ||
| + | |||
| + | ===== Related Pages ===== | ||
| + | * [[ml:Data Cleaning and Validation]] | ||
| + | * [[Dataset Creation]] | ||
| + | * [[Language Identification]] | ||
| + | * [[Tokenization]] | ||
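
The splitting advice in the list above (keep each document in a single split, and optionally order the splits by date) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the page's prescribed method: the function names `split_by_document` and `split_by_date`, the 10% dev/test fractions, and the fixed seed are all hypothetical choices, and document ids are assumed to be sortable values such as strings.

```python
import random


def split_by_document(sentences, doc_ids, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole documents to train/dev/test (hypothetical helper).

    sentences and doc_ids are parallel lists; doc_ids[i] names the
    document that sentences[i] came from.
    """
    # Shuffle the unique document ids, not the individual sentences,
    # so that no document can end up in more than one split.
    docs = sorted(set(doc_ids))
    random.Random(seed).shuffle(docs)

    # Assumed sizes: at least one document each for dev and test.
    n_dev = max(1, int(len(docs) * dev_frac))
    n_test = max(1, int(len(docs) * test_frac))
    dev_docs = set(docs[:n_dev])
    test_docs = set(docs[n_dev:n_dev + n_test])

    splits = {"train": [], "dev": [], "test": []}
    for sentence, doc in zip(sentences, doc_ids):
        if doc in dev_docs:
            splits["dev"].append(sentence)
        elif doc in test_docs:
            splits["test"].append(sentence)
        else:
            splits["train"].append(sentence)
    return splits


def split_by_date(examples, dev_frac=0.1, test_frac=0.1):
    """Chronological split (hypothetical helper).

    examples is a list of (date, item) pairs with sortable dates;
    the oldest items go to train and the newest to test.
    """
    ordered = sorted(examples, key=lambda pair: pair[0])
    n = len(ordered)
    n_dev = max(1, int(n * dev_frac))
    n_test = max(1, int(n * test_frac))
    return {
        "train": ordered[: n - n_dev - n_test],
        "dev": ordered[n - n_dev - n_test : n - n_test],
        "test": ordered[n - n_test :],
    }
```

Shuffling document ids rather than sentences is what keeps near-duplicate context from leaking between splits; the same grouping idea applies if cross-validation folds are needed.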
nlp/data_preparation.1614636900.txt.gz · Last modified: 2023/06/15 07:36 (external edit)