nlp:tokenization
Differences
This shows you the differences between two versions of the page.
| nlp:tokenization [2023/10/11 19:55] – [Subword Units] jmflanig | nlp:tokenization [2024/07/12 03:28] (current) – [Miscellaneous Papers about Tokenization] jmflanig | ||
|---|---|---|---|
| Line 34: | Line 34: | ||
| * Gradient-based Subword Tokenization: | * Gradient-based Subword Tokenization: | ||
| + | ==== Effects and Choice of Tokenization ==== | ||
| + | * [[https:// | ||
| + | |||
| + | ===== Miscellaneous Papers about Tokenization ===== | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| ===== Stopwords ===== | ===== Stopwords ===== | ||
| Line 43: | Line 49: | ||
| ===== Software ===== | ===== Software ===== | ||
| - | * BPE: [[https:// | + | * BPE: [[https:// |
| * SentencePiece (does BPE and subword regularization): | * SentencePiece (does BPE and subword regularization): | ||
| * [[https:// | * [[https:// | ||
| + | |||
| + | ===== Related Pages ===== | ||
| + | * [[Data Preparation]] | ||
nlp/tokenization.1697054155.txt.gz · Last modified: 2023/10/11 19:55 by jmflanig