nlp:tokenization
Current revision: 2024/07/12 03:28 by jmflanig.
Tokenization usually has a large effect on the performance of a system.
| + | |||
| + | For a fixed size of vocabulary, the number of tokens per byte can give a measure of the quality of the tokenization (lower is better), see [[https:// | ||
| + | |||
| + | ===== Introductions and Overviews ===== | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
===== Traditional Tokenization =====
  * A shared source and target BPE vocabulary usually helps; see [[https://
  * [[https://
  * WordPiece (used in BERT)
    * First described in [[https://
    * [[https://
  * SentencePiece:
  * Subword Regularization [[https://
  * BPE Dropout [[https://
  * Gradient-based Subword Tokenization:
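The core of BPE is simple enough to sketch in a few lines: repeatedly count adjacent symbol pairs and merge the most frequent one. This is a toy version of the Sennrich et al. algorithm; real implementations (subword-nmt, SentencePiece) add frequency thresholds, special tokens, and much faster pair counting.

```python
# Minimal BPE merge learning on a {word: count} corpus (illustrative only).
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merges; words start as character tuples."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
# On this toy corpus the first merges build up the shared stem "low".
```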
| + | ==== Effects and Choice of Tokenization ==== | ||
| + | * [[https:// | ||
| + | |||
| + | ===== Miscellaneous Papers about Tokenization ===== | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
===== Stopwords =====
===== Software =====
| - | * BPE: [[https:// | + | * BPE: [[https:// |
| - | * SentencePiece (does BPE and subword regularization): | + | * SentencePiece (does BPE and subword regularization): |
| * [[https:// | * [[https:// | ||
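At encoding time these tools apply a learned merge list to new text. A rough pure-Python sketch of that greedy step, always applying the earliest-learned (highest-priority) merge first; the merge list below is made up for illustration, and real tools load it from a trained model file.

```python
# Sketch: applying a ranked BPE merge list to a single word.

def apply_bpe(word, merges):
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    while True:
        # Find all adjacent pairs that have a learned merge.
        pairs = [(ranks[p], j)
                 for j, p in enumerate(zip(symbols, symbols[1:]))
                 if p in ranks]
        if not pairs:
            return symbols
        # Apply the highest-priority merge, then re-scan.
        _, j = min(pairs)
        symbols = symbols[:j] + [symbols[j] + symbols[j + 1]] + symbols[j + 2:]

tokens = apply_bpe("lowest", [("l", "o"), ("lo", "w"), ("e", "s")])
# -> ["low", "es", "t"]
```

The re-scan after every merge is quadratic; production implementations use priority queues over candidate pairs instead.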
| + | |||
| + | ===== Related Pages ===== | ||
| + | * [[Data Preparation]] | ||
nlp/tokenization.1659089540.txt.gz · Last modified: 2023/06/15 07:36 (external edit)