Tokenization is the process of splitting running text (a string of characters) into processing units, called tokens, which are usually either words or subword units.
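As an illustration, a minimal word-level tokenizer can be sketched with a regular expression that separates runs of word characters from punctuation; real subword tokenizers (BPE, WordPiece) are considerably more sophisticated, and the function name here is just for illustration.

```python
import re

def word_tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches single
    # punctuation marks, so "doesn't" -> ["doesn", "'", "t"]
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenization splits text, doesn't it?"))
# -> ['Tokenization', 'splits', 'text', ',', 'doesn', "'", 't', 'it', '?']
```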
Tokenization usually has a large effect on the performance of a system. When testing methods, it is good practice to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems, since improvements to tokenization can outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.
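The core of BPE can be sketched in a few lines: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new symbol. This is a toy illustration of the merge loop only, not a full trainer (no vocabulary bookkeeping, no special handling of word boundaries).

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(2):  # apply two merges: first "lo", then "low"
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
# -> ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After enough merges, frequent words become single tokens while rare words remain split into subword pieces.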
For a fixed vocabulary size, the number of tokens per byte of text gives a measure of tokenization quality (lower is better); see Section 2.1 of the Jurassic-1 paper.
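The metric is straightforward to compute; the sketch below compares a coarse word-level split with a character-level split on the same text, where the lower value indicates the more compressive tokenization.

```python
def tokens_per_byte(tokens, text):
    # Ratio of token count to the UTF-8 byte length of the text
    return len(tokens) / len(text.encode("utf-8"))

text = "tokenization matters"
word_level = text.split()                 # 2 tokens
char_level = list(text)                   # 20 tokens
print(tokens_per_byte(word_level, text))  # -> 0.1
print(tokens_per_byte(char_level, text))  # -> 1.0
```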
For some tasks, it can be useful to filter out certain very frequent words (stopwords).
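Stopword filtering is a simple list lookup; the stopword set below is an illustrative stand-in, not a standard resource (libraries such as NLTK ship curated lists).

```python
# Illustrative stopword set; real lists are larger and language-specific
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "on"}

def filter_stopwords(tokens):
    # Drop tokens whose lowercased form is in the stopword set
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(filter_stopwords(["The", "cat", "is", "on", "a", "mat"]))
# -> ['cat', 'mat']
```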