Tokenization
Tokenization is the process of splitting running text (a string of characters) into processing units called tokens, which are usually either words or subword units.
Tokenization usually has a large effect on the performance of a system. When testing methods, it's good to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems. It is common for improvements to tokenization to outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.
For a fixed vocabulary size, the number of tokens per byte gives a measure of tokenization quality (lower is better); see Section 2.1 of the Jurassic-1 paper.
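As a concrete illustration, tokens per byte is just the token count divided by the UTF-8 byte length of the text. The two "tokenizers" below (whitespace and character splitting) are stand-ins for real ones:

```python
def tokens_per_byte(text: str, tokens: list[str]) -> float:
    """Tokens emitted per byte of UTF-8 input; lower means a more
    compact (and, for a fixed vocabulary size, better) tokenization."""
    return len(tokens) / len(text.encode("utf-8"))

# Comparing two toy segmentations of the same (ASCII) sentence:
sentence = "unbelievable results"
word_level = sentence.split()                 # 2 tokens
char_level = list(sentence)                   # 20 tokens
print(tokens_per_byte(sentence, word_level))  # 2 / 20 = 0.1
print(tokens_per_byte(sentence, char_level))  # 20 / 20 = 1.0
```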
Introductions and Overviews
- NLTK Book Ch 3 Section 3.3
Traditional Tokenization
- Stanford Core NLP has a Penn treebank style tokenizer
- NLTK also has a tokenizer
- spaCy has tokenizers
- Moses
- Tokenizer: tokenizer.perl (used, for example, here)
- Detokenizer: detokenizer.perl
- Workshop on Machine Translation (WMT)
- The standard WMT evaluation script mteval-v13a.pl has its own tokenization. This is the default tokenizer used in SacreBLEU
Subword Units
- Byte-Pair Encoding (BPE) Sennrich et al 2016 - Neural Machine Translation of Rare Words with Subword Units
- Blog post: Byte-Pair Encoding: Subword-based tokenization algorithm Warning: may contain mistakes or conceptual errors
- Shared source and target BPE vocabulary usually helps, see best practice advice for byte pair encoding in nmt
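The merge-learning loop at the heart of BPE (as in Sennrich et al 2016) can be sketched in a few lines. The toy corpus and the end-of-word marker `</w>` follow the paper's running example; this is an illustrative sketch, not the reference implementation:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every whitespace-bounded occurrence of the pair
    with its concatenation, in every word of the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn a merge table: repeatedly merge the most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Words are pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, vocab = learn_bpe(corpus, 10)
print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

After ten merges, frequent words such as "newest" collapse into a single symbol while rare ones stay segmented.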
- WordPiece (used in BERT)
- First described in Schuster & Nakajima 2012 and again in Sect 4.1 of Wu et al 2016
- SentencePiece: Kudo & Richardson 2018 - SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing Can be applied to whole sentences, without doing an initial tokenization first. Used in PaLM: “The vocabulary is completely lossless and reversible, which means that whitespace is completely preserved in the vocabulary.”
- BPE Dropout Provilkov et al 2020 - BPE-Dropout: Simple and Effective Subword Regularization Stochastically corrupts the segmentation procedure of BPE, which leads to multiple segmentations and prevents overfitting to the segmentation procedure. Up to 3 BLEU point improvement over BPE and 0.9 BLEU compared to previous subword regularization methods.
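A minimal sketch of how the dropout step changes segmentation at apply time. The merge table here is hypothetical, and the code follows one reading of the paper's procedure, under which segmentation stops once every applicable merge has been dropped in a step:

```python
import random

def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Segment a word with a learned BPE merge table.  With BPE-dropout,
    each candidate merge is independently skipped with probability
    `dropout`, so the same word can get different segmentations."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while True:
        # Collect applicable merges, randomly dropping each one.
        candidates = [(ranks[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in ranks and rng.random() >= dropout]
        if not candidates:
            return symbols
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

# Hypothetical merge table (in priority order), e.g. learned on a toy corpus:
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(bpe_segment("lowest", merges))               # ['low', 'est</w>']
print(bpe_segment("lowest", merges, dropout=0.1))  # occasionally finer-grained
```

With `dropout=0.0` this reduces to deterministic BPE; with `dropout=1.0` the word falls back to characters.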
- Gradient-based Subword Tokenization: Tay et al 2021 - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Effects and Choice of Tokenization
- Ali et al 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial? The HuggingFace implementation of BPE seems to be suboptimal, see Table 3.
Miscellaneous Papers about Tokenization
- Christopoulou et al 2024 - Text-to-Code Generation with Modality-relative Pre-training Adds domain-specific tokens to a pretrained LM for text-to-code
- Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs Finds the implicit vocabulary in a Transformer decoder model
Stopwords
For some tasks, it can be useful to filter out some words (stopwords).
- Stopword lists
- Papers
- Luo et al 2021 - Stopwords in technical language processing (github) Has a general algorithm (and code) for creating stopword lists for new domains
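A minimal filtering sketch; the stopword list here is a tiny illustrative set, not one of the curated lists above:

```python
# A tiny illustrative stopword list; real lists (e.g. NLTK's, or one
# induced for a new domain) are larger and should be chosen per task.
stopwords = {"the", "a", "an", "of", "to", "is", "and", "in"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The choice of tokenizer is crucial in practice".split()
print(remove_stopwords(tokens))  # ['choice', 'tokenizer', 'crucial', 'practice']
```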
Software
- BPE: Subword NMT You need to run a regular tokenizer (like the Moses tokenizer) first before running BPE
- SentencePiece (does BPE and subword regularization): https://github.com/google/sentencepiece You don't need to run the Moses tokenizer first (see the paper)
- fastBPE: A fast C++ implementation of BPE