
Tokenization

Tokenization is the process of splitting running text (a string of characters) into processing units, called tokens, which are usually words or subword units.

Tokenization usually has a large effect on the performance of a system. When testing methods, it's good to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems. It is common for improvements to tokenization to outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.
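To make the BPE baseline concrete, below is a minimal sketch of how BPE merge operations are learned: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair. This follows the style of the Sennrich et al. reference implementation, but the function names and the `</w>` end-of-word marker here are illustrative, not any particular library's API.

```python
import re
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a whitespace-tokenized corpus (toy sketch)."""
    # represent each word as space-separated characters plus an end-of-word marker
    vocab = Counter(" ".join(word) + " </w>" for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # merge the best pair everywhere it occurs as whole symbols
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), w): f for w, f in vocab.items()})
        merges.append(best)
    return merges

# e.g. learn_bpe("low low low lower lowest", 3) first merges ('l', 'o')
```

At test time, the learned merges are applied to new words in the same order, so frequent words collapse into single tokens while rare words fall back to smaller subword pieces.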

For a fixed vocabulary size, the number of tokens per byte can give a measure of the quality of the tokenization (lower is better); see Section 2.1 of the Jurassic-1 paper.
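The tokens-per-byte measure is straightforward to compute; a small helper (the function name is illustrative) might look like:

```python
def tokens_per_byte(text, tokenize):
    """Tokens per UTF-8 byte: lower means the tokenizer compresses the text better."""
    n_tokens = len(tokenize(text))
    n_bytes = len(text.encode("utf-8"))
    return n_tokens / n_bytes

# whitespace splitting on "hello world" gives 2 tokens over 11 bytes
ratio = tokens_per_byte("hello world", str.split)
```

Comparing two tokenizers with this ratio is only meaningful when both are constrained to the same vocabulary size, since a larger vocabulary trivially lowers tokens per byte.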

Introductions and Overviews

Traditional Tokenization

  • Stanford Core NLP has a Penn treebank style tokenizer
  • NLTK also has a tokenizer
  • spaCy has tokenizers
  • Moses
  • Workshop on Machine Translation (WMT)
    • Standard WMT evaluation script mteval-v13a.pl has its own tokenization. This is the default tokenizer used in SacreBLEU
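Traditional tokenizers like the ones above are largely rule-based. As a rough illustration only (not the full Penn Treebank specification or any of the listed tools), two typical rules are splitting off punctuation and splitting English contractions:

```python
import re

def ptb_like_tokenize(text):
    """Very rough sketch of two Penn Treebank-style rules (illustrative, incomplete)."""
    # separate common punctuation marks from adjacent words
    text = re.sub(r"([,.!?;:()\"])", r" \1 ", text)
    # split contractions like "don't" -> "do n't" and "it's" -> "it 's"
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m)|n't)\b", r"\1 \2", text)
    return text.split()

# ptb_like_tokenize("They don't know, right?")
# -> ['They', 'do', 'n't', 'know', ',', 'right', '?']
```

Real implementations handle many more cases (abbreviations, URLs, hyphenation, quotes), which is why off-the-shelf tokenizers like the ones listed above are preferred over ad hoc rules.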

Subword Units

Effects and Choice of Tokenization

Miscellaneous Papers about Tokenization

Stopwords

For some tasks, it can be useful to filter out very common, low-information words (stopwords), such as "the" and "of".
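Stopword filtering is typically a simple set-membership check over tokens. A minimal sketch, using a tiny illustrative stopword list (real lists, e.g. NLTK's, are much larger):

```python
# tiny illustrative stopword list; not from any particular library
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

# remove_stopwords("the cat is in the hat".split()) -> ['cat', 'hat']
```

Whether filtering helps is task-dependent: it can help bag-of-words models like retrieval or topic modeling, but it discards information that sequence models often need.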

Software

nlp/tokenization.txt · Last modified: 2024/07/12 03:28 by jmflanig
