====== Tokenization ======
  
For a fixed vocabulary size, the number of tokens per byte gives a measure of tokenization quality (lower is better); see the [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Jurassic-1 paper]], Section 2.1.
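The tokens-per-byte metric is straightforward to compute; a minimal sketch, where a hypothetical `tokenize` callable stands in for a real subword tokenizer such as BPE or SentencePiece:

```python
# Toy illustration of the tokens-per-byte metric. The `tokenize` argument
# is a stand-in for a real subword tokenizer (this is not library code).
def tokens_per_byte(text: str, tokenize) -> float:
    """Tokens produced per UTF-8 byte of input (lower is better)."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenize(text)) / n_bytes

text = "a simple example sentence"
coarse = tokens_per_byte(text, str.split)  # word-level: few tokens per byte
fine = tokens_per_byte(text, list)         # character-level: many tokens per byte
assert coarse < fine  # finer segmentation costs more tokens per byte
```

Comparing two trained tokenizers on the same held-out text in this way gives the quality measure described above.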

===== Introductions and Overviews =====
  * [[https://www.nltk.org/book/ch03.html|NLTK Book Ch 3]] Section 3.3
  * [[https://web.stanford.edu/~jurafsky/slp3/2.pdf|Jurafsky & Martin Ch 2]]
  
===== Traditional Tokenization =====
  * Subword Regularization [[https://arxiv.org/pdf/1804.10959.pdf|Kudo 2018 - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates]]
  * BPE Dropout [[https://arxiv.org/pdf/1910.13267.pdf|Provilkov et al 2020 - BPE-Dropout: Simple and Effective Subword Regularization]] Stochastically corrupts the segmentation procedure of BPE, which yields multiple segmentations of the same text and prevents overfitting to a single segmentation. Reports up to a 3 BLEU point improvement over BPE and 0.9 BLEU over previous subword regularization methods.
  * Gradient-based Subword Tokenization: **[[https://arxiv.org/pdf/2106.12672.pdf|Tay et al 2021 - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization]]**
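The BPE-dropout idea above fits in a few lines; a simplified sketch (not the paper's implementation — here segmentation stops once every eligible merge is dropped in a pass, and `bpe_segment` is a name chosen for illustration):

```python
import random

def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Greedy BPE segmentation with BPE-style dropout: each eligible merge
    is skipped with probability `dropout`, yielding varied segmentations."""
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}  # earlier merge = higher priority
    while True:
        # Adjacent pairs that are known merges and survive dropout this pass.
        candidates = [(rank[p], i)
                      for i, p in enumerate(zip(symbols, symbols[1:]))
                      if p in rank and rng.random() >= dropout]
        if not candidates:
            return symbols
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # toy merge table
assert bpe_segment("lower", merges) == ["low", "er"]  # deterministic BPE

rng = random.Random(0)
samples = {tuple(bpe_segment("lower", merges, dropout=0.5, rng=rng))
           for _ in range(50)}
assert len(samples) > 1  # dropout produces multiple segmentations
```

With `dropout=0` this reduces to ordinary deterministic BPE; with `dropout>0` the same word is segmented differently across training epochs, which is the regularization effect the paper exploits.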
  
==== Effects and Choice of Tokenization ====
  * [[https://arxiv.org/pdf/2310.08754|Ali et al 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial?]] The HuggingFace implementation of BPE appears to be suboptimal; see Table 3.

===== Miscellaneous Papers about Tokenization =====
  * [[https://aclanthology.org/2024.eacl-long.72.pdf|Christopoulou et al 2024 - Text-to-Code Generation with Modality-relative Pre-training]] Adds domain-specific tokens to a pretrained LM for text-to-code generation
  * [[https://arxiv.org/pdf/2406.20086|Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs]] Identifies the implicit vocabulary of a Transformer decoder model via token erasure
  
===== Stopwords =====
  
===== Software =====
  * BPE: [[https://github.com/rsennrich/subword-nmt|Subword NMT]] Requires running a regular tokenizer (such as the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl|Moses tokenizer]]) before applying BPE
  * SentencePiece (does BPE and subword regularization): [[https://github.com/google/sentencepiece]] Does not require running the Moses tokenizer first (see the [[https://aclanthology.org/D18-2012.pdf|paper]])
  * [[https://github.com/glample/fastBPE|fastBPE]]: A fast C++ implementation of BPE
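The vocabulary-learning step that these tools perform (as introduced by Sennrich et al's Subword NMT) is simple at its core; a toy Python sketch, not the optimized library code, assuming a whitespace-pretokenized corpus:

```python
from collections import Counter

def learn_bpe(corpus: str, num_merges: int):
    """Toy BPE vocabulary learning: repeatedly merge the most frequent
    adjacent symbol pair across the word-frequency table."""
    # Word-frequency table, with each word as a tuple of character symbols.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        merged, new_vocab = best[0] + best[1], Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe("low low low lower lower newest newest", 3)
assert merges[:2] == [("l", "o"), ("lo", "w")]
```

The real tools add essentials this sketch omits: end-of-word markers, frequency thresholds, and efficient pair-count updates rather than full recounts.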

===== Related Pages =====
  * [[Data Preparation]]
  
nlp/tokenization · Last modified: 2023/06/15 07:36 (external edit)
