====== Tokenization ======
  
Tokenization usually has a large effect on the performance of a system. When testing methods, it's good to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems. It is common for improvements to tokenization to outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.

For a fixed vocabulary size, the number of tokens per byte gives a measure of tokenization quality (lower is better); see the [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Jurassic-1 paper]], Section 2.1.

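The tokens-per-byte measure is easy to compute. A minimal sketch in Python, using a whitespace split only as a stand-in for a real subword tokenizer:

```python
def tokens_per_byte(tokens: list[str], text: str) -> float:
    """Token count divided by UTF-8 byte count of the original text.
    Lower is better: fewer tokens are needed to cover the same bytes."""
    return len(tokens) / len(text.encode("utf-8"))

# Whitespace split is a stand-in; in practice `tokens` would come from
# a trained BPE / SentencePiece model over the same text.
text = "a minimal example sentence"
print(tokens_per_byte(text.split(), text))
```

Comparing this number between two tokenizers is only meaningful at the same vocabulary size, as the paper notes.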
===== Introductions and Overviews =====
  * [[https://www.nltk.org/book/ch03.html|NLTK Book Ch 3]], Section 3.3
  * [[https://web.stanford.edu/~jurafsky/slp3/2.pdf|Jurafsky & Martin Ch 2]]
  
===== Traditional Tokenization =====
    * Shared source and target BPE vocabulary usually helps; see [[https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt|best practice advice for byte pair encoding in NMT]]
    * [[https://arxiv.org/pdf/2004.03720.pdf|Bostrom & Durrett 2020 - Byte Pair Encoding is Suboptimal for Language Model Pretraining]]
  * WordPiece (used in BERT)
    * First described in [[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf|Schuster & Nakajima 2012]] and again in Sect 4.1 of [[https://arxiv.org/pdf/1609.08144.pdf|Wu et al 2016]]
    * [[https://arxiv.org/pdf/2012.15524.pdf|Song et al 2020 - Fast WordPiece Tokenization]]
  * SentencePiece: [[https://aclanthology.org/D18-2012.pdf|Kudo & Richardson 2018 - SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing]] Can be applied to whole sentences, without doing an initial tokenization first. Used in [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]: "The vocabulary is completely lossless and reversible, which means that whitespace is completely preserved in the vocabulary."
  * Subword Regularization: [[https://arxiv.org/pdf/1804.10959.pdf|Kudo 2018 - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates]]
  * BPE Dropout: [[https://arxiv.org/pdf/1910.13267.pdf|Provilkov et al 2020 - BPE-Dropout: Simple and Effective Subword Regularization]] Stochastically corrupts the segmentation procedure of BPE, which yields multiple segmentations per word and prevents overfitting to a single segmentation. Up to a 3 BLEU point improvement over BPE and 0.9 BLEU over previous subword regularization methods.
  * Gradient-based Subword Tokenization: **[[https://arxiv.org/pdf/2106.12672.pdf|Tay et al 2021 - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization]]**
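
As a reference point for the variants above, the original BPE training procedure (Sennrich et al) is short enough to sketch directly: count symbol pairs over a word-frequency dictionary and repeatedly merge the most frequent pair. This is a toy sketch for intuition, not a substitute for the optimized implementations listed under Software:

```python
from collections import Counter


def pair_counts(vocab: dict) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency.
    `vocab` maps each word (a tuple of symbols) to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(vocab: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    a, b = pair
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs: dict, num_merges: int):
    """Learn `num_merges` BPE merges from a word -> frequency dict."""
    # Start from characters, with an end-of-word marker as in the paper.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab
```

On the word-frequency example from the Sennrich et al paper, `learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)` learns the merges `("e", "s")`, `("es", "t")`, `("est", "</w>")`, building up the frequent suffix `est</w>`.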
  
==== Effects and Choice of Tokenization ====
  * [[https://arxiv.org/pdf/2310.08754|Ali et al 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial?]] The HuggingFace implementation of BPE appears to be suboptimal; see Table 3.

===== Miscellaneous Papers about Tokenization =====
  * [[https://aclanthology.org/2024.eacl-long.72.pdf|Christopoulou et al 2024 - Text-to-Code Generation with Modality-relative Pre-training]] Adds domain-specific tokens to a pretrained LM for text-to-code
  * [[https://arxiv.org/pdf/2406.20086|Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs]] Finds the implicit vocabulary of a Transformer decoder model
  
===== Stopwords =====
  
===== Software =====
  * BPE: [[https://github.com/rsennrich/subword-nmt|Subword NMT]] Need to run a regular tokenizer (like the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl|Moses tokenizer]]) first before running BPE
  * SentencePiece (does BPE and subword regularization): [[https://github.com/google/sentencepiece]] You don't need to run the Moses tokenizer first (see the [[https://aclanthology.org/D18-2012.pdf|paper]])
   * [[https://github.com/glample/fastBPE|fastBPE]]: A fast C++ implementation of BPE   * [[https://github.com/glample/fastBPE|fastBPE]]: A fast C++ implementation of BPE
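Whichever implementation is used, applying learned merges at test time amounts to replaying them in the order they were learned. A minimal sketch of deterministic BPE segmentation (the merge list here is hypothetical; in practice it comes from the codes/model file produced by the tools above):

```python
def apply_bpe(word: str, merges: list) -> list:
    """Segment a word by replaying BPE merges in the learned order."""
    symbols = list(word) + ["</w>"]  # end-of-word marker, as in Sennrich et al
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, as might be learned from a small corpus:
merges = [("e", "s"), ("es", "t"), ("est", "</w>")]
print(apply_bpe("lowest", merges))  # -> ['l', 'o', 'w', 'est</w>']
```

Subword regularization and BPE-dropout differ exactly here: instead of this deterministic replay, they sample among segmentations at training time.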

===== Related Pages =====
  * [[Data Preparation]]
  