====== Tokenization ======

  
Tokenization usually has a large effect on the performance of a system. When testing methods, it's good to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems. It is common for improvements to tokenization to outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.
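As a concrete reference point, the merge-learning step of BPE can be sketched in a few lines of Python. This is an illustrative toy version (function and variable names are made up here), not the optimized code from the libraries listed below:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary.

    `corpus` maps whitespace-pretokenized words to their counts; each word
    starts as a sequence of characters, and the most frequent adjacent
    symbol pair is merged into one symbol at every step.
    """
    vocab = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges
```

For example, ''learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, 2)'' first merges ''("u", "g")'' (frequency 20) and then ''("h", "ug")'' (frequency 15).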

For a fixed vocabulary size, the number of tokens per byte can serve as a measure of tokenization quality (lower is better); see the [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Jurassic-1 paper]], section 2.1.

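This measure is easy to compute for any tokenizer. A minimal sketch (the function name is made up, and the whitespace/character tokenizers in the example are just stand-ins for a real tokenizer):

```python
def tokens_per_byte(texts, tokenize):
    """Tokens emitted per UTF-8 byte of input, averaged over `texts`.

    Lower is better: a tokenizer that covers the same bytes with fewer
    tokens compresses the text more effectively.
    """
    n_tokens = sum(len(tokenize(text)) for text in texts)
    n_bytes = sum(len(text.encode("utf-8")) for text in texts)
    return n_tokens / n_bytes

# Character-level tokenization: one token per byte for ASCII text.
print(tokens_per_byte(["hello world"], list))       # 1.0
# Whitespace tokenization: 2 tokens over 11 bytes.
print(tokens_per_byte(["hello world"], str.split))  # 0.1818...
```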
===== Introductions and Overviews =====
  * [[https://www.nltk.org/book/ch03.html|NLTK Book Ch 3]] Section 3.3
  * [[https://web.stanford.edu/~jurafsky/slp3/2.pdf|Jurafsky & Martin Ch 2]]
  
===== Traditional Tokenization =====
  * Subword Regularization [[https://arxiv.org/pdf/1804.10959.pdf|Kudo 2018 - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates]]
  * BPE Dropout [[https://arxiv.org/pdf/1910.13267.pdf|Provilkov et al 2020 - BPE-Dropout: Simple and Effective Subword Regularization]] Stochastically corrupts the segmentation procedure of BPE, which yields multiple segmentations per word and prevents overfitting to a single segmentation. Up to a 3 BLEU point improvement over BPE, and 0.9 BLEU over previous subword regularization methods.
  * Gradient-based Subword Tokenization: **[[https://arxiv.org/pdf/2106.12672.pdf|Tay et al 2021 - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization]]**
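The BPE-dropout idea can be sketched roughly as follows, assuming merges are given as a ranked list of symbol pairs (as learned by BPE). This simplified version (names are illustrative) stops early if every remaining candidate merge is dropped in a round, whereas the paper's algorithm differs in some details:

```python
import random

def bpe_dropout_segment(word, merges, p=0.1, rng=random):
    """Segment `word` using BPE merges, randomly skipping each candidate
    merge with probability `p` (the BPE-dropout idea).

    With p=0 this reduces to ordinary deterministic BPE segmentation;
    with p=1 the word stays split into characters.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect applicable merges, dropping each with probability p.
        candidates = [(rank[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in rank and rng.random() >= p]
        if not candidates:
            break
        # Apply the highest-priority (lowest-rank) surviving merge.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

toy_merges = [("u", "g"), ("h", "ug"), ("hug", "s")]
print(bpe_dropout_segment("hugs", toy_merges, p=0))  # ['hugs']
print(bpe_dropout_segment("hugs", toy_merges, p=1))  # ['h', 'u', 'g', 's']
```

At training time, re-segmenting each example with ''0 < p < 1'' exposes the model to many different segmentations of the same word.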
  
==== Effects and Choice of Tokenization ====
  * [[https://arxiv.org/pdf/2310.08754|Ali et al 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial?]] The HuggingFace implementation of BPE seems to be suboptimal; see table 3.

===== Miscellaneous Papers about Tokenization =====
  * [[https://aclanthology.org/2024.eacl-long.72.pdf|Christopoulou et al 2024 - Text-to-Code Generation with Modality-relative Pre-training]] Adds domain-specific tokens to a pretrained LM for text-to-code
  * [[https://arxiv.org/pdf/2406.20086|Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs]] Finds the implicit vocabulary in a Transformer decoder model
  
===== Stopwords =====
  
===== Software =====
  * BPE: [[https://github.com/rsennrich/subword-nmt|Subword NMT]] You need to run a regular tokenizer (like the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl|Moses tokenizer]]) first before running BPE
  * SentencePiece (does BPE and subword regularization): [[https://github.com/google/sentencepiece]] You don't need to run the Moses tokenizer first (see the [[https://aclanthology.org/D18-2012.pdf|paper]])
  * [[https://github.com/glample/fastBPE|fastBPE]]: A fast C++ implementation of BPE

===== Related Pages =====
  * [[Data Preparation]]
  
nlp/tokenization.1659089808.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
