nlp:tokenization · Last modified: 2024/07/12 03:28 by jmflanig
  * Gradient-based Subword Tokenization: **[[https://arxiv.org/pdf/2106.12672.pdf|Tay et al 2021 - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization]]**
  
==== Effects and Choice of Tokenization ====
  * [[https://arxiv.org/pdf/2310.08754|Ali et al 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial?]] The HuggingFace implementation of BPE appears to be suboptimal; see Table 3.

===== Miscellaneous Papers about Tokenization =====
  * [[https://aclanthology.org/2024.eacl-long.72.pdf|Christopoulou et al 2024 - Text-to-Code Generation with Modality-relative Pre-training]] Adds domain-specific tokens to a pretrained LM for text-to-code generation
  * [[https://arxiv.org/pdf/2406.20086|Feucht et al 2024 - Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs]] Finds the implicit vocabulary of a Transformer decoder model
  
===== Stopwords =====
  
===== Software =====
  * BPE: [[https://github.com/rsennrich/subword-nmt|Subword NMT]] You need to run a regular tokenizer (like the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl|Moses tokenizer]]) before running BPE
  * SentencePiece (does BPE and subword regularization): [[https://github.com/google/sentencepiece]] You don't need to run the Moses tokenizer first (see the [[https://aclanthology.org/D18-2012.pdf|paper]])
  * [[https://github.com/glample/fastBPE|fastBPE]]: A fast C++ implementation of BPE
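As a rough illustration of what these tools learn, a single BPE merge step can be sketched in pure Python. This is a toy sketch of the greedy merge idea, not the implementation used by any of the tools above; the toy corpus and the ''</w>'' end-of-word marker follow the convention from the original BPE paper.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus and return the most frequent.

    `words` maps a word (as a tuple of symbols) to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character-split words with an end-of-word marker.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}
pair = most_frequent_pair(corpus)   # ('l', 'o'), occurring 5 + 2 = 7 times
corpus = merge_pair(corpus, pair)
```

A real BPE learner simply repeats this step until the desired vocabulary size is reached; the tools above add the pre-tokenization, frequency indexing, and I/O that make it practical.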

===== Related Pages =====
  * [[Data Preparation]]
  
