====== Pretraining ======

===== Overviews =====

See also [[language_model#overviews|Language Model - Overviews]].

  * **[[https://arxiv.org/pdf/2003.07278.pdf|Liu et al 2020 - A Survey on Contextual Embeddings]]**
  * [[https://arxiv.org/pdf/2003.08271.pdf|Qiu et al 2020 - Pre-trained Models for Natural Language Processing: A Survey]] Nice tables of pretraining methods on pages 9 and 10; see [[Pretraining#Taxonomy of Pretraining Methods]] below.
  * **[[https://arxiv.org/pdf/2303.18223.pdf|Zhao et al 2023 - A Survey of Large Language Models]]**

===== Key and Early Papers =====

For a history, see section 2.4 of [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]] or the related work in the [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2 paper]].

  * [[https://arxiv.org/pdf/1103.0398|Collobert et al 2011 - Natural Language Processing (almost) from Scratch]]
  * [[https://arxiv.org/pdf/1506.06726|Kiros et al 2015 - Skip-Thought Vectors]]
  * [[https://arxiv.org/pdf/1511.01432.pdf|Dai et al 2015 - Semi-supervised Sequence Learning]]
  * [[https://arxiv.org/pdf/1705.00108|Peters et al 2017 - Semi-supervised Sequence Tagging with Bidirectional Language Models]]
  * [[https://arxiv.org/pdf/1611.02683.pdf|Ramachandran et al 2017 - Unsupervised Pretraining for Sequence to Sequence Learning]]
  * [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * [[https://arxiv.org/pdf/1810.04805.pdf|Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]
  * [[https://arxiv.org/pdf/2004.10964.pdf|Gururangan et al 2020 - Don't Stop Pretraining: Adapt Language Models to Domains and Tasks]]
  * [[https://arxiv.org/pdf/2011.04946.pdf|Zhang et al 2020 - When Do You Need Billions of Words of Pretraining Data?]]
  * [[https://arxiv.org/pdf/2209.14389.pdf|Krishna et al 2022 - Downstream Datasets Make Surprisingly Good Pretraining Corpora]]
  * [[https://arxiv.org/pdf/2305.06677.pdf|Renduchintala et al 2023 - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models]]
  * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]

===== Contextualized Pretrained Models =====

Papers sorted chronologically. For a large list of pre-trained models, see [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].

  * CoVe: [[https://arxiv.org/pdf/1708.00107.pdf|McCann et al 2017 - Learned in Translation: Contextualized Word Vectors]]
  * ULMFiT: [[https://arxiv.org/pdf/1801.06146|Howard & Ruder 2018 - Universal Language Model Fine-tuning for Text Classification]]
  * ELMo: [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * GPT: [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]]
  * BERT: [[https://arxiv.org/pdf/1810.04805.pdf|Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]] [[https://github.com/google-research/bert|original github]]
  * XLM: [[https://arxiv.org/pdf/1901.07291.pdf|Lample et al 2019 - Cross-lingual Language Model Pretraining]]
  * GPT-2: [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|Radford et al 2019 - Language Models are Unsupervised Multitask Learners]] [[https://github.com/openai/gpt-2|original github]] [[https://amaarora.github.io/2020/02/18/annotatedGPT2.html|Annotated GPT-2]] [[https://jalammar.github.io/illustrated-gpt2/|Illustrated GPT-2]] Interestingly, GPT-2 does //not// include a bias term in the final linear layer for the vocab; see [[https://github.com/openai/gpt-2/blob/master/src/model.py#L171|here]] and [[https://github.com/huggingface/transformers/blob/v4.19.2/src/transformers/models/gpt2/modeling_gpt2.py#L951|here]].
  * MASS: [[https://arxiv.org/pdf/1905.02450.pdf|Song et al 2019 - MASS: Masked Sequence to Sequence Pre-training for Language Generation]]
  * XLNet: [[https://arxiv.org/pdf/1906.08237.pdf|Yang et al 2019 - XLNet: Generalized Autoregressive Pretraining for Language Understanding]]
  * RoBERTa: [[https://arxiv.org/pdf/1907.11692.pdf|Liu et al 2019 - RoBERTa: A Robustly Optimized BERT Pretraining Approach]]
  * CTRL: [[https://arxiv.org/pdf/1909.05858.pdf|Keskar et al 2019 - CTRL: A Conditional Transformer Language Model for Controllable Generation]]
  * ALBERT: [[https://arxiv.org/pdf/1909.11942.pdf|Lan et al 2019 - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations]]
  * T5: [[https://arxiv.org/pdf/1910.10683.pdf|Raffel et al 2019 - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer]]
  * BART: [[https://arxiv.org/pdf/1910.13461.pdf|Lewis et al 2020 - BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension]]
  * XLM-R: [[https://arxiv.org/pdf/1911.02116.pdf|Conneau et al 2019 - Unsupervised Cross-lingual Representation Learning at Scale]]
  * ELECTRA: [[https://arxiv.org/pdf/2003.10555.pdf|Clark et al 2020 - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators]]
  * Longformer: [[https://arxiv.org/pdf/2004.05150.pdf|Beltagy et al 2020 - Longformer: The Long-Document Transformer]]
  * MPNet: [[https://arxiv.org/pdf/2004.09297.pdf|Song et al 2020 - MPNet: Masked and Permuted Pre-training for Language Understanding]]
  * GPT-3: [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]]
  * DeBERTa: [[https://arxiv.org/pdf/2006.03654.pdf|He et al 2020 - DeBERTa: Decoding-enhanced BERT with Disentangled Attention]] "Improves the BERT and RoBERTa models using two novel techniques."
  * MARGE: [[https://arxiv.org/pdf/2006.15020.pdf|Lewis 2020 - Pre-training via Paraphrasing]]
  * BigBird: [[https://arxiv.org/pdf/2007.14062.pdf|Zaheer et al 2020 - Big Bird: Transformers for Longer Sequences]]
  * ConvBERT: [[https://arxiv.org/pdf/2008.02496.pdf|Jiang et al 2020 - ConvBERT: Improving BERT with Span-based Dynamic Convolution]]
  * Switch Transformer: [[https://arxiv.org/pdf/2101.03961.pdf|Fedus et al 2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity]]
  * Muppet: [[https://arxiv.org/pdf/2101.11038.pdf|Aghajanyan et al 2021 - Muppet: Massive Multi-task Representations with Pre-Finetuning]]
  * [[https://arxiv.org/pdf/2105.03322.pdf|Tay et al 2021 - Are Pre-trained Convolutions Better than Pre-trained Transformers?]]
  * XLM-E: [[https://aclanthology.org/2022.acl-long.427.pdf|Chi et al 2022 - XLM-E: Cross-lingual Language Model Pre-training via ELECTRA]]
  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]]

===== Table of Large Models =====

List of popular models in chronological order. See also the list of [[nlp:language_model#Large Language Models]].

^ Model ^ Year ^ Type ^ Parameters ^ Training Data ^ Objective ^ Public? ^ Notes ^ Link ^
| BERT | 2018 | Enc | 340M (large) | BooksCorpus + English Wikipedia | MLM + NSP | Yes | | [[https://github.com/google-research/bert|github]] |
| [[https://arxiv.org/pdf/1910.10683.pdf|T5]] | 2019 | Enc-Dec | 11B | [[https://www.tensorflow.org/datasets/catalog/c4|C4]] | Span corruption | Yes | | [[https://github.com/google-research/text-to-text-transfer-transformer|github]] |
| BART | 2019 | Enc-Dec | | | Denoising | Yes | | |

===== Fine-Tuning Methods =====

Moved to [[ml:Fine-Tuning]].

===== Other Papers =====

  * [[https://arxiv.org/pdf/2108.07258.pdf|Bommasani et al 2021 - On the Opportunities and Risks of Foundation Models]] Talks about the benefits and potential issues of pretrained models.
  * [[https://arxiv.org/pdf/2206.10139.pdf|Wu et al 2022 - Insights into Pre-training via Simpler Synthetic Tasks]]

===== Complex Pre-training Methods =====

  * [[https://arxiv.org/pdf/2212.10449.pdf|Pagnoni et al 2022 - Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization]] Pre-training on generated questions.

===== Taxonomy of Pretraining Methods =====

{{media:pretraining_taxonomy.png}} \\ Figure from [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]].

{{media:pretrain_models.png}} \\ Figure from [[https://arxiv.org/pdf/2003.07278.pdf|Liu 2020]].

{{media:pretrain_models_2.png}} \\ Figure from [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]].

{{media:pretrain_objectives.png}} \\ Figure from [[https://arxiv.org/pdf/2003.07278.pdf|Liu 2020]].

Key:
  * LM: language modeling
  * MLM: masked language modeling
  * NSP: next sentence prediction
  * SOP: sentence order prediction
  * Discriminator (o/r): predict for each word whether it was replaced (r) or is the original (o)
  * seq2seq LM: given a prefix of words in a sequence, predict the rest of the sequence
  * Span Mask: predict masked words, where the masked words are contiguous (a span)
  * Text Infilling: spans of words are replaced with a single mask token; the model must predict all the words in the masked span
  * Sent Shuffling: unshuffle a shuffled sentence
  * TLM (Translation Language Modeling): tokens in both the source and target sequences are masked, for learning cross-lingual associations

===== Properties of Pretrained Models =====

  * [[https://arxiv.org/pdf/2007.06778.pdf|Tu et al 2020 - An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models]]

====== Pretraining Methodology ======

See also [[ml:scaling laws]].
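As a concrete companion to the objective key above, here is a minimal sketch of BERT-style masked-token selection. The 15% masking rate and the 80/10/10 split (replace with mask / replace with random token / keep unchanged) follow the BERT paper; the `MASK_ID` value and the function name are illustrative, not taken from any particular implementation.

```python
import random

MASK_ID = 103  # illustrative [MASK] token id

def mlm_corrupt(token_ids, vocab_size, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM prediction targets.
    Of the selected positions: 80% are replaced by [MASK], 10% by a
    random token, and 10% are left unchanged. Returns (inputs, labels),
    with labels set to -100 (ignore) at non-target positions."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # not a prediction target
        labels[i] = tok  # the model must reconstruct the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # else: 10%: keep the original token
    return inputs, labels
```

Span Mask and Text Infilling differ mainly in how the target positions are chosen (contiguous spans, possibly collapsed to a single mask token), not in this basic corrupt-and-predict pattern.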
  * **Blog posts**:
    * [[https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness|Yi Tay - Training great LLMs entirely from ground zero in the wilderness as a startup]]
  * **Papers**
    * [[https://arxiv.org/pdf/1906.06669.pdf|Komatsuzaki 2019 - One Epoch Is All You Need]]
    * [[https://arxiv.org/pdf/2104.07705.pdf|Izsak et al 2021 - How to Train BERT with an Academic Budget]]
    * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
    * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
    * [[https://arxiv.org/pdf/2410.23261|Khandelwal et al 2024 - $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources]]
  * **System Descriptions**
    * The following papers contain very useful descriptions of LLM pretraining methods and issues:
    * [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] Discusses loss spikes, etc.
    * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
    * [[https://arxiv.org/pdf/2303.08774|OpenAI 2023 - GPT-4 Technical Report]]
    * [[https://arxiv.org/pdf/2305.10403.pdf|Google 2023 - PaLM 2 Technical Report]] Talks about scaling laws, etc.
    * [[https://arxiv.org/pdf/2309.16609|Bai et al 2023 - Qwen Technical Report]] Good information.
    * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
    * [[https://arxiv.org/pdf/2401.12246.pdf|2024 - Orion-14B: Open-source Multilingual Large Language Models]]
    * [[https://arxiv.org/pdf/2402.00838.pdf|Groeneveld et al 2024 - OLMo: Accelerating the Science of Language Models]]
    * [[https://arxiv.org/pdf/2403.17297|Cai et al 2024 - InternLM2 Technical Report]] Open and meticulously detailed.
    * [[https://arxiv.org/pdf/2407.21783|Llama Team 2024 - The Llama 3 Herd of Models]]
    * [[https://arxiv.org/pdf/2412.19437|2024 - DeepSeek-V3 Technical Report]]

===== Amount, Selection and Cleaning of Pretraining Data =====

  * **Overviews**
    * [[https://arxiv.org/pdf/2305.13169.pdf|Longpre et al 2023 - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity]]
  * **Amount**
    * [[https://arxiv.org/pdf/2011.04946.pdf|Zhang et al 2020 - When Do You Need Billions of Words of Pretraining Data?]]
    * [[https://arxiv.org/pdf/2109.03160.pdf|2021 - How much pretraining data do language models need to learn syntax?]]
  * **Selection**
    * [[https://arxiv.org/pdf/2209.14389.pdf|Krishna et al 2022 - Downstream Datasets Make Surprisingly Good Pretraining Corpora]]
    * **[[https://arxiv.org/pdf/2302.03169.pdf|Xie et al 2023 - Data Selection for Language Models via Importance Resampling]]** The advantage of this method is that it is very fast: data selection from the Pile can be done in 4 hours on a single computer.
    * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]
    * [[https://arxiv.org/pdf/2305.10429.pdf|Xie et al 2023 - DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining]]
    * [[https://arxiv.org/pdf/2309.04564|Marion et al 2023 - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale]]
    * [[https://arxiv.org/pdf/2403.16952|Ye et al 2024 - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance]]
    * **[[https://arxiv.org/pdf/2404.07177.pdf|Goyal et al 2024 - Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic]]**
    * [[https://arxiv.org/pdf/2502.15950|Belenki et al 2025 - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models]]

===== Pretraining On An Academic Budget =====

Papers or projects where people have pretrained LLMs with academic compute budgets.

  * [[https://arxiv.org/pdf/2104.07705.pdf|Izsak et al 2021 - How to Train BERT with an Academic Budget]]
  * [[https://arxiv.org/pdf/2212.14034.pdf|Geiping & Goldstein 2022 - Training a Language Model on a Single GPU in One Day]]
  * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
  * [[https://github.com/PiotrNawrot/nanoT5|nanoT5]]
  * [[https://github.com/karpathy/nanoGPT|nanoGPT]] Reproduces GPT-2 (124M) on OpenWebText, running on a single 8xA100 40GB node in about 4 days of training.
  * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
  * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
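A rough way to sanity-check budgets like the ones above is the common C ≈ 6·N·D rule of thumb (about 6 FLOPs per parameter per token, forward plus backward). The sketch below turns this into a wall-clock estimate; the peak-throughput and MFU (model FLOPs utilization) values are illustrative assumptions, not measurements, and the 300B-token figure in the usage comment is likewise an assumed example.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute with the C ~= 6*N*D rule of
    thumb (forward + backward FLOPs per parameter per token)."""
    return 6.0 * n_params * n_tokens

def est_days(n_params: float, n_tokens: float, n_gpus: int = 8,
             peak_flops: float = 312e12, mfu: float = 0.3) -> float:
    """Estimated wall-clock training days. peak_flops defaults to an
    A100's bf16 peak (~312 TFLOPS); mfu=0.3 is an assumed, plausible
    sustained utilization, not a measured one."""
    sustained = n_gpus * peak_flops * mfu  # effective FLOP/s across GPUs
    return train_flops(n_params, n_tokens) / sustained / 86400.0

# e.g. a 124M-parameter model on an assumed 300B tokens with 8 GPUs:
# est_days(124e6, 300e9)  # on the order of a few days
```

Under these assumptions the estimate lands in the same ballpark as the nanoGPT figure above (a 124M model, one 8-GPU node, days of training), which is the kind of back-of-the-envelope check such a function is for.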
====== Software ======

  * [[https://github.com/EleutherAI/gpt-neo|GPT Neo]] An open-source implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.
  * [[https://huggingface.co/docs/transformers/index|Huggingface Transformers library]] Has a large number of pre-trained models; see the list in the github repo [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].

====== Related Pages ======

  * [[BERT and friends]]
  * [[ml:Catastrophic Forgetting]]
  * [[ml:Fine-Tuning]]
  * [[Language Model]]
  * [[nlp:vision_and_language#Multimodal Pretraining]]
  * [[Prompting]]
  * [[ml:Semi-supervised Learning]]
  * [[Word Embeddings]]