====== Pretraining ======

===== Overviews =====

See also [[language_model#overviews|Language Model - Overviews]].

  * **[[https://arxiv.org/pdf/2003.07278.pdf|Liu et al 2020 - A Survey on Contextual Embeddings]]**
  * [[https://arxiv.org/pdf/2003.08271.pdf|Qiu et al 2020 - Pre-trained Models for Natural Language Processing: A Survey]] Nice tables of pretraining methods on pages 9 and 10; see [[Pretraining#Taxonomy of Pretraining Methods]] below.
  * **[[https://arxiv.org/pdf/2303.18223.pdf|Zhao et al 2023 - A Survey of Large Language Models]]**

===== Key and Early Papers =====

For a history, see section 2.4 of [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]] or the related work in the [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2 paper]].

  * [[https://arxiv.org/pdf/1103.0398|Collobert et al 2011 - Natural Language Processing (almost) from Scratch]]
  * [[https://arxiv.org/pdf/1506.06726|Kiros et al 2015 - Skip-Thought Vectors]]
  * [[https://arxiv.org/pdf/1511.01432.pdf|Dai et al 2015 - Semi-supervised Sequence Learning]]
  * [[https://arxiv.org/pdf/1705.00108|Peters et al 2017 - Semi-supervised Sequence Tagging with Bidirectional Language Models]]
  * [[https://arxiv.org/pdf/1611.02683.pdf|Ramachandran et al 2017 - Unsupervised Pretraining for Sequence to Sequence Learning]]
  * [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * [[https://arxiv.org/pdf/1810.04805.pdf|Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]
  * [[https://arxiv.org/pdf/2004.10964.pdf|Gururangan et al 2020 - Don't Stop Pretraining: Adapt Language Models to Domains and Tasks]]
  * [[https://arxiv.org/pdf/2011.04946.pdf|Zhang et al 2020 - When Do You Need Billions of Words of Pretraining Data?]]
  * [[https://arxiv.org/pdf/2209.14389.pdf|Krishna et al 2022 - Downstream Datasets Make Surprisingly Good Pretraining Corpora]]
  * [[https://arxiv.org/pdf/2305.06677.pdf|Renduchintala et al 2023 - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models]]
  * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]

===== Contextualized Pretrained Models =====

Papers sorted chronologically. For a large list of pre-trained models, see [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].

  * CoVe: [[https://arxiv.org/pdf/1708.00107.pdf|McCann et al 2017 - Learned in Translation: Contextualized Word Vectors]]
  * ULMFiT: [[https://arxiv.org/pdf/1801.06146|Howard & Ruder 2018 - Universal Language Model Fine-tuning for Text Classification]]
  * ELMo: [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * GPT: [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]]
  * BERT: [[https://arxiv.org/pdf/1810.04805.pdf|Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]] [[https://github.com/google-research/bert|original github]]
  * XLM: [[https://arxiv.org/pdf/1901.07291.pdf|Lample et al 2019 - Cross-lingual Language Model Pretraining]]
  * GPT-2: [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|Radford et al 2019 - Language Models are Unsupervised Multitask Learners]] [[https://github.com/openai/gpt-2|original github]] [[https://amaarora.github.io/2020/02/18/annotatedGPT2.html|Annotated GPT-2]] [[https://jalammar.github.io/illustrated-gpt2/|Illustrated GPT-2]] Interestingly, GPT-2 does //not// include a bias term in the final linear layer for the vocab; see [[https://github.com/openai/gpt-2/blob/master/src/model.py#L171|here]] and [[https://github.com/huggingface/transformers/blob/v4.19.2/src/transformers/models/gpt2/modeling_gpt2.py#L951|here]].
  * MASS: [[https://arxiv.org/pdf/1905.02450.pdf|Song et al 2019 - MASS: Masked Sequence to Sequence Pre-training for Language Generation]]
  * XLNet: [[https://arxiv.org/pdf/1906.08237.pdf|Yang et al 2019 - XLNet: Generalized Autoregressive Pretraining for Language Understanding]]
  * RoBERTa: [[https://arxiv.org/pdf/1907.11692.pdf|Liu et al 2019 - RoBERTa: A Robustly Optimized BERT Pretraining Approach]]
  * CTRL: [[https://arxiv.org/pdf/1909.05858.pdf|Keskar et al 2019 - CTRL: A Conditional Transformer Language Model for Controllable Generation]]
  * ALBERT: [[https://arxiv.org/pdf/1909.11942.pdf|Lan et al 2019 - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations]]
  * T5: [[https://arxiv.org/pdf/1910.10683.pdf|Raffel et al 2019 - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer]]
  * BART: [[https://arxiv.org/pdf/1910.13461.pdf|Lewis et al 2020 - BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension]]
  * XLM-R: [[https://arxiv.org/pdf/1911.02116.pdf|Conneau et al 2019 - Unsupervised Cross-lingual Representation Learning at Scale]]
  * ELECTRA: [[https://arxiv.org/pdf/2003.10555.pdf|Clark et al 2020 - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators]]
  * Longformer: [[https://arxiv.org/pdf/2004.05150.pdf|Beltagy et al 2020 - Longformer: The Long-Document Transformer]]
  * MPNet: [[https://arxiv.org/pdf/2004.09297.pdf|Song et al 2020 - MPNet: Masked and Permuted Pre-training for Language Understanding]]
  * GPT-3: [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]]
  * DeBERTa: [[https://arxiv.org/pdf/2006.03654.pdf|He et al 2020 - DeBERTa: Decoding-enhanced BERT with Disentangled Attention]] "Improves the BERT and RoBERTa models using two novel techniques."
  * MARGE: [[https://arxiv.org/pdf/2006.15020.pdf|Lewis 2020 - Pre-training via Paraphrasing]]
  * BigBird: [[https://arxiv.org/pdf/2007.14062.pdf|Zaheer et al 2020 - Big Bird: Transformers for Longer Sequences]]
  * ConvBERT: [[https://arxiv.org/pdf/2008.02496.pdf|Jiang et al 2020 - ConvBERT: Improving BERT with Span-based Dynamic Convolution]]
  * Switch Transformer: [[https://arxiv.org/pdf/2101.03961.pdf|Fedus et al 2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity]]
  * Muppet: [[https://arxiv.org/pdf/2101.11038.pdf|Aghajanyan et al 2021 - Muppet: Massive Multi-task Representations with Pre-Finetuning]]
  * [[https://arxiv.org/pdf/2105.03322.pdf|Tay et al 2021 - Are Pre-trained Convolutions Better than Pre-trained Transformers?]]
  * XLM-E: [[https://aclanthology.org/2022.acl-long.427.pdf|Chi et al 2022 - XLM-E: Cross-lingual Language Model Pre-training via ELECTRA]]
  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]]

===== Table of Large Models =====

List of popular models in chronological order. See also the list of [[nlp:language_model#Large Language Models]].

^ Model ^ Year ^ Type ^ Parameters ^ Training Data ^ Objective ^ Public? ^ Notes ^ Link ^
| BERT | 2018 | Enc | 340M (large) | BooksCorpus + English Wikipedia | MLM + NSP | Yes | | [[https://github.com/google-research/bert|github]] |
| [[https://arxiv.org/pdf/1910.10683.pdf|T5]] | 2019 | Enc-Dec | 11B | [[https://www.tensorflow.org/datasets/catalog/c4|C4]] | Span corruption | Yes | | [[https://github.com/google-research/text-to-text-transfer-transformer|github]] |
| BART | 2019 | Enc-Dec | | | Denoising | Yes | | |

===== Fine-Tuning Methods =====

Moved to [[ml:Fine-Tuning]].

===== Other Papers =====

  * [[https://arxiv.org/pdf/2108.07258.pdf|Bommasani et al 2021 - On the Opportunities and Risks of Foundation Models]] Talks about the benefits and potential issues of pretrained models.
  * [[https://arxiv.org/pdf/2206.10139.pdf|Wu et al 2022 - Insights into Pre-training via Simpler Synthetic Tasks]]

===== Complex Pre-training Methods =====

  * [[https://arxiv.org/pdf/2212.10449.pdf|Pagnoni et al 2022 - Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization]] Pre-training on generated questions.

===== Taxonomy of Pretraining Methods =====

{{media:pretraining_taxonomy.png}} \\ Figure from [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]].

{{media:pretrain_models.png}} \\ Figure from [[https://arxiv.org/pdf/2003.07278.pdf|Liu 2020]].

{{media:pretrain_models_2.png}} \\ Figure from [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]].

{{media:pretrain_objectives.png}} \\ Figure from [[https://arxiv.org/pdf/2003.07278.pdf|Liu 2020]].

Key:
  * LM: language modeling
  * MLM: masked language modeling
  * NSP: next sentence prediction
  * SOP: sentence order prediction
  * Discriminator (o/r): predict for each word whether it was replaced (r) or is the original (o)
  * seq2seq LM: given a prefix of words in a sequence, predict the rest of the sequence
  * Span Mask: predict masked words, where the masked words are contiguous (a span)
  * Text Infilling: spans of words are replaced with a single mask token; the model must predict all the words in the masked span
  * Sent Shuffling: unshuffle a shuffled sentence
  * TLM (Translation Language Modeling): tokens in both the source and target sequences are masked, for learning cross-lingual associations

===== Properties of Pretrained Models =====

  * [[https://arxiv.org/pdf/2007.06778.pdf|Tu et al 2020 - An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models]]

====== Pretraining Methodology ======

See also [[ml:scaling laws]].
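As a concrete companion to the objective key above, here is a minimal sketch of BERT-style masked-token selection. The 15% masking rate and the 80/10/10 split (replace with mask / replace with random token / keep unchanged) follow the BERT paper; the `MASK_ID` value and the function name are illustrative, not taken from any particular implementation.

```python
import random

MASK_ID = 103  # illustrative [MASK] token id

def mlm_corrupt(token_ids, vocab_size, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM prediction targets.
    Of the selected positions: 80% are replaced by [MASK], 10% by a
    random token, and 10% are left unchanged. Returns (inputs, labels),
    with labels set to -100 (ignore) at non-target positions."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # not a prediction target
        labels[i] = tok  # the model must reconstruct the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # else: 10%: keep the original token
    return inputs, labels
```

Span Mask and Text Infilling differ mainly in how the target positions are chosen (contiguous spans, possibly collapsed to a single mask token), not in this basic corrupt-and-predict pattern.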
  * **Blog posts**:
    * [[https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness|Yi Tay - Training great LLMs entirely from ground zero in the wilderness as a startup]]
  * **Papers**
    * [[https://arxiv.org/pdf/1906.06669.pdf|Komatsuzaki 2019 - One Epoch Is All You Need]]
    * [[https://arxiv.org/pdf/2104.07705.pdf|Izsak et al 2021 - How to Train BERT with an Academic Budget]]
    * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
    * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
    * [[https://arxiv.org/pdf/2410.23261|Khandelwal et al 2024 - $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources]]
  * **System Descriptions**
    * The following papers contain very useful descriptions of LLM pretraining methods and issues:
    * [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] Discusses loss spikes, etc.
    * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
    * [[https://arxiv.org/pdf/2303.08774|OpenAI 2023 - GPT-4 Technical Report]]
    * [[https://arxiv.org/pdf/2305.10403.pdf|Google 2023 - PaLM 2 Technical Report]] Talks about scaling laws, etc.
    * [[https://arxiv.org/pdf/2309.16609|Bai et al 2023 - Qwen Technical Report]] Good information.
    * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
    * [[https://arxiv.org/pdf/2401.12246.pdf|2024 - Orion-14B: Open-source Multilingual Large Language Models]]
    * [[https://arxiv.org/pdf/2402.00838.pdf|Groeneveld et al 2024 - OLMo: Accelerating the Science of Language Models]]
    * [[https://arxiv.org/pdf/2403.17297|Cai et al 2024 - InternLM2 Technical Report]] Open and meticulously detailed.
    * [[https://arxiv.org/pdf/2407.21783|Llama Team 2024 - The Llama 3 Herd of Models]]
    * [[https://arxiv.org/pdf/2412.19437|2024 - DeepSeek-V3 Technical Report]]

===== Amount, Selection and Cleaning of Pretraining Data =====

  * **Overviews**
    * [[https://arxiv.org/pdf/2305.13169.pdf|Longpre et al 2023 - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity]]
  * **Amount**
    * [[https://arxiv.org/pdf/2011.04946.pdf|Zhang et al 2020 - When Do You Need Billions of Words of Pretraining Data?]]
    * [[https://arxiv.org/pdf/2109.03160.pdf|2021 - How much pretraining data do language models need to learn syntax?]]
  * **Selection**
    * [[https://arxiv.org/pdf/2209.14389.pdf|Krishna et al 2022 - Downstream Datasets Make Surprisingly Good Pretraining Corpora]]
    * **[[https://arxiv.org/pdf/2302.03169.pdf|Xie et al 2023 - Data Selection for Language Models via Importance Resampling]]** The advantage of this method is that it is very fast: data selection from the Pile can be done in 4 hours on a single computer.
    * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]
    * [[https://arxiv.org/pdf/2305.10429.pdf|Xie et al 2023 - DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining]]
    * [[https://arxiv.org/pdf/2309.04564|Marion et al 2023 - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale]]
    * [[https://arxiv.org/pdf/2403.16952|Ye et al 2024 - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance]]
    * **[[https://arxiv.org/pdf/2404.07177.pdf|Goyal et al 2024 - Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic]]**
    * [[https://arxiv.org/pdf/2502.15950|Belenki et al 2025 - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models]]

===== Pretraining On An Academic Budget =====

Papers or projects where people have pretrained LLMs with academic compute budgets.

  * [[https://arxiv.org/pdf/2104.07705.pdf|Izsak et al 2021 - How to Train BERT with an Academic Budget]]
  * [[https://arxiv.org/pdf/2212.14034.pdf|Geiping & Goldstein 2022 - Training a Language Model on a Single GPU in One Day]]
  * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
  * [[https://github.com/PiotrNawrot/nanoT5|nanoT5]]
  * [[https://github.com/karpathy/nanoGPT|nanoGPT]] Reproduces GPT-2 (124M) on OpenWebText, running on a single 8xA100 40GB node in about 4 days of training.
  * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
  * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
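A rough way to sanity-check budgets like the ones above is the common C ≈ 6·N·D rule of thumb (about 6 FLOPs per parameter per token, forward plus backward). The sketch below turns this into a wall-clock estimate; the peak-throughput and MFU (model FLOPs utilization) values are illustrative assumptions, not measurements, and the 300B-token figure in the usage comment is likewise an assumed example.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute with the C ~= 6*N*D rule of
    thumb (forward + backward FLOPs per parameter per token)."""
    return 6.0 * n_params * n_tokens

def est_days(n_params: float, n_tokens: float, n_gpus: int = 8,
             peak_flops: float = 312e12, mfu: float = 0.3) -> float:
    """Estimated wall-clock training days. peak_flops defaults to an
    A100's bf16 peak (~312 TFLOPS); mfu=0.3 is an assumed, plausible
    sustained utilization, not a measured one."""
    sustained = n_gpus * peak_flops * mfu  # effective FLOP/s across GPUs
    return train_flops(n_params, n_tokens) / sustained / 86400.0

# e.g. a 124M-parameter model on an assumed 300B tokens with 8 GPUs:
# est_days(124e6, 300e9)  # on the order of a few days
```

Under these assumptions the estimate lands in the same ballpark as the nanoGPT figure above (a 124M model, one 8-GPU node, days of training), which is the kind of back-of-the-envelope check such a function is for.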
====== Software ======

  * [[https://github.com/EleutherAI/gpt-neo|GPT Neo]] An open-source implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.
  * [[https://huggingface.co/docs/transformers/index|Huggingface Transformers library]] Has a large number of pre-trained models; see the list in the github repo [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].

====== Related Pages ======

  * [[BERT and friends]]
  * [[ml:Catastrophic Forgetting]]
  * [[ml:Fine-Tuning]]
  * [[Language Model]]
  * [[nlp:vision_and_language#Multimodal Pretraining]]
  * [[Prompting]]
  * [[ml:Semi-supervised Learning]]
  * [[Word Embeddings]]