nlp:pretraining
Table of Contents
Pretraining
Overviews
See also Language Model - Overviews.
- Qiu et al 2020 - Pre-trained Models for Natural Language Processing: A Survey. Nice tables of pretraining methods on pages 9 and 10; see Taxonomy of Pretraining Methods below.
Key and Early Papers
For a history, see section 2.4 of Qiu 2020 or the related work in the GPT-2 paper.
Contextualized Pretrained Models
Papers sorted chronologically. For a large list of pre-trained models, see here.
- GPT-2: Radford et al 2019 - Language Models are Unsupervised Multitask Learners (original github, Annotated GPT-2, Illustrated GPT-2). Interestingly, GPT-2 does not include a bias term in the final linear layer for the vocabulary; see here and here.
- DeBERTa: He et al 2020 - DeBERTa: Decoding-enhanced BERT with Disentangled Attention “Improves the BERT and RoBERTa models using two novel techniques.”
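The bias-free vocabulary head noted above is easy to see in a small sketch: with tied weights, the logits are just the final hidden states multiplied by the transpose of the token-embedding matrix, with no additive bias term. (A numpy stand-in with toy shapes, not the actual GPT-2 code.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 1000, 64, 4  # toy sizes

# Token-embedding matrix; GPT-2 ties the output head to this.
wte = rng.normal(size=(vocab_size, d_model))

# Final transformer hidden states for a short sequence.
hidden = rng.normal(size=(seq_len, d_model))

# Output projection: tied weights, no bias term -- logits = h @ wte^T.
logits = hidden @ wte.T

print(logits.shape)  # (4, 1000)
```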
Table of Large Models
List of popular models in chronological order. See also the list of Large Language Models.
Fine-Tuning Methods
Moved to Fine-Tuning.
Other Papers
- Bommasani et al 2021 - On the Opportunities and Risks of Foundation Models Talks about the benefits and potential issues with pretrained models.
Complex Pre-training Methods
- Pagnoni et al 2022 - Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization - Pre-training on generated questions
Taxonomy of Pretraining Methods
Figure from Qiu 2020.
Figure from Liu 2020. Key:
- LM: language modeling
- MLM: masked language modeling
- NSP: next sentence prediction
- SOP: sentence order prediction
- Discriminator (o/r): predict for each word if it was replaced (r) or not (o, original)
- seq2seq LM: given a prefix of words in a sequence, predict the rest of the sequence
- Span Mask: predict masked words, where the masked words are contiguous (a span)
- Text Infilling: each span of words is replaced with a single mask token; the model must predict all the words in the masked span.
- Sent shuffling: Unshuffle a shuffled sentence
- TLM: (Translation Language Modeling) Tokens in both source and target sequences are masked for learning cross-lingual association.
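The difference between ordinary MLM, span masking, and text infilling can be illustrated on a toy token list (simplified: real implementations work on subword vocabularies and typically mask about 15% of tokens):

```python
import random

MASK = "[MASK]"
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def mlm_mask(toks, p=0.3, seed=0):
    """MLM: independently replace each token with [MASK] with probability p."""
    rng = random.Random(seed)
    return [MASK if rng.random() < p else t for t in toks]

def span_mask(toks, start, length):
    """Span Mask: mask a contiguous span, one [MASK] per hidden token."""
    return toks[:start] + [MASK] * length + toks[start + length:]

def text_infilling(toks, start, length):
    """Text Infilling: replace the whole span with a single [MASK]."""
    return toks[:start] + [MASK] + toks[start + length:]

print(span_mask(tokens, 1, 2))       # ['the', '[MASK]', '[MASK]', 'on', 'the', 'mat']
print(text_infilling(tokens, 1, 2))  # ['the', '[MASK]', 'on', 'the', 'mat']
```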
Properties of Pretrained Models
Pretraining Methodology
See also scaling laws.
- Blog posts:
- Papers
- Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
- System Descriptions
- The following papers contain very useful descriptions of LLM pretraining methods and issues
- Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models Discusses loss spikes, etc.
- Google 2023 - PaLM 2 Technical Report Talks about scaling laws, etc.
- Bai et al 2023 - Qwen Technical Report Good information
- Cai et al 2024 - InternLM2 Technical Report Open and meticulously detailed
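As a worked example of the scaling-law arithmetic these reports discuss, two widely cited rules of thumb are that training compute is roughly 6·N·D FLOPs for N parameters and D tokens, and that compute-optimal training (the Chinchilla result) wants on the order of 20 tokens per parameter. (Back-of-the-envelope approximations only.)

```python
def train_flops(n_params, n_tokens):
    """Standard approximation: training compute ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

n = 7_000_000_000            # a 7B-parameter model
d = chinchilla_tokens(n)     # -> 140B tokens
print(d, train_flops(n, d))  # ~5.9e21 FLOPs for the full run
```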
Amount, Selection and Cleaning of Pretraining Data
- Overviews
- Amount
- Selection
- Xie et al 2023 - Data Selection for Language Models via Importance Resampling The advantage of this method is that it is very fast: data selection from the Pile can be done in 4 hours on a single computer.
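A toy sketch of the importance-resampling idea (not the paper's hashed-n-gram implementation): score each raw example by how much more likely it is under a target-domain distribution than under the raw pool, then resample with probability proportional to that ratio. The log-probabilities below are made-up numbers for illustration.

```python
import math
import random

# Hypothetical per-example log-probabilities under a target-domain model
# and under a model of the raw pool.  In DSIR these come from hashed
# n-gram features; here they are just invented values.
examples    = ["doc_a", "doc_b", "doc_c", "doc_d"]
logp_target = [-2.0, -5.0, -1.5, -4.0]
logp_raw    = [-3.0, -3.0, -3.0, -3.0]

# Importance weight = p_target(x) / p_raw(x).
weights = [math.exp(t - r) for t, r in zip(logp_target, logp_raw)]

# Resample k examples with probability proportional to the weights.
rng = random.Random(0)
selected = rng.choices(examples, weights=weights, k=2)
print(selected)
```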
Pretraining On An Academic Budget
Papers or projects where people have pretrained LLMs with academic compute budgets.
- nanoGPT Reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training.
- Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Like LoRA, but can be used for pre-training as well. They pretrain a 7B model from scratch on an RTX 4090.
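The projection trick that makes GaLore memory-efficient can be sketched in a few lines of numpy (plain-SGD stand-in under simplifying assumptions; the actual method keeps Adam state in the low-rank space and refreshes the projector only periodically):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 64, 32, 4

W = rng.normal(size=(m, n))  # weight matrix
G = rng.normal(size=(m, n))  # its full-rank gradient

# Projector from the top-r left singular vectors of the gradient
# (GaLore recomputes this only every few hundred steps).
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :rank]              # (m, r), orthonormal columns

# Optimizer state lives in the small (r x n) space...
G_low = P.T @ G              # project the gradient down
update = 0.01 * G_low        # stand-in for an Adam step

# ...and the update is projected back up before applying it.
W -= P @ update

print(P.shape, G_low.shape)  # (64, 4) (4, 32)
```

The memory saving comes from storing optimizer moments for the (r × n) projected gradient instead of the full (m × n) matrix.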
Software
- GPT Neo An open-source implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
- The Hugging Face Transformers library has a large number of pre-trained models; you can see a list in the github repo here
Related Pages
nlp/pretraining.txt · Last modified: 2026/02/20 06:35 by jmflanig