===== Overviews =====
See also [[language_model#overviews|Language Model - Overviews]].
  * **[[https://arxiv.org/pdf/2003.07278.pdf|Liu et al 2020 - A Survey on Contextual Embeddings]]**
  * [[https://arxiv.org/pdf/2003.08271.pdf|Qiu et al 2020 - Pre-trained Models for Natural Language Processing: A Survey]] Nice tables of pretraining methods on pages 9 and 10; see [[Pretraining#Taxonomy of Pretraining Methods]] below.
===== Key and Early Papers =====
For a history, see section 2.4 of [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]] or the related work in the [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2 paper]].
  * [[https://arxiv.org/pdf/1103.0398|Collobert et al 2011 - Natural Language Processing (almost) from Scratch]]
  * [[https://arxiv.org/pdf/1506.06726|Kiros et al 2015 - Skip-Thought Vectors]]
  * [[https://arxiv.org/pdf/1511.01432.pdf|Dai et al 2015 - Semi-supervised Sequence Learning]]
  * [[https://arxiv.org/pdf/1705.00108|Peters et al 2017 - Semi-supervised Sequence Tagging with Bidirectional Language Models]]
  * [[https://arxiv.org/pdf/1611.02683.pdf|Ramachandran et al 2017 - Unsupervised Pretraining for Sequence to Sequence Learning]]
  * [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
Papers sorted chronologically. For a large list of pre-trained models, see [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].
  * CoVe: [[https://arxiv.org/pdf/1708.00107.pdf|McCann et al 2017 - Learned in Translation: Contextualized Word Vectors]]
  * ULMFiT: [[https://arxiv.org/pdf/1801.06146|Howard & Ruder 2018 - Universal Language Model Fine-tuning for Text Classification]]
  * ELMo: [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * GPT: [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]]
    * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They show a 7B model can be pretrained from scratch on a single RTX 4090.
    * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
    * [[https://arxiv.org/pdf/2410.23261|Khandelwal et al 2024 - $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources]]
  * **System Descriptions**
    * The following papers contain very useful descriptions of LLM pretraining methods and issues:
    * [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] Discusses loss spikes, etc.
    * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
    * [[https://arxiv.org/pdf/2303.08774|OpenAI 2023 - GPT-4 Technical Report]]
    * [[https://arxiv.org/pdf/2305.10403.pdf|Google 2023 - PaLM 2 Technical Report]] Talks about scaling laws, etc.
    * [[https://arxiv.org/pdf/2309.16609|Bai et al 2023 - Qwen Technical Report]] Good information.
    * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
    * [[https://arxiv.org/pdf/2401.12246.pdf|2024 - Orion-14B: Open-source Multilingual Large Language Models]]
    * [[https://arxiv.org/pdf/2402.00838.pdf|Groeneveld et al 2024 - OLMo: Accelerating the Science of Language Models]]
    * [[https://arxiv.org/pdf/2403.17297|Cai et al 2024 - InternLM2 Technical Report]] Open and meticulously detailed.
    * [[https://arxiv.org/pdf/2407.21783|Llama Team 2024 - The Llama 3 Herd of Models]]
    * [[https://arxiv.org/pdf/2412.19437|2024 - DeepSeek-V3 Technical Report]]
  
===== Amount, Selection and Cleaning of Pretraining Data =====
    * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]
    * [[https://arxiv.org/pdf/2305.10429.pdf|Xie et al 2023 - DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining]]
    * [[https://arxiv.org/pdf/2309.04564|Marion et al 2023 - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale]]
    * [[https://arxiv.org/pdf/2403.16952|Ye et al 2024 - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance]]
    * **[[https://arxiv.org/pdf/2404.07177.pdf|Goyal et al 2024 - Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic]]**
    * [[https://arxiv.org/pdf/2502.15950|Belenki et al 2025 - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models]]
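The core move in DoReMi's mixture optimization can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it omits training the reference and proxy models and the smoothing/clipping DoReMi applies, and the function name and step size ''eta'' are my own. Domains where the proxy model's loss most exceeds the reference model's loss get upweighted by an exponentiated-gradient step:

```python
import numpy as np

def update_mixture_weights(alpha, excess_loss, eta=1.0):
    """One exponentiated-gradient step on per-domain mixture weights.

    alpha: current domain weights (non-negative, summing to 1).
    excess_loss: per-domain (proxy loss - reference loss); domains the
    proxy handles worst relative to the reference are upweighted.
    """
    alpha = np.asarray(alpha, dtype=float) * np.exp(eta * np.asarray(excess_loss))
    return alpha / alpha.sum()  # renormalize to a distribution

# Toy example: three domains starting from a uniform mixture; the
# second domain has the largest excess loss, so its weight grows.
weights = update_mixture_weights([1/3, 1/3, 1/3], [0.1, 0.8, 0.2])
```

In the full method this update runs repeatedly while the proxy model trains, and the averaged weights are then used to sample the real pretraining mixture.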
  
  
  * [[https://github.com/PiotrNawrot/nanoT5|nanoT5]]
  * [[https://github.com/karpathy/nanoGPT|nanoGPT]] Reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training.
  * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
  * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They show a 7B model can be pretrained from scratch on a single RTX 4090.
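To make the GaLore idea concrete, here is a minimal NumPy sketch (my own illustration under simplifying assumptions, not the paper's code): the gradient of a weight matrix is projected onto its top-''rank'' left singular directions, the update is formed in that small space (where a real implementation would also keep the Adam moments, which is where the memory saving comes from), and then projected back. GaLore also refreshes the projection only periodically rather than every step:

```python
import numpy as np

def galore_style_sgd_step(W, G, rank, lr=0.01):
    """Sketch of one GaLore-style update, using plain SGD for clarity.

    W: (m, n) weight matrix; G: (m, n) full gradient.
    The gradient is compressed to a (rank, n) matrix via the top
    left-singular vectors of G, then the update is projected back.
    """
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :rank]              # (m, rank) orthonormal projection
    G_low = P.T @ G              # (rank, n) low-rank gradient
    return W - lr * (P @ G_low)  # project the update back to (m, n)
```

With ''rank'' equal to the full rank of the gradient, the projection is lossless and this reduces to an ordinary SGD step, which is a quick sanity check on the sketch.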
  
nlp/pretraining.1722569520.txt.gz · Last modified: 2024/08/02 03:32 by jmflanig
