===== Overviews =====
See also [[language_model#overviews|Language Model - Overviews]].
  * **[[https://arxiv.org/pdf/2003.07278.pdf|Liu et al 2020 - A Survey on Contextual Embeddings]]**
  * [[https://arxiv.org/pdf/2003.08271.pdf|Qiu et al 2020 - Pre-trained Models for Natural Language Processing: A Survey]] Nice tables of pretraining methods on pages 9 and 10; see [[Pretraining#Taxonomy of Pretraining Methods]] below.
===== Key and Early Papers =====
For a history, see section 2.4 of [[https://arxiv.org/pdf/2003.08271.pdf|Qiu 2020]] or the related work in the [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2 paper]].
  * [[https://arxiv.org/pdf/1103.0398|Collobert et al 2011 - Natural Language Processing (almost) from Scratch]]
  * [[https://arxiv.org/pdf/1506.06726|Kiros et al 2015 - Skip-Thought Vectors]]
  * [[https://arxiv.org/pdf/1511.01432.pdf|Dai et al 2015 - Semi-supervised Sequence Learning]]
  * [[https://arxiv.org/pdf/1705.00108|Peters et al 2017 - Semi-supervised Sequence Tagging with Bidirectional Language Models]]
  * [[https://arxiv.org/pdf/1611.02683.pdf|Ramachandran et al 2017 - Unsupervised Pretraining for Sequence to Sequence Learning]]
  * [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
Papers sorted chronologically. For a large list of pre-trained models, see [[https://github.com/huggingface/transformers/tree/master/src/transformers/models|here]].
  * CoVe: [[https://arxiv.org/pdf/1708.00107.pdf|McCann et al 2017 - Learned in Translation: Contextualized Word Vectors]]
  * ULMFiT: [[https://arxiv.org/pdf/1801.06146|Howard & Ruder 2018 - Universal Language Model Fine-tuning for Text Classification]]
  * ELMo: [[https://arxiv.org/pdf/1802.05365.pdf|Peters et al 2018 - Deep Contextualized Word Representations]]
  * GPT: [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]]
    * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They show a 7B model can be pretrained from scratch on a single RTX 4090.
    * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
    * [[https://arxiv.org/pdf/2410.23261|Khandelwal et al 2024 - $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources]]
  * **System Descriptions**
    * The following papers contain very useful descriptions of LLM pretraining methods and issues:
    * [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] Discusses loss spikes, etc.
    * [[https://arxiv.org/pdf/2304.08442.pdf|Kaddour 2023 - The MiniPile Challenge for Data-Efficient Language Models]]
    * [[https://arxiv.org/pdf/2303.08774|OpenAI 2023 - GPT-4 Technical Report]]
    * [[https://arxiv.org/pdf/2305.10403.pdf|Google 2023 - PaLM 2 Technical Report]] Talks about scaling laws, etc.
    * [[https://arxiv.org/pdf/2309.16609|Bai et al 2023 - Qwen Technical Report]] Good information.
    * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
    * [[https://arxiv.org/pdf/2401.12246.pdf|2024 - Orion-14B: Open-source Multilingual Large Language Models]]
    * [[https://arxiv.org/pdf/2402.00838.pdf|Groeneveld et al 2024 - OLMo: Accelerating the Science of Language Models]]
    * [[https://arxiv.org/pdf/2403.17297|Cai et al 2024 - InternLM2 Technical Report]] Open and meticulously detailed.
    * [[https://arxiv.org/pdf/2407.21783|Llama Team 2024 - The Llama 3 Herd of Models]]
    * [[https://arxiv.org/pdf/2412.19437|2024 - DeepSeek-V3 Technical Report]]
  
===== Amount, Selection and Cleaning of Pretraining Data =====
    * [[https://arxiv.org/pdf/2305.12816.pdf|Wang et al 2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model]]
    * [[https://arxiv.org/pdf/2305.10429.pdf|Xie et al 2023 - DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining]]
    * [[https://arxiv.org/pdf/2309.04564|Marion et al 2023 - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale]]
    * [[https://arxiv.org/pdf/2403.16952|Ye et al 2024 - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance]]
    * **[[https://arxiv.org/pdf/2404.07177.pdf|Goyal et al 2024 - Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic]]**
    * [[https://arxiv.org/pdf/2502.15950|Belenki et al 2025 - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models]]
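The core move in DoReMi's mixture optimization can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it omits training the reference and proxy models and the smoothing/clipping DoReMi applies, and the function name and step size ''eta'' are my own. Domains where the proxy model's loss most exceeds the reference model's loss get upweighted by an exponentiated-gradient step:

```python
import numpy as np

def update_mixture_weights(alpha, excess_loss, eta=1.0):
    """One exponentiated-gradient step on per-domain mixture weights.

    alpha: current domain weights (non-negative, summing to 1).
    excess_loss: per-domain (proxy loss - reference loss); domains the
    proxy handles worst relative to the reference are upweighted.
    """
    alpha = np.asarray(alpha, dtype=float) * np.exp(eta * np.asarray(excess_loss))
    return alpha / alpha.sum()  # renormalize to a distribution

# Toy example: three domains starting from a uniform mixture; the
# second domain has the largest excess loss, so its weight grows.
weights = update_mixture_weights([1/3, 1/3, 1/3], [0.1, 0.8, 0.2])
```

In the full method this update runs repeatedly while the proxy model trains, and the averaged weights are then used to sample the real pretraining mixture.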
  
  
  * [[https://github.com/PiotrNawrot/nanoT5|nanoT5]]
  * [[https://github.com/karpathy/nanoGPT|nanoGPT]] Reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training.
  * [[https://arxiv.org/pdf/2401.02385|Zhang et al 2024 - TinyLlama: An Open-Source Small Language Model]]
  * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can be used for pre-training as well. They show a 7B model can be pretrained from scratch on a single RTX 4090.
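To make the GaLore idea concrete, here is a minimal NumPy sketch (my own illustration under simplifying assumptions, not the paper's code): the gradient of a weight matrix is projected onto its top-''rank'' left singular directions, the update is formed in that small space (where a real implementation would also keep the Adam moments, which is where the memory saving comes from), and then projected back. GaLore also refreshes the projection only periodically rather than every step:

```python
import numpy as np

def galore_style_sgd_step(W, G, rank, lr=0.01):
    """Sketch of one GaLore-style update, using plain SGD for clarity.

    W: (m, n) weight matrix; G: (m, n) full gradient.
    The gradient is compressed to a (rank, n) matrix via the top
    left-singular vectors of G, then the update is projected back.
    """
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :rank]              # (m, rank) orthonormal projection
    G_low = P.T @ G              # (rank, n) low-rank gradient
    return W - lr * (P @ G_low)  # project the update back to (m, n)
```

With ''rank'' equal to the full rank of the gradient, the projection is lossless and this reduces to an ordinary SGD step, which is a quick sanity check on the sketch.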
  
nlp/pretraining.1722569520.txt.gz · Last modified: 2024/08/02 03:32 by jmflanig
