====== Scaling Laws ======

Scaling laws are used to pick optimal hyperparameters (model size, data size, compute budget) for large models before committing to a full training run.

===== Papers =====

  * [[https://arxiv.org/pdf/2001.08361.pdf|Kaplan et al 2020 - Scaling Laws for Neural Language Models]]
  * [[https://arxiv.org/pdf/2203.15556.pdf|Hoffmann et al 2022 - Training Compute-Optimal Large Language Models]]
  * [[https://arxiv.org/pdf/2207.10551.pdf|Tay et al 2022 - Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?]]
  * **[[https://arxiv.org/pdf/2405.10938|Ruan et al 2024 - Observational Scaling Laws and the Predictability of Language Model Performance]]** Fits a multi-dimensional regression (a sigmoid) to predict performance across model "families" (LLaMA, GPT-3, etc.).
  * [[https://arxiv.org/pdf/2406.19146|Porian et al 2024 - Resolving Discrepancies in Compute-Optimal Scaling of Language Models]]
  * [[https://arxiv.org/pdf/2410.11840|Choshen et al 2024 - A Hitchhiker's Guide to Scaling Law Estimation]]

==== Training LLMs ====

  * Large models are usually trained with scaling laws in mind (often compute-optimal for deployment rather than for training). See for example:
    * [[https://ai.google/static/documents/palm2techreport.pdf|Google 2023 - PaLM 2 Technical Report]] (see section 2)

==== Emergent Abilities ====

See also [[nlp:Language Model#Origin of Capabilities|Language Model - Origin of Capabilities]].

  * GPT-3: [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]] GPT-3 showed emergent abilities; see for example Fig 3.10.
  * [[https://arxiv.org/pdf/2206.07682|Wei et al 2022 - Emergent Abilities of Large Language Models]]
  * [[https://arxiv.org/pdf/2304.15004|Schaeffer et al 2023 - Are Emergent Abilities of Large Language Models a Mirage?]]
  * **[[https://arxiv.org/pdf/2310.03262|Hu et al 2023 - Predicting Emergent Abilities with Infinite Resolution Evaluation]]** Uses bootstrap resampling to get a very fine-grained measure of model capabilities, resampling until the desired behavior appears a set number of times. Can be used to predict emergent capabilities from very small models that rarely exhibit the desired behavior.
  * [[https://arxiv.org/pdf/2403.15796|Du et al 2024 - Understanding Emergent Abilities of Language Models from the Loss Perspective]]

===== Related Pages =====

  * [[Hyperparameter Tuning]]
  * [[nlp:Language Model]]
  * [[nlp:Language Model#Origin of Capabilities|Language Model - Origin of Capabilities]]
  * [[nlp:pretraining#Pretraining Methodology]]
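
===== Example: Fitting a Parametric Scaling Law =====

As a rough illustration of the kind of fit done in the papers above, the sketch below fits the parametric loss from Hoffmann et al 2022, L(N, D) = E + A/N^alpha + B/D^beta, to synthetic training runs with SciPy. All constants and data here are made up for the demo; they are not taken from any of the papers.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss (Hoffmann et al 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens, E = irreducible loss.
def parametric_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs" on a log grid of model and data sizes.
# The "true" constants below are invented for this demo.
rng = np.random.default_rng(0)
Ns = np.logspace(7, 10, 8)            # 1e7 .. 1e10 parameters
Ds = np.logspace(9, 12, 8)            # 1e9 .. 1e12 tokens
N, D = (g.ravel() for g in np.meshgrid(Ns, Ds))
true = dict(E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28)
L = parametric_loss((N, D), **true) + rng.normal(0.0, 0.01, N.shape)

# Recover the five constants from the noisy observed losses.
popt, _ = curve_fit(parametric_loss, (N, D), L,
                    p0=[2.0, 300.0, 0.3, 300.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt

# Under C ~ 6*N*D, the compute-optimal model size scales as
# N* ~ C**a with a = beta / (alpha + beta).
a = beta / (alpha + beta)
print(f"E={E:.2f}  alpha={alpha:.2f}  beta={beta:.2f}  N* exponent a={a:.2f}")
```

The useful output is the fitted exponents: they determine how to split a compute budget between model size and data, which is how Hoffmann et al arrive at their roughly equal parameter/token scaling.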