Scaling Laws
Scaling laws are used to pick optimal hyperparameters for large models.
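In the simplest form, a scaling law is a power-law fit of loss against compute (or parameters, or data), extrapolated to pick settings for a larger run. A minimal sketch, using hypothetical (compute, loss) points and a log-log linear fit; the data values and extrapolation target are made up for illustration:

```python
import numpy as np

# Hypothetical (compute, loss) observations from small training runs;
# real scaling-law fits use many runs across several orders of magnitude.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.2, 2.9, 2.6, 2.35])

# Fit loss ~ a * compute^(-b) by linear regression in log-log space.
neg_b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -neg_b

# Extrapolate the fitted curve to a larger compute budget.
predicted = a * (1e22) ** (-b)
print(b, predicted)
```

The same recipe generalizes to joint fits over parameters and tokens (as in Chinchilla-style compute-optimal analyses).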
Papers
- Ruan et al 2024 - Observational Scaling Laws and the Predictability of Language Model Performance. Fits a multi-dimensional regression (with a sigmoid) to predict performance across model “families” (LLaMA, GPT-3, etc).
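The sigmoid-fit idea above can be sketched as follows. This is a toy version, assuming a single scalar capability score per model (the paper derives such scores from multiple benchmarks); the scores and accuracies here are invented:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, k, x0):
    """Sigmoid linking a scalar capability score to downstream accuracy."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical capability scores and downstream accuracies for models
# drawn from several different families.
scores = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
accuracy = np.array([0.05, 0.12, 0.45, 0.78, 0.93, 0.98])

(k, x0), _ = curve_fit(sigmoid, scores, accuracy, p0=[1.0, 0.0])

# Predict accuracy for an unseen, more capable model.
print(sigmoid(4.0, k, x0))
```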
Training LLMs
- Large models are usually trained with scaling laws in mind (often compute-optimal for deployment/inference rather than for training). See for example:
- Google 2023 - PaLM 2 Technical Report (see section 2)
Emergent Abilities
See also Language Model - Origin of Capabilities.
- GPT-3: Brown et al 2020 - Language Models are Few-Shot Learners. GPT-3 showed emergent abilities. See for example Fig 3.10.
- Hu et al 2023 - Predicting Emergent Abilities with Infinite Resolution Evaluation. Does bootstrap resampling to get a very fine-grained measure of model capabilities, resampling until the desired behavior appears a set number of times. Can be used to predict emergent capabilities from very small models that rarely exhibit the behavior.
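The resample-until-success idea can be sketched in a few lines. This is a toy simulation, not the paper's method: the "model" is a random trial with a tiny hypothetical success probability, and the estimator keeps sampling until a fixed number of successes is seen:

```python
import random

def pass_until(trial, k=10, max_samples=10_000_000, seed=0):
    """Sample until `k` successes are observed; the estimated pass rate
    is k / (samples used). Resolution grows with the sampling budget,
    so very rare abilities become measurable."""
    rng = random.Random(seed)
    successes, n = 0, 0
    while successes < k and n < max_samples:
        n += 1
        successes += trial(rng)
    return successes / n

# Hypothetical "model": succeeds on a task with probability 1e-4.
est = pass_until(lambda rng: rng.random() < 1e-4)
print(est)  # estimate of the rare success probability
```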
Related Pages
ml/scaling_laws.txt · Last modified: 2025/06/01 23:09 by jmflanig