

Distributed Training

Overviews

Papers

Distributed Optimization

Optimization methods used in data-parallel and model-parallel training (most methods apply to either setting).
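A minimal NumPy sketch of the canonical distributed-optimization step in data-parallel training: every worker holds a full parameter copy, computes a gradient on its own shard of the batch, and the gradients are averaged (an all-reduce) before each replica applies the identical update. All names here (`local_gradient`, `data_parallel_step`, the least-squares objective) are hypothetical illustrations, not from any specific framework.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of mean squared error 0.5 * ||x @ w - y||^2 / n on one shard."""
    n = len(x)
    return x.T @ (x @ w - y) / n

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous SGD step: average per-shard gradients (simulated
    all-reduce), then apply the same update on every replica."""
    grads = [local_gradient(w, xs, ys) for xs, ys in shards]
    avg_grad = np.mean(grads, axis=0)   # all-reduce: average the gradients
    return w - lr * avg_grad            # identical update everywhere

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
shards = [(x[:4], y[:4]), (x[4:], y[4:])]  # split the batch over 2 "workers"

w = np.zeros(3)
for _ in range(500):
    w = data_parallel_step(w, shards)      # converges toward true_w
```

Because the shards are equal-sized, the averaged gradient is exactly the full-batch gradient, so synchronous data-parallel SGD traces the same trajectory as single-device SGD on the whole batch.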

Data Parallel

Model Parallel (or a combination of model + data parallel)

Pipeline parallelism, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; this is used in GPipe, among others. Tensor slicing (tensor parallelism), used in Megatron-LM, is another form of model parallelism. Megatron-Turing NLG uses both.
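The tensor-slicing idea above can be sketched in a few lines of NumPy, in the style of a Megatron-LM column-parallel linear layer: the weight matrix is split column-wise across devices, each device computes a partial matmul on the same input, and the partial outputs are concatenated (an all-gather). The function name and shapes are hypothetical; real implementations shard across actual devices and fuse communication with compute.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    """Compute y = x @ w with w's columns sharded across `num_devices`
    simulated devices (column-parallel tensor slicing)."""
    w_shards = np.split(w, num_devices, axis=1)  # each device holds a column slice
    partials = [x @ w_k for w_k in w_shards]     # computed independently per device
    return np.concatenate(partials, axis=1)      # all-gather along the column axis

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))
w = rng.normal(size=(6, 8))
y = column_parallel_matmul(x, w, num_devices=4)  # identical to x @ w
```

Splitting by columns means no reduction is needed for this layer's output; splitting by rows instead would require an all-reduce to sum partial results, which is why Megatron-style layers alternate the two to minimize communication.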

Software

Conferences and Workshops

ml/distributed_training.1741670365.txt.gz · Last modified: 2025/03/11 05:19 by jmflanig
