Distributed Training and Inference
Overviews
- Aji & Heafield 2017 has a concise summary of the field in its introduction and related work
- For a modern overview, see section 3.4 of Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference
- A good overview is in section 3.3.2 of the Llama 3 technical report, which uses tensor parallelism, pipeline parallelism, context parallelism, and data parallelism
- Chahal et al 2018 - A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks. Older, but thorough at the time
- Targeted Overviews
- Blog posts
- Parameter server. Warning: contains some conceptual errors
Papers
Distributed Optimization
Optimization methods for both data-parallel and model-parallel training (most of these methods apply to either setting).
- Aji & Heafield 2017 - Sparse Communication for Distributed Gradient Descent. Has a good summary of related work
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Reduces BERT training time from 3 days to 76 minutes on a single TPU Pod
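The core trick in sparse gradient communication (as in Aji & Heafield above) is to transmit only the largest-magnitude gradient entries and keep the dropped mass locally as a residual. A minimal NumPy sketch of the idea, with illustrative names of my own choosing (`sparsify_topk` is not from the paper):

```python
import numpy as np

def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries; return (indices, values)
    plus the residual that stays on the worker for later accumulation."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    values = grad[idx]
    residual = grad.copy()
    residual[idx] = 0.0            # dropped entries are kept locally
    return idx, values, residual

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)

idx, vals, residual = sparsify_topk(grad, k=10)  # send ~1% of entries
dense = np.zeros_like(grad)
dense[idx] = vals                  # what the receiver reconstructs

# the transmitted part plus the local residual recovers the full gradient
assert np.allclose(dense + residual, grad)
```

In the real method the residual is added to the next step's gradient before the next top-k selection, so no gradient mass is ever permanently lost.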
Data Parallel
Model Parallel (or a combination of model + data parallel)
Pipeline parallelism, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; this is used in GPipe, among others. Tensor slicing (tensor parallelism) is another type of model parallelization, used in Megatron-LM. Megatron-Turing NLG uses both types.
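The essence of tensor slicing can be shown with a toy NumPy sketch (my own illustration, not code from any of the papers below): a Megatron-style column split of a weight matrix across two "devices", where each device computes a partial matmul and the shards are concatenated (an all-gather in a real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))      # batch of activations (replicated)
W = rng.normal(size=(8, 6))      # full weight matrix

W0, W1 = np.split(W, 2, axis=1)  # each device holds half the columns
Y0 = X @ W0                      # computed on device 0
Y1 = X @ W1                      # computed on device 1
Y = np.concatenate([Y0, Y1], axis=1)  # "all-gather" the output shards

assert np.allclose(Y, X @ W)     # matches the unsharded computation
```

Pipeline parallelism, by contrast, would keep each weight matrix whole but place consecutive layers on different devices and stream micro-batches through them.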
- Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers. Distributes tensor operations (e.g. matrix multiplication) by tiling tensors across devices. Used in T5.
- Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Partitions groups of layers across different accelerators in a pipeline, resulting in linear speedup across multiple accelerators. Works because there is little communication overhead between layers. Megatron-Turing NLG uses this (page 6) along with other methods.
- Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. This paper introduced what is now called tensor parallelism. Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B-parameter LM. blog github
- Rajbhandari et al 2019 - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Allows training models of up to 100B parameters.
- Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Trains a sparsely-gated Mixture-of-Experts model with up to 600B parameters.
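ZeRO's stage-1 idea can be sketched in a few lines (a toy illustration under my own simplifications, not DeepSpeed code): instead of every data-parallel worker replicating the full optimizer state, each worker owns the state (here, SGD momentum) for only its shard of the parameters, updates that shard, and the updated shards are all-gathered:

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=8)
grad = rng.normal(size=8)          # assume gradients already all-reduced

shards = np.split(params.copy(), 2)            # each worker's parameter shard
momenta = [np.zeros_like(s) for s in shards]   # sharded optimizer state
grad_shards = np.split(grad, 2)

lr, beta = 0.1, 0.9
for i in range(2):                 # each worker updates only its own shard
    momenta[i] = beta * momenta[i] + grad_shards[i]
    shards[i] -= lr * momenta[i]

params_new = np.concatenate(shards)  # "all-gather" the updated shards

# with zero initial momentum this matches an unsharded momentum-SGD step
assert np.allclose(params_new, params - lr * grad)
```

The memory saving comes from the optimizer state (and, in later ZeRO stages, gradients and parameters) being partitioned rather than replicated across the data-parallel group.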
Distributed Serving (Inference)
Network (Design and Topology)
Software
- Mesh-TensorFlow
- Megatron-LM. Distributed training for language models; used, for example, by BLOOM.
- HuggingFace Accelerate. Easy distributed training for PyTorch models (blog post introduction). I believe this is essentially GPipe
- PyTorch Fully Sharded Data Parallel (paper). This is what Chris T. used for pre-training experiments
Conferences and Workshops
Related Pages
ml/distributed_training.txt · Last modified: 2025/05/29 07:18 by jmflanig