Distributed Training
Overviews
- Concise summary in the introduction and related work here: Aji 2017
- For a modern overview, see section 3.4 of Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference
- Targeted Overviews
- Blog posts
- Parameter server (warning: contains some conceptual errors)
Papers
Distributed Optimization
Optimization methods for distributed training; usually a given method can be used in both data-parallel and model-parallel settings.
- Aji & Heafield 2017 - Sparse Communication for Distributed Gradient Descent. Has a good summary of related work.
- LAMB optimizer: You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. Large-batch training distributed across a TPU Pod; reduces BERT training time from 3 days to 76 minutes.
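The core trick in Aji & Heafield 2017 is to communicate only the largest-magnitude gradient entries each step and accumulate the rest locally. A minimal NumPy sketch of that idea (our own illustration; function and variable names are not from the paper):

```python
import numpy as np

def sparsify_gradient(grad, residual, k):
    """Keep only the k largest-magnitude entries of (grad + residual).

    Returns the (indices, values) a worker would communicate, plus the
    new residual: dropped entries are accumulated locally and get another
    chance to be sent on later steps.
    """
    combined = grad + residual
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(combined), -k)[-k:]
    values = combined[idx]
    new_residual = combined.copy()
    new_residual[idx] = 0.0  # sent entries leave the local residual
    return idx, values, new_residual

# Toy usage: only 1% of a 1000-element gradient is communicated per step.
rng = np.random.default_rng(0)
grad = rng.normal(size=1000)
residual = np.zeros(1000)
idx, values, residual = sparsify_gradient(grad, residual, k=10)
```

In practice the dropped-gradient residual is what keeps convergence close to dense SGD despite communicating ~1% of the values.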
Data Parallel
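The basic data-parallel recipe: every worker holds a full model replica, computes gradients on its own shard of the batch, and the gradients are averaged (an all-reduce) before each update. A single-process simulation of that recipe for least-squares regression (our own sketch, not code from any paper above):

```python
import numpy as np

def data_parallel_step(w, X, y, lr=0.1, n_workers=4):
    """One SGD step for least-squares linear regression, simulating data
    parallelism: each "worker" has a full copy of w, computes the gradient
    on its shard of the batch, and the gradients are averaged
    (the all-reduce step)."""
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    local_grads = [2 * Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]
    g = np.mean(local_grads, axis=0)  # all-reduce: average worker gradients
    return w - lr * g

# With equal-sized shards this matches a plain full-batch step exactly.
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.zeros(3)
w_dp = data_parallel_step(w, X, y)
w_full = w - 0.1 * (2 * X.T @ (X @ w - y) / len(y))
assert np.allclose(w_dp, w_full)
```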
Model Parallel (or a combination of model + data parallel)
Pipeline parallel, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallel. This is used in GPipe, etc. Tensor slicing (or tensor parallel) is another type of model parallelization, used in Megatron-LM. Megatron-Turing NLG uses both types.
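The pipeline-parallel partitioning above can be simulated in a single process (our own sketch; real pipelining runs the stages concurrently on separate accelerators):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) / 8 for _ in range(6)]  # toy 6-layer net

def full_forward(x):
    for W in layers:
        x = np.tanh(x @ W)
    return x

# Partition the 6 layers into 3 pipeline "stages" of 2 layers each.
stages = [layers[0:2], layers[2:4], layers[4:6]]

def stage_forward(x, stage):
    for W in stage:
        x = np.tanh(x @ W)
    return x

# Split the batch into microbatches and stream them through the stages.
# Only activations cross stage boundaries, which is why communication
# overhead between layers is low.
x = rng.normal(size=(4, 8))
outs = []
for mb in np.array_split(x, 2):
    for stage in stages:
        mb = stage_forward(mb, stage)
    outs.append(mb)
pipelined = np.vstack(outs)
assert np.allclose(pipelined, full_forward(x))
```

With S stages and M microbatches, a pipelined forward pass takes roughly S + M - 1 stage-steps instead of S * M sequential ones, which is where the near-linear speedup comes from.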
- Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers Distributes tensor operations (e.g. matrix multiplications) by tiling them across devices. Used in T5.
- Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism Partitions groups of layers across different accelerators in a pipeline, resulting in near-linear speedup across multiple accelerators. Works because there is little communication overhead between layers. Megatron-Turing NLG uses this (page 6) along with other methods.
- Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B-parameter LM. blog github
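The Megatron feedforward split can be sketched in a few lines of NumPy (our own illustration of the idea, using ReLU for simplicity where Megatron uses GeLU):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))    # input activations
W1 = rng.normal(size=(16, 32))  # first feedforward weight
W2 = rng.normal(size=(32, 16))  # second feedforward weight
relu = lambda a: np.maximum(a, 0.0)

# Single-device reference: Y = ReLU(X W1) W2
Y_ref = relu(X @ W1) @ W2

# Split across 2 "devices": W1 by columns, W2 by rows. Because the
# activation is applied element-wise, the column split commutes with it,
# so each device computes its partial output independently and only the
# partial outputs must be summed (a single all-reduce in practice).
W1_cols = np.split(W1, 2, axis=1)
W2_rows = np.split(W2, 2, axis=0)
Y = sum(relu(X @ W1c) @ W2r for W1c, W2r in zip(W1_cols, W2_rows))
assert np.allclose(Y, Y_ref)
```

Splitting the first GEMM column-wise is the key design choice: it avoids a synchronization point between the two feedforward matrix multiplications.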
Software
- Mesh-Tensorflow
- Megatron-LM Distributed training for language models. Used, for example, by BLOOM.
- HuggingFace Accelerate Easy distributed training for PyTorch models (blog post introduction). A thin wrapper over PyTorch's distributed backends (DDP, FSDP, DeepSpeed integration) rather than a pipeline-parallel system like GPipe.
- PyTorch Fully Sharded Data Parallel (paper). This is what Chris T. used for pre-training experiments.
Conferences and Workshops
Related Pages
ml/distributed_training.1741670570.txt.gz · Last modified: 2025/03/11 05:22 by jmflanig