====== Distributed Training and Inference ======

===== Overviews =====

  * Concise summary in the introduction and related work here: [[https://arxiv.org/pdf/1704.05021.pdf|Aji 2017]]
  * For a modern overview, see section 3.4 of [[https://arxiv.org/pdf/2401.02038|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]]
  * A good overview is in section 3.3.2 of [[https://arxiv.org/pdf/2407.21783|the Llama 3 technical report]], which uses tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
  * [[https://arxiv.org/pdf/1912.09789|Verbraeken et al 2019 - A Survey on Distributed Machine Learning]]
  * [[https://arxiv.org/pdf/1810.11787|Chahal et al 2018 - A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks]] Older, but thorough at the time
  * **Targeted Overviews**
    * **[[https://arxiv.org/pdf/2003.03009|Ouyang & Dong 2020 - Communication optimization strategies for distributed deep neural network training: A survey]]**
    * [[https://arxiv.org/pdf/2205.11913|Gao et al 2022 - Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision]]
    * [[https://arxiv.org/pdf/2411.05614|Liu et al 2024 - Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey]]
    * [[https://arxiv.org/pdf/2404.06114|Liang et al 2024 - Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey]]
  * **Blog posts**
    * [[http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/|Parameter server]] Warning: contains some conceptual errors
    * [[https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8|An Overview of Pipeline Parallelism and its Research Progress]]

===== Papers =====

==== Distributed Optimization ====

These methods provide the optimizer in both data-parallel and model-parallel training (usually a given method can be used for both).
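The common pattern underlying these optimizers is synchronous data-parallel SGD: each worker computes a gradient on its data shard, the gradients are averaged (normally via an all-reduce), and all workers apply the same update. Methods like 1-bit SGD and sparse gradient communication compress what gets exchanged in that averaging step. A minimal single-process NumPy sketch of the pattern (toy least-squares problem; all names and shapes are illustrative, not from any of the papers below):

```python
import numpy as np

# Toy synchronous data-parallel SGD on a least-squares problem.
# Each "worker" holds a shard of the data and computes a local gradient;
# averaging the gradients stands in for an all-reduce before the shared
# parameters are updated.

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

def local_gradient(w, X_shard, y_shard):
    # Gradient of 0.5 * mean squared error on this worker's shard.
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

w = np.zeros(4)
lr = 0.1
for step in range(200):
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    avg_grad = np.mean(grads, axis=0)  # stands in for an all-reduce
    w -= lr * avg_grad

print(w)  # converges toward true_w
```

In a real framework the `np.mean` over worker gradients is replaced by a collective operation (e.g. an all-reduce across processes), but the update rule is the same.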
  * [[https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf|Parameter server]]
  * [[https://arxiv.org/pdf/1106.5730.pdf|Hogwild]]
  * [[https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140694.pdf|Seide et al 2014 - 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs]]
  * [[https://arxiv.org/pdf/1704.05021.pdf|Aji & Heafield 2017 - Sparse Communication for Distributed Gradient Descent]] Has a good summary of related work
  * LAMB optimizer: [[https://arxiv.org/pdf/1904.00962.pdf|You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes]] Distributed on a single TPU Pod; reduces BERT training time from 3 days to 76 minutes.

==== Data Parallel ====

  * [[https://arxiv.org/pdf/2006.15704|Li et al 2020 - PyTorch Distributed: Experiences on Accelerating Data Parallel Training]]

==== Model Parallel (or a combination of model + data parallel) ====

**Pipeline parallelism**, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; it is used in GPipe and related systems. **Tensor slicing (or tensor parallelism)** is another type of model parallelism, used in [[https://arxiv.org/pdf/1909.08053.pdf|Megatron-LM]]. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses both types.

  * [[https://arxiv.org/pdf/1404.5997.pdf|Krizhevsky 2014 - One weird trick for parallelizing convolutional neural networks]]
  * [[https://arxiv.org/pdf/1811.02084.pdf|Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers]] Distributes tensor operations (e.g. matrix multiplication) by tiling tensors across devices. Used in T5.
  * [[https://arxiv.org/pdf/1811.06965.pdf|Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism]] Partitions groups of layers across different accelerators in a pipeline, resulting in near-linear speedup across multiple accelerators. Works because there is little communication overhead between layers. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses this (page 6) along with other methods.
  * [[https://arxiv.org/pdf/1909.08053.pdf|Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism]] This paper introduced what is now called **tensor parallelism**. Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B-parameter LM. [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog]] [[https://github.com/NVIDIA/Megatron-LM|github]]
  * [[https://arxiv.org/pdf/1910.02054.pdf|Rajbhandari et al 2019 - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]] Enables training LMs of up to 100B parameters.
  * [[https://openreview.net/pdf?id=qrwe7XHTmYb|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] Trains a sparsely-gated Mixture-of-Experts model with up to 600B parameters.
  * [[https://arxiv.org/pdf/2101.06840.pdf|Ren et al 2021 - ZeRO-Offload: Democratizing Billion-Scale Model Training]] [[https://www.deepspeed.ai/tutorials/zero-offload/|Code and tutorial]]
  * [[https://arxiv.org/pdf/2104.07857.pdf|Rajbhandari et al 2021 - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning]]
  * [[https://arxiv.org/pdf/2201.12023.pdf|Zheng et al 2022 - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning]]
  * [[https://arxiv.org/pdf/2203.12533.pdf|Barham et al 2022 - Pathways: Asynchronous Distributed Dataflow for ML]] Used to train [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
  * [[https://arxiv.org/pdf/2401.10241.pdf|Qi et al 2024 - Zero Bubble Pipeline Parallelism]]
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2408.04093|Shyam et al 2024 - Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters]]

===== Distributed Serving (Inference) =====

  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]

===== Network (Design and Topology) =====

  * [[https://arxiv.org/pdf/2202.00433|Wang et al 2022 - TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs]]

===== Software =====

  * [[https://arxiv.org/pdf/2006.15704|PyTorch Distributed]]
  * [[https://www.deepspeed.ai/|DeepSpeed]], [[https://github.com/microsoft/deepspeed|GitHub]] Highly optimized distributed training for PyTorch. See their related papers, such as ZeRO-Offload
  * Mesh-TensorFlow
  * [[https://github.com/NVIDIA/Megatron-LM|Megatron-LM]] Distributed training for language models. Used, for example, by [[https://huggingface.co/docs/transformers/model_doc/bloom|BLOOM]].
  * [[https://github.com/huggingface/accelerate|HuggingFace Accelerate]] Easy distributed training for PyTorch models ([[https://huggingface.co/blog/accelerate-library|blog post introduction]]). I believe this is essentially [[https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf|GPipe]]
  * [[https://alpa.ai/|Alpa]] [[https://arxiv.org/pdf/2201.12023.pdf|paper]] [[https://sites.google.com/view/icml-2022-big-model|ICML 2022 Tutorial]] Developed at Berkeley
  * [[https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/|PyTorch Fully Sharded Data Parallel]] [[https://arxiv.org/pdf/2304.11277.pdf|paper]] This is what Chris T. used for pre-training experiments

===== Conferences and Workshops =====

  * [[https://mlsys.org/|MLSys]]

===== Related Pages =====

  * [[Large-Scale|Large-Scale ML]]
  * [[GPU Deep Learning]]
  * [[Systems & ML]]
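As a concrete illustration of the tensor-parallel split described in the Model Parallel section above (Megatron-LM's column-/row-wise sharding of a feedforward block), here is a minimal single-process NumPy sketch. The two "devices" are simulated with a loop, and the shapes and names are illustrative, not taken from the paper:

```python
import numpy as np

# Toy sketch of Megatron-LM-style tensor parallelism for an MLP block
# y = relu(x @ W1) @ W2, split across two simulated "devices":
# W1 is split by columns (each device computes part of the hidden layer),
# W2 is split by rows, so each device produces a partial output that an
# all-reduce (here: a plain sum) combines.

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))     # batch of activations
W1 = rng.normal(size=(16, 32))   # first projection
W2 = rng.normal(size=(32, 16))   # second projection

def relu(z):
    return np.maximum(z, 0.0)

# Reference: single-device computation.
reference = relu(x @ W1) @ W2

# Each device holds half of W1's columns and the matching half of
# W2's rows. No communication is needed between the two matmuls,
# because relu is applied elementwise to each device's hidden shard.
partials = []
for dev in range(2):
    W1_shard = W1[:, dev * 16:(dev + 1) * 16]   # column split
    W2_shard = W2[dev * 16:(dev + 1) * 16, :]   # row split
    partials.append(relu(x @ W1_shard) @ W2_shard)

# All-reduce: sum the partial outputs to recover the full result.
output = sum(partials)
assert np.allclose(output, reference)
```

The key design point this illustrates is why the column-then-row split is chosen: it places the only required communication (one all-reduce) after the second matmul, rather than between the two layers.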