====== Distributed Training and Inference ======

===== Overviews =====

  * Concise summary in the introduction and related work here: [[https://arxiv.org/pdf/1704.05021.pdf|Aji 2017]]
  * For a modern overview, see section 3.4 of [[https://arxiv.org/pdf/2401.02038|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]]
  * A good overview is in section 3.3.2 of [[https://arxiv.org/pdf/2407.21783|the Llama 3 technical report]], which uses tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
  * [[https://arxiv.org/pdf/1912.09789|Verbraeken et al 2019 - A Survey on Distributed Machine Learning]]
  * [[https://arxiv.org/pdf/1810.11787|Chahal et al 2018 - A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks]] Older, but thorough at the time
  * **Targeted Overviews**
    * **[[https://arxiv.org/pdf/2003.03009|Ouyang & Dong 2020 - Communication optimization strategies for distributed deep neural network training: A survey]]**
    * [[https://arxiv.org/pdf/2205.11913|Gao et al 2022 - Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision]]
    * [[https://arxiv.org/pdf/2411.05614|Liu et al 2024 - Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey]]
    * [[https://arxiv.org/pdf/2404.06114|Liang et al 2024 - Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey]]
  * **Blog posts**
    * [[http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/|Parameter server]] Warning: contains some conceptual errors
    * [[https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8|An Overview of Pipeline Parallelism and its Research Progress]]

===== Papers =====

==== Distributed Optimization ====

These methods provide the optimizer in both data-parallel and model-parallel training (usually a given method can be used for both).
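The common pattern underlying these optimizers is synchronous data-parallel SGD: each worker computes a gradient on its data shard, the gradients are averaged (normally via an all-reduce), and all workers apply the same update. Methods like 1-bit SGD and sparse gradient communication compress what gets exchanged in that averaging step. A minimal single-process NumPy sketch of the pattern (toy least-squares problem; all names and shapes are illustrative, not from any of the papers below):

```python
import numpy as np

# Toy synchronous data-parallel SGD on a least-squares problem.
# Each "worker" holds a shard of the data and computes a local gradient;
# averaging the gradients stands in for an all-reduce before the shared
# parameters are updated.

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

def local_gradient(w, X_shard, y_shard):
    # Gradient of 0.5 * mean squared error on this worker's shard.
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

w = np.zeros(4)
lr = 0.1
for step in range(200):
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    avg_grad = np.mean(grads, axis=0)  # stands in for an all-reduce
    w -= lr * avg_grad

print(w)  # converges toward true_w
```

In a real framework the `np.mean` over worker gradients is replaced by a collective operation (e.g. an all-reduce across processes), but the update rule is the same.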
  * [[https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf|Parameter server]]
  * [[https://arxiv.org/pdf/1106.5730.pdf|Hogwild]]
  * [[https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140694.pdf|Seide et al 2014 - 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs]]
  * [[https://arxiv.org/pdf/1704.05021.pdf|Aji & Heafield 2017 - Sparse Communication for Distributed Gradient Descent]] Has a good summary of related work
  * LAMB optimizer: [[https://arxiv.org/pdf/1904.00962.pdf|You et al 2020 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes]] Distributed on a single TPU Pod; reduces BERT training time from 3 days to 76 minutes.

==== Data Parallel ====

  * [[https://arxiv.org/pdf/2006.15704|Li et al 2020 - PyTorch Distributed: Experiences on Accelerating Data Parallel Training]]

==== Model Parallel (or a combination of model + data parallel) ====

**Pipeline parallelism**, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; it is used in GPipe and related systems. **Tensor slicing (or tensor parallelism)** is another type of model parallelism, used in [[https://arxiv.org/pdf/1909.08053.pdf|Megatron-LM]]. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses both types.

  * [[https://arxiv.org/pdf/1404.5997.pdf|Krizhevsky 2014 - One weird trick for parallelizing convolutional neural networks]]
  * [[https://arxiv.org/pdf/1811.02084.pdf|Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers]] Distributes tensor operations (e.g. matrix multiplication) by tiling tensors across devices. Used in T5.
  * [[https://arxiv.org/pdf/1811.06965.pdf|Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism]] Partitions groups of layers across different accelerators in a pipeline, resulting in near-linear speedup across multiple accelerators. Works because there is little communication overhead between layers. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses this (page 6) along with other methods.
  * [[https://arxiv.org/pdf/1909.08053.pdf|Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism]] This paper introduced what is now called **tensor parallelism**. Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B-parameter LM. [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog]] [[https://github.com/NVIDIA/Megatron-LM|github]]
  * [[https://arxiv.org/pdf/1910.02054.pdf|Rajbhandari et al 2019 - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]] Enables training LMs of up to 100B parameters.
  * [[https://openreview.net/pdf?id=qrwe7XHTmYb|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] Trains a sparsely-gated Mixture-of-Experts model with up to 600B parameters.
  * [[https://arxiv.org/pdf/2101.06840.pdf|Ren et al 2021 - ZeRO-Offload: Democratizing Billion-Scale Model Training]] [[https://www.deepspeed.ai/tutorials/zero-offload/|Code and tutorial]]
  * [[https://arxiv.org/pdf/2104.07857.pdf|Rajbhandari et al 2021 - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning]]
  * [[https://arxiv.org/pdf/2201.12023.pdf|Zheng et al 2022 - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning]]
  * [[https://arxiv.org/pdf/2203.12533.pdf|Barham et al 2022 - Pathways: Asynchronous Distributed Dataflow for ML]] Used to train [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
  * [[https://arxiv.org/pdf/2401.10241.pdf|Qi et al 2024 - Zero Bubble Pipeline Parallelism]]
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2408.04093|Shyam et al 2024 - Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters]]

===== Distributed Serving (Inference) =====

  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]

===== Network (Design and Topology) =====

  * [[https://arxiv.org/pdf/2202.00433|Wang et al 2022 - TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs]]

===== Software =====

  * [[https://arxiv.org/pdf/2006.15704|PyTorch Distributed]]
  * [[https://www.deepspeed.ai/|DeepSpeed]], [[https://github.com/microsoft/deepspeed|GitHub]] Highly optimized distributed training for PyTorch. See their related papers, such as ZeRO-Offload
  * Mesh-TensorFlow
  * [[https://github.com/NVIDIA/Megatron-LM|Megatron-LM]] Distributed training for language models. Used, for example, by [[https://huggingface.co/docs/transformers/model_doc/bloom|BLOOM]].
  * [[https://github.com/huggingface/accelerate|HuggingFace Accelerate]] Easy distributed training for PyTorch models ([[https://huggingface.co/blog/accelerate-library|blog post introduction]]). I believe this is essentially [[https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf|GPipe]]
  * [[https://alpa.ai/|Alpa]] [[https://arxiv.org/pdf/2201.12023.pdf|paper]] [[https://sites.google.com/view/icml-2022-big-model|ICML 2022 Tutorial]] Developed at Berkeley
  * [[https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/|PyTorch Fully Sharded Data Parallel]] [[https://arxiv.org/pdf/2304.11277.pdf|paper]] This is what Chris T. used for pre-training experiments

===== Conferences and Workshops =====

  * [[https://mlsys.org/|MLSys]]

===== Related Pages =====

  * [[Large-Scale|Large-Scale ML]]
  * [[GPU Deep Learning]]
  * [[Systems & ML]]
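As a concrete illustration of the tensor-parallel split described in the Model Parallel section above (Megatron-LM's column-/row-wise sharding of a feedforward block), here is a minimal single-process NumPy sketch. The two "devices" are simulated with a loop, and the shapes and names are illustrative, not taken from the paper:

```python
import numpy as np

# Toy sketch of Megatron-LM-style tensor parallelism for an MLP block
# y = relu(x @ W1) @ W2, split across two simulated "devices":
# W1 is split by columns (each device computes part of the hidden layer),
# W2 is split by rows, so each device produces a partial output that an
# all-reduce (here: a plain sum) combines.

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))     # batch of activations
W1 = rng.normal(size=(16, 32))   # first projection
W2 = rng.normal(size=(32, 16))   # second projection

def relu(z):
    return np.maximum(z, 0.0)

# Reference: single-device computation.
reference = relu(x @ W1) @ W2

# Each device holds half of W1's columns and the matching half of
# W2's rows. No communication is needed between the two matmuls,
# because relu is applied elementwise to each device's hidden shard.
partials = []
for dev in range(2):
    W1_shard = W1[:, dev * 16:(dev + 1) * 16]   # column split
    W2_shard = W2[dev * 16:(dev + 1) * 16, :]   # row split
    partials.append(relu(x @ W1_shard) @ W2_shard)

# All-reduce: sum the partial outputs to recover the full result.
output = sum(partials)
assert np.allclose(output, reference)
```

The key design point this illustrates is why the column-then-row split is chosen: it places the only required communication (one all-reduce) after the second matmul, rather than between the two layers.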