====== Distributed Training and Inference ======
  
===== Overviews =====
  * Concise summary in the introduction and related work here: [[https://arxiv.org/pdf/1704.05021.pdf|Aji 2017]]
  * For a modern overview, see section 3.4 of [[https://arxiv.org/pdf/2401.02038|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]]
  * A good overview is in section 3.3.2 of [[https://arxiv.org/pdf/2407.21783|the Llama 3 technical report]], which uses tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
  * [[https://arxiv.org/pdf/1912.09789|Verbraeken et al 2019 - A Survey on Distributed Machine Learning]]
  * [[https://arxiv.org/pdf/1810.11787|Chahal et al 2018 - A Hitchhiker's Guide On Distributed Training of Deep Neural Networks]] Older, but thorough for its time.
  * **Targeted Overviews**
    * **[[https://arxiv.org/pdf/2003.03009|Ouyang & Dong 2020 - Communication optimization strategies for distributed deep neural network training: A survey]]**
    * [[https://arxiv.org/pdf/2205.11913|Gao et al 2022 - Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision]]
    * [[https://arxiv.org/pdf/2411.05614|Liu et al 2024 - Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey]]
    * [[https://arxiv.org/pdf/2404.06114|Liang et al 2024 - Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey]]
  * **Blog posts**
    * [[http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/|Parameter server]] Warning: contains some conceptual errors
===== Model Parallel (or a combination of model + data parallel) =====
  * [[https://arxiv.org/pdf/1811.02084.pdf|Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers]] Distributes tensor operations (matrix multiplication) by tiling across devices. Used in T5.
  * [[https://arxiv.org/pdf/1811.06965.pdf|Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism]] Partitions groups of layers across different accelerators in a pipeline, yielding near-linear speedup across multiple accelerators. Works because there is little communication overhead between layers. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses this (page 6) along with other methods.
  * [[https://arxiv.org/pdf/1909.08053.pdf|Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism]] This paper introduced what is now called **tensor parallelism**. Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B LM. [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog]] [[https://github.com/NVIDIA/Megatron-LM|github]]
  * [[https://arxiv.org/pdf/1910.02054.pdf|Rajbhandari et al 2019 - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]] Enables training LMs of up to 100B parameters.
  * [[https://openreview.net/pdf?id=qrwe7XHTmYb|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] Trains a Sparsely-Gated Mixture-of-Experts LM of up to 600B parameters.
  * [[https://arxiv.org/pdf/2101.06840.pdf|Ren et al 2021 - ZeRO-Offload: Democratizing Billion-Scale Model Training]] [[https://www.deepspeed.ai/tutorials/zero-offload/|Code and tutorial]]
  * [[https://arxiv.org/pdf/2104.07857.pdf|Rajbhandari et al 2021 - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning]]
  * [[https://arxiv.org/pdf/2201.12023.pdf|Zheng et al 2022 - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning]]
  * [[https://arxiv.org/pdf/2203.12533.pdf|Barham et al 2022 - Pathways: Asynchronous Distributed Dataflow for ML]] Used to train [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
  * [[https://arxiv.org/pdf/2401.10241.pdf|Qi et al 2024 - Zero Bubble Pipeline Parallelism]]
  * [[https://arxiv.org/pdf/2408.04093|Shyam et al 2024 - Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters]]
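As a concrete illustration of the Megatron-style feedforward split mentioned above, here is a NumPy sketch that fakes two devices with array shards (dimensions and variable names are invented for the example): the first weight matrix is split column-wise so the elementwise GeLU needs no communication, the second is split row-wise, and one all-reduce (here a plain sum) recovers the exact single-device result.

```python
import numpy as np

def gelu(x):
    # Common tanh approximation of GeLU; elementwise, so it commutes
    # with a column-wise split of the preceding matmul's output.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)

# Toy sizes: batch 4, hidden 8, FFN 16, two "devices".
X = rng.normal(size=(4, 8))    # activations, replicated on both devices
A = rng.normal(size=(8, 16))   # first MLP weight
B = rng.normal(size=(16, 8))   # second MLP weight

# Single-device reference: Y = GeLU(X @ A) @ B
Y_ref = gelu(X @ A) @ B

# Tensor-parallel version:
# A split column-wise -> each device computes an independent slice of
# GeLU(X @ A) with no communication.
A_shards = np.split(A, 2, axis=1)
# B split row-wise, matching each device's activation slice.
B_shards = np.split(B, 2, axis=0)

# Each device produces a partial sum of the output...
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
# ...and a single all-reduce (simulated here by summing) combines them.
Y_tp = sum(partials)

assert np.allclose(Y_ref, Y_tp)
```

The same pattern (shard the dimension that a later sum contracts over, then all-reduce once) is what the Megatron-LM paper applies to both the MLP and the attention heads.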
  
===== Distributed Serving (Inference) =====
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]

===== Network (Design and Topology) =====
  * [[https://arxiv.org/pdf/2202.00433|Wang et al 2022 - TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs]]
  
===== Software =====
ml/distributed_training.txt · Last modified: 2025/05/29 07:18 by jmflanig