ml:distributed_training (last modified 2025/05/29 07:18 by jmflanig)
  * [[https://arxiv.org/pdf/1811.02084.pdf|Shazeer et al 2018 - Mesh-TensorFlow: Deep Learning for Supercomputers]] Distributes tensor operations (e.g. matrix multiplication) by tiling tensors across devices. Used in T5.
  * [[https://arxiv.org/pdf/1811.06965.pdf|Huang et al 2018 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism]] Partitions groups of layers across different accelerators in a pipeline, resulting in near-linear speedup across multiple accelerators. Works because there is little communication overhead between consecutive layers. [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] uses this (page 6) along with other methods.
  * [[https://arxiv.org/pdf/1909.08053.pdf|Shoeybi et al 2019 - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism]] This paper introduced what is now called **tensor parallelism**. Proposes a particular split across devices for both the feedforward and attention layers in the Transformer. Used to train an 8.3B LM. [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog]] [[https://github.com/NVIDIA/Megatron-LM|github]]
  * [[https://arxiv.org/pdf/1910.02054.pdf|Rajbhandari et al 2019 - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]] Partitions optimizer states, gradients, and parameters across data-parallel workers; enables training LMs of up to 100B parameters.
  * [[https://openreview.net/pdf?id=qrwe7XHTmYb|Lepikhin et al 2020 - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding]] Trains a Sparsely-Gated Mixture-of-Experts LM of up to 600B parameters.
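As a rough illustration of the Megatron-style tensor-parallel split (a sketch, not code from any of the papers above): shard the first MLP weight matrix by columns and the second by rows, so each device computes an independent partial output and a single all-reduce recovers the full result. A minimal numpy version, with array shards standing in for devices and a plain sum standing in for the all-reduce:

```python
import numpy as np

# Megatron-style tensor parallelism for a 2-layer MLP, simulated on one
# machine. Shapes are illustrative; a real implementation runs each shard
# on its own GPU and all-reduces the partial outputs.
rng = np.random.default_rng(0)
d_model, d_ff, n_dev = 8, 16, 2

x = rng.normal(size=(4, d_model))      # batch of activations
A = rng.normal(size=(d_model, d_ff))   # first linear layer
B = rng.normal(size=(d_ff, d_model))   # second linear layer

relu = lambda h: np.maximum(h, 0.0)    # ReLU standing in for GeLU

# Reference: unsharded forward pass.
y_ref = relu(x @ A) @ B

# Shard A by columns and B by rows across the "devices". Each device
# computes a partial output independently; summing the partials (the
# all-reduce) recovers the full result.
A_shards = np.split(A, n_dev, axis=1)
B_shards = np.split(B, n_dev, axis=0)
partials = [relu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
y_tp = sum(partials)                   # stands in for all-reduce

assert np.allclose(y_ref, y_tp)
```

The split works because the elementwise nonlinearity commutes with the column partition, so no synchronization is needed between the two matrix multiplies, only one collective at the end.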
  * [[https://arxiv.org/pdf/2203.12533.pdf|Barham et al 2022 - Pathways: Asynchronous Distributed Dataflow for ML]] Used to train [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]]
  * [[https://arxiv.org/pdf/2401.10241.pdf|Qi et al 2024 - Zero Bubble Pipeline Parallelism]]
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
  * [[https://arxiv.org/pdf/2408.04093|Shyam et al 2024 - Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters]]
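Pipeline parallelism trades off the fill/drain "bubble" against microbatch count. A back-of-envelope sketch (illustrative numbers, not from the papers): with p stages and m microbatches, an idealized GPipe-style forward schedule takes p + m - 1 ticks, giving an idle fraction of (p - 1)/(p + m - 1), which is the overhead that Zero Bubble scheduling attacks:

```python
# Idle ("bubble") fraction of an idealized GPipe-style pipeline schedule,
# assuming every stage takes exactly one tick per microbatch.
def pipeline_ticks(p, m):
    """Ticks for m microbatches to flow through p pipeline stages."""
    return p + m - 1

def bubble_fraction(p, m):
    total = p * pipeline_ticks(p, m)  # stage-ticks available
    busy = p * m                      # stage-ticks doing useful work
    return (total - busy) / total     # equals (p - 1) / (p + m - 1)

# More microbatches amortize the pipeline fill/drain bubble.
print(bubble_fraction(4, 1))   # 0.75: no pipelining benefit
print(bubble_fraction(4, 16))  # ~0.16: bubble mostly amortized
```

This is why GPipe splits each batch into many microbatches; schedules like Zero Bubble go further by interleaving backward-pass work into the idle slots.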
  
===== Distributed Serving (Inference) =====
  * [[https://arxiv.org/pdf/2401.02669|Lin et al 2024 - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache]]
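A quick sanity check on why long-context serving needs a distributed KV cache: the cache grows linearly with sequence length, layer count, and batch size. The model dimensions below are illustrative (roughly 7B-class, multi-head attention), not taken from the paper:

```python
# Back-of-envelope KV-cache sizing -- the memory pressure that schemes
# like DistAttention / distributed KV caches address.
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, cached at every layer for every token;
    # dtype_bytes=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class dims: 32 layers, 32 heads, head_dim 128.
gb = kv_cache_bytes(32, 32, 128, seq_len=128_000, batch=1) / 2**30
print(f"{gb:.1f} GiB")  # 62.5 GiB: one long-context request already
                        # exceeds a single GPU's memory
```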
  
