
Distributed Training and Inference

Overviews

Papers

Distributed Optimization

These methods serve as the optimizer in both data-parallel and model-parallel training (usually a given method can be used in either setting).
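As a concrete illustration of the synchronous data-parallel case, here is a minimal sketch (not from any of the listed papers; all names are hypothetical) in which each worker holds a parameter replica and a data shard, and gradients are averaged with a stand-in for an all-reduce before every update, keeping all replicas identical:

```python
# Hypothetical sketch of synchronous data-parallel SGD on a 1-D
# linear model y = w * x. Each "worker" is simulated in-process.

def grad(w, shard):
    # Gradient of mean squared error, computed only on this
    # worker's shard of (x, y) pairs.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # Stand-in for a collective all-reduce (e.g. a NCCL ring
    # all-reduce): every worker receives the mean of all gradients.
    return sum(values) / len(values)

def data_parallel_sgd(shards, w0=0.0, lr=0.05, steps=50):
    w = [w0] * len(shards)  # one parameter replica per worker
    for _ in range(steps):
        local = [grad(w[i], shards[i]) for i in range(len(shards))]
        g = allreduce_mean(local)        # synchronize gradients
        w = [wi - lr * g for wi in w]    # identical update everywhere
    return w

# Data drawn from y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = data_parallel_sgd(shards)  # both replicas converge near 3.0
```

Because the averaged gradient equals the gradient over the full dataset, this recovers plain SGD on the whole batch while distributing the gradient computation.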

Data Parallel

Model Parallel (or a combination of model + data parallel)

Pipeline parallelism, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; it is used in GPipe, among others. Tensor slicing (tensor parallelism), which partitions individual weight matrices across accelerators, is another form of model parallelism, used in Megatron-LM. Megatron-Turing NLG uses both.

Distributed Serving (Inference)

Network (Design and Topology)

Software

Conferences and Workshops

ml/distributed_training.txt · Last modified: 2025/05/29 07:18 by jmflanig
