

Distributed Training

Overviews

Papers

Distributed Optimization

Optimization methods used in data-parallel and model-parallel training (most methods apply to either setting).
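A minimal NumPy sketch of the canonical distributed-optimization step in data-parallel training: every worker holds a full parameter copy, computes a gradient on its own shard of the batch, and the gradients are averaged (an all-reduce) before each replica applies the identical update. All names here (`local_gradient`, `data_parallel_step`, the least-squares objective) are hypothetical illustrations, not from any specific framework.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of mean squared error 0.5 * ||x @ w - y||^2 / n on one shard."""
    n = len(x)
    return x.T @ (x @ w - y) / n

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous SGD step: average per-shard gradients (simulated
    all-reduce), then apply the same update on every replica."""
    grads = [local_gradient(w, xs, ys) for xs, ys in shards]
    avg_grad = np.mean(grads, axis=0)   # all-reduce: average the gradients
    return w - lr * avg_grad            # identical update everywhere

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
shards = [(x[:4], y[:4]), (x[4:], y[4:])]  # split the batch over 2 "workers"

w = np.zeros(3)
for _ in range(500):
    w = data_parallel_step(w, shards)      # converges toward true_w
```

Because the shards are equal-sized, the averaged gradient is exactly the full-batch gradient, so synchronous data-parallel SGD traces the same trajectory as single-device SGD on the whole batch.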

Data Parallel

Model Parallel (or a combination of model + data parallel)

Pipeline parallelism, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallelism; this is used in GPipe, among others. Tensor slicing (tensor parallelism), used in Megatron-LM, is another form of model parallelism. Megatron-Turing NLG uses both.
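The tensor-slicing idea above can be sketched in a few lines of NumPy, in the style of a Megatron-LM column-parallel linear layer: the weight matrix is split column-wise across devices, each device computes a partial matmul on the same input, and the partial outputs are concatenated (an all-gather). The function name and shapes are hypothetical; real implementations shard across actual devices and fuse communication with compute.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    """Compute y = x @ w with w's columns sharded across `num_devices`
    simulated devices (column-parallel tensor slicing)."""
    w_shards = np.split(w, num_devices, axis=1)  # each device holds a column slice
    partials = [x @ w_k for w_k in w_shards]     # computed independently per device
    return np.concatenate(partials, axis=1)      # all-gather along the column axis

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))
w = rng.normal(size=(6, 8))
y = column_parallel_matmul(x, w, num_devices=4)  # identical to x @ w
```

Splitting by columns means no reduction is needed for this layer's output; splitting by rows instead would require an all-reduce to sum partial results, which is why Megatron-style layers alternate the two to minimize communication.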

Software

Conferences and Workshops

ml/distributed_training.1741670365.txt.gz · Last modified: 2025/03/11 05:19 by jmflanig
