Table of Contents

Distributed Training and Inference

Overviews

Papers

Distributed Optimization

Optimization methods used in data-parallel and model-parallel training (most methods apply to both).

Data Parallel

Model Parallel (or a combination of model + data parallel)

Pipeline parallel, which partitions groups of layers across different accelerators in a pipeline, can be thought of as a special case of model parallel; it is used in GPipe, among others. Tensor slicing (or tensor parallel), which splits individual layers across accelerators, is another type of model parallelism, used in Megatron-LM. Megatron-Turing NLG uses both types.
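To make the pipeline-parallel idea concrete, here is a minimal sketch (not any framework's actual API): a toy model's layers are partitioned into two stages, standing in for two accelerators, and the batch is split into micro-batches that flow through the stages, as in GPipe. The layer weights and stage split are illustrative assumptions.

```python
# Toy pipeline parallelism: partition groups of layers into stages
# ("devices") and feed micro-batches through them in sequence.
# In a real system the stages run on different accelerators and
# overlap their work; here the placement is only simulated.

def make_layer(w):
    # A toy "layer": multiply the activation by a weight.
    return lambda x: x * w

layers = [make_layer(w) for w in (2, 3, 5, 7)]  # the full model

# Partition groups of layers across two hypothetical accelerators.
stages = [layers[:2], layers[2:]]  # stage 0 -> device 0, stage 1 -> device 1

def run_stage(stage, x):
    for layer in stage:
        x = layer(x)
    return x

def pipeline_forward(stages, batch, num_microbatches=2):
    # Split the batch into micro-batches; with real devices, stage 1
    # would process micro-batch i while stage 0 works on micro-batch i+1.
    size = len(batch) // num_microbatches
    micro = [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]
    outputs = []
    for mb in micro:
        acts = [run_stage(stages[0], x) for x in mb]    # on "device 0"
        outs = [run_stage(stages[1], a) for a in acts]  # on "device 1"
        outputs.extend(outs)
    return outputs

print(pipeline_forward(stages, [1, 2, 3, 4]))  # each x -> x * (2*3*5*7) = x * 210
```

Tensor parallelism would instead split each layer's weight matrix across devices and combine partial results with a collective operation, rather than assigning whole layers to stages.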

Distributed Serving (Inference)

Network (Design and Topology)

Software

Conferences and Workshops