GPU Deep Learning

Overviews

Details of Deep Learning on GPUs

Parallelism on GPUs

Summary of parallelism across devices, from "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism":

There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. By increasing the minibatch size proportionally to the number of available workers (i.e. weak scaling), one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down the training time of large neural networks. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements.

However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model. Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates the memory pressure, but also increases the amount of parallelism independently of the microbatch size.

Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline, where a different group of operations is performed. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However, these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.
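The "pipeline bubbles" mentioned above can be illustrated with a toy schedule. The sketch below is not taken from GPipe or Megatron-LM. It assumes a forward-only pipeline in which every stage takes exactly one time step per microbatch and communication is free. The function names are hypothetical. Under those assumptions the idle fraction works out to (p - 1) / (m + p - 1) for p stages and m microbatches, so more microbatches shrink the bubble.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return a time-by-stage grid for a simple synchronous pipeline.

    Assumption: stage s works on microbatch m at time step s + m, and
    every stage takes one time step per microbatch. A None entry marks
    a step where that stage is idle.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            m = t - s
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule


def bubble_fraction(schedule):
    """Fraction of (time step, stage) slots in which the stage is idle."""
    slots = sum(len(row) for row in schedule)
    idle = sum(1 for row in schedule for m in row if m is None)
    return idle / slots


if __name__ == "__main__":
    for microbatches in (1, 4, 32):
        frac = bubble_fraction(pipeline_schedule(4, microbatches))
        print(f"4 stages, {microbatches} microbatches: idle fraction {frac:.3f}")
```

With a single microbatch the pipeline degenerates to running one stage at a time. Splitting the minibatch into many microbatches is what keeps the stages busy in this simplified model.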

Implementation: https://github.com/NVIDIA/Megatron-LM
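For the data parallelism described in the summary above, a minimal sketch may help. It is not Megatron-LM code. It assumes a one-parameter linear model with mean squared error, equal-sized shards, and simulates the gradient all-reduce with a plain Python average. All function names are hypothetical. It shows that averaging per-worker gradients over equal shards gives the same gradient as processing the whole minibatch on one worker.

```python
def mse_gradient(w, b, batch):
    """Gradient of the mean squared error of y_hat = w * x + b over a batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def data_parallel_gradient(w, b, minibatch, num_workers):
    """Split the minibatch into equal shards and average per-worker gradients.

    The averaging step stands in for an all-reduce across real devices.
    """
    if len(minibatch) % num_workers != 0:
        raise ValueError("minibatch size must be divisible by num_workers")
    shard = len(minibatch) // num_workers
    grads = [
        mse_gradient(w, b, minibatch[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    grad_w = sum(g[0] for g in grads) / num_workers
    grad_b = sum(g[1] for g in grads) / num_workers
    return grad_w, grad_b


if __name__ == "__main__":
    data = [(float(i), 3.0 * i + 1.0) for i in range(8)]
    print("single worker:", mse_gradient(0.5, 0.0, data))
    print("four workers: ", data_parallel_gradient(0.5, 0.0, data, 4))
```

Weak scaling in this picture means keeping the shard size fixed and growing the minibatch with the number of workers, which is where the large-batch optimization issues noted in the summary come from.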

Memory Reduction Techniques

See also Parameter-Efficient Tuning.

Memory reduction techniques allow increasing the batch size when training neural networks, which can lead to higher parallelism and faster training.
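Activation checkpointing, mentioned in the Megatron-LM summary above, is one such technique. The sketch below is a toy illustration rather than any library's implementation. It assumes a chain of scalar layers tanh(w * x), and the function names and the way stored values are counted are hypothetical. It keeps only the input to each segment during the forward pass and recomputes the rest during the backward pass, trading extra compute for fewer stored activations.

```python
import math


def layer(w, x):
    """One toy layer: a scalar weight followed by tanh."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x):
    """Backprop through the chain, storing every layer input."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, a in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, a)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Backprop storing only each segment's input, recomputing the rest."""
    checkpoints = []
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)  # keep the input to this segment only
        x = layer(w, x)
    grad = 1.0
    peak = len(checkpoints)
    for seg in reversed(range(len(checkpoints))):
        start = seg * segment
        seg_weights = weights[start:start + segment]
        a = checkpoints[seg]
        inputs = []
        for w in seg_weights:  # recompute this segment's forward pass
            inputs.append(a)
            a = layer(w, a)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, inp)
    return grad, peak


if __name__ == "__main__":
    ws = [0.9 + 0.01 * i for i in range(16)]
    print("store all:   ", grad_store_all(ws, 0.5))
    print("checkpointed:", grad_checkpointed(ws, 0.5, 4))
```

Both functions return the gradient of the chain output with respect to its input, together with the peak number of stored values under this toy accounting. The checkpointed version runs each layer's forward computation twice.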

Miscellaneous Deep Learning & GPU Papers

Miscellaneous Transformer & GPU Papers

Customized Implementations on GPUs

Resources

Software

Profiling

Conferences and Workshops

ml/gpu_deep_learning.txt · Last modified: 2025/07/17 03:25 by jmflanig
