Table of Contents
GPU Deep Learning
Overviews
- NLP 202 - Deep Learning on GPUs (start at slide 11)
- LLMs
Details of Deep Learning on GPUs
- Introduction to Deep Learning on GPUs (key part here)
- Pdf version: GPU Performance Background User's Guide
- “Choose batch sizes and neuron counts greater than 128 to avoid being limited by memory bandwidth (Tesla V100)”. For GPUs with a higher FLOPS/memory bandwidth ratio, this will need to be higher.
- “Choose the batch size and the number of inputs and outputs to be divisible by at least 64 and ideally 256”
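The divisibility guidance above can be sketched as a small helper that rounds layer sizes up to the nearest multiple (the function name and values here are illustrative, not from the guide):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (e.g. 64 or 256) so GEMM
    dimensions map cleanly onto the GPU's tile sizes."""
    return ((n + multiple - 1) // multiple) * multiple

# Example: pad a vocabulary size and a hidden dimension.
vocab = pad_to_multiple(50257, 256)   # 50257 -> 50432
hidden = pad_to_multiple(1000, 64)    # 1000 -> 1024
```

A common real-world instance of this trick is padding a model's vocabulary size so the output projection's GEMM dimensions are divisible by 64 or 256.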
- NVidia GPU Documentation
- CUDA C++ Programming Guide html version (has the official list of what each compute capability version means, in the compute-capabilities appendix; Appendix K in recent versions, Appendix F in older ones). If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 of the CUDA Compatibility doc below) and then check that appendix in this doc.
- NVidia - CUDA Compatibility To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (Old version: NVidia - CUDA Compatibility, see Ch 3.)
- Ampere Architecture Whitepaper: NVIDIA A100 Tensor Core GPU Architecture
- Hopper Architecture Whitepaper: NVIDIA H100 Tensor Core GPU Architecture
- PTX Instruction Set See also here
- Examples of GPU performance analysis
- Shazeer 2019 (By one of the inventors of the Transformer)
Parallelism on GPUs
Summary of parallelism across devices from Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism:
There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. By increasing the minibatch size proportionally to the number of available workers (i.e. weak scaling), one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down the training time of large neural networks. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements.
However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model. Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates the memory pressure, but also increases the amount of parallelism independently of the microbatch size.
Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline where a different group of operations are performed. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.
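The "pipeline bubbles" mentioned above are commonly estimated with the GPipe-style formula (p - 1) / (m + p - 1), where p is the number of pipeline stages and m the number of microbatches; a quick sketch (not from the Megatron paper itself):

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: with p stages and m
    microbatches, (p - 1) of the (m + p - 1) time slots per step are
    spent filling and draining the pipeline."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# More microbatches shrink the bubble:
print(bubble_fraction(4, 4))   # 3/7, roughly 0.43
print(bubble_fraction(4, 32))  # 3/35, roughly 0.086
```

This is why pipeline-parallel training typically splits each minibatch into many microbatches: the fill/drain overhead is amortized as m grows.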
Implementation: https://github.com/NVIDIA/Megatron-LM
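The "distributed tensor computation" flavor of model parallelism that Megatron-LM uses can be illustrated with a minimal numpy sketch, where the "devices" are simulated as list entries and the all-gather is just a concatenation (dimensions and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, n_dev = 8, 16, 32, 4

x = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Reference: the full matmul on one "device".
y_full = x @ W

# Megatron-style column parallelism: each device holds a slice of W's
# output columns, computes its partial result locally, and the slices
# are concatenated (an all-gather in a real multi-GPU setup).
shards = np.split(W, n_dev, axis=1)          # one column block per device
partials = [x @ w_shard for w_shard in shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_full, y_parallel)
```

Each device stores only 1/n_dev of the weight matrix, which is how tensor parallelism relieves memory pressure independently of the microbatch size.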
Memory Reduction Techniques
See also Parameter-Efficient Tuning.
Memory reduction techniques allow increasing the batch size when training neural networks, which can lead to higher parallelism and faster training.
- Overviews
- Papers
- Gradient Checkpointing aka Activation Checkpointing: Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost
- Implemented in pytorch in torch.utils.checkpoint: Checkpointing
- Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory Gives an optimal checkpointing algorithm; calls the technique “activation checkpointing”.
- Computing the forward gradient instead of using backprop would allow you to reduce the memory cost of the computation graph (don't need to keep nodes that won't be used later). See this paper on forward gradient: Baydin et al 2022 - Gradients without Backpropagation github
- Huang et al 2019 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism “Allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently.” Jeff (2024): I believe this is what HuggingFace uses to train on multiple GPUs
- Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Like LoRA, but can be used for pre-training as well
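The memory/compute tradeoff behind gradient checkpointing (Chen et al 2016 above) can be shown with a toy pure-Python sketch: store only every k-th activation, and recompute the rest from the nearest checkpoint during the backward pass (the `layer` function here is a stand-in, not real training code):

```python
import math

def layer(x):
    """Stand-in for one network layer."""
    return math.tanh(x)

def forward_full(x, n_layers):
    """Standard forward: store every activation for the backward pass."""
    acts = [x]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    return acts  # n_layers + 1 stored values

def forward_checkpointed(x, n_layers, every):
    """Checkpointed forward: store only every `every`-th activation;
    the rest would be recomputed from the nearest checkpoint in backward."""
    checkpoints = {0: x}
    h = x
    for i in range(1, n_layers + 1):
        h = layer(h)
        if i % every == 0:
            checkpoints[i] = h
    return h, checkpoints

n = 32
full = forward_full(0.5, n)
out, ckpts = forward_checkpointed(0.5, n, every=8)
print(len(full), len(ckpts))  # 33 stored activations vs 5 checkpoints
```

Choosing the checkpoint spacing near sqrt(n) gives the O(sqrt(n)) memory of Chen et al's scheme, at the cost of roughly one extra forward pass of recomputation. In PyTorch this is what `torch.utils.checkpoint` does for you.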
Miscellaneous Deep Learning & GPU Papers
Miscellaneous Transformer & GPU Papers
- Overviews
Customized Implementations on GPUs
Performance of a neural network on GPUs can be improved by fusing sequential operations into custom CUDA kernels for the forward and backward passes. For an example, see Ivanov et al 2020 - Data Movement Is All You Need: A Case Study on Optimizing Transformers.
- Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference Fuses everything into one big kernel
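The data-movement argument behind kernel fusion can be sketched in plain Python/numpy: the unfused version below launches three separate "kernels", each round-tripping a full intermediate array through memory, while the fused version makes one pass with no materialized intermediates (this is a conceptual sketch, not an actual CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
bias = 0.3

# Unfused: three separate element-wise "kernels"; t1 and t2 are
# intermediate buffers that a GPU would write to and re-read from memory.
t1 = x + bias
t2 = np.maximum(t1, 0.0)     # ReLU
y_unfused = t2 * 2.0

# Fused: one pass over the data, no intermediate arrays materialized.
y_fused = np.empty_like(x)
for i in range(x.size):
    v = x[i] + bias
    v = v if v > 0.0 else 0.0
    y_fused[i] = v * 2.0

assert np.allclose(y_unfused, y_fused)
```

For memory-bandwidth-bound element-wise ops, the fused version cuts memory traffic from roughly six array-sized reads/writes to two, which is the effect a fused CUDA kernel achieves on real hardware.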
Resources
- NLP 202 - Deep Learning on GPUs (start at slide 11)
- Sasha Rush's GPU Puzzles - Exercises for learning to write GPU kernels in NUMBA
Software
- HuggingFace Accelerate Allows using multiple GPUs for training PyTorch models (blog post introduction)
- NVidia's BERT for PyTorch An optimized version of HuggingFace Transformers
- NVidia's TensorRT-LLM Transformer library for LLMs, current as of 2024