Table of Contents
GPU Deep Learning
Overviews
- NLP 202 - Deep Learning on GPUs (start at slide 11)
- LLMs
Details of Deep Learning on GPUs
- Introduction to Deep Learning on GPUs (key part here)
- Pdf version: GPU Performance Background User's Guide
- “Choose batch sizes and neuron counts greater than 128 to avoid being limited by memory bandwidth (Tesla V100)”. For GPUs with a higher FLOPS/memory bandwidth ratio, this will need to be higher.
- “Choose the batch size and the number of inputs and outputs to be divisible by at least 64 and ideally 256”
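The divisibility guidance above can be sketched as a small helper that rounds layer sizes up to the nearest multiple (the function name and values here are illustrative, not from the guide):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (e.g. 64 or 256) so GEMM
    dimensions map cleanly onto the GPU's tile sizes."""
    return ((n + multiple - 1) // multiple) * multiple

# Example: pad a vocabulary size and a hidden dimension.
vocab = pad_to_multiple(50257, 256)   # 50257 -> 50432
hidden = pad_to_multiple(1000, 64)    # 1000 -> 1024
```

A common real-world instance of this trick is padding a model's vocabulary size so the output projection's GEMM dimensions are divisible by 64 or 256.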
- NVidia GPU Documentation
- CUDA C++ Programming Guide html version (has the official list of what each compute capability version means, in the compute-capabilities appendix; Appendix K in recent versions, Appendix F in older ones). If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 of the CUDA Compatibility doc below) and then check that appendix in this doc.
- NVidia - CUDA Compatibility To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (Old version: NVidia - CUDA Compatibility, see Ch 3.)
- Ampere Architecture Whitepaper: NVIDIA A100 Tensor Core GPU Architecture
- Hopper Architecture Whitepaper: NVIDIA H100 Tensor Core GPU Architecture
- PTX Instruction Set See also here
- Examples of GPU performance analysis
- Shazeer 2019 (By one of the inventors of the Transformer)
Parallelism on GPUs
Summary of parallelism across devices from Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism:
There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. By increasing the minibatch size proportionally to the number of available workers (i.e. weak scaling), one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down the training time of large neural networks. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements.
However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model. Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates the memory pressure, but also increases the amount of parallelism independently of the microbatch size.
Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline where a different group of operations are performed. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.
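The "pipeline bubbles" mentioned above are commonly estimated with the GPipe-style formula (p - 1) / (m + p - 1), where p is the number of pipeline stages and m the number of microbatches; a quick sketch (not from the Megatron paper itself):

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: with p stages and m
    microbatches, (p - 1) of the (m + p - 1) time slots per step are
    spent filling and draining the pipeline."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# More microbatches shrink the bubble:
print(bubble_fraction(4, 4))   # 3/7, roughly 0.43
print(bubble_fraction(4, 32))  # 3/35, roughly 0.086
```

This is why pipeline-parallel training typically splits each minibatch into many microbatches: the fill/drain overhead is amortized as m grows.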
Implementation: https://github.com/NVIDIA/Megatron-LM
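The "distributed tensor computation" flavor of model parallelism that Megatron-LM uses can be illustrated with a minimal numpy sketch, where the "devices" are simulated as list entries and the all-gather is just a concatenation (dimensions and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, n_dev = 8, 16, 32, 4

x = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Reference: the full matmul on one "device".
y_full = x @ W

# Megatron-style column parallelism: each device holds a slice of W's
# output columns, computes its partial result locally, and the slices
# are concatenated (an all-gather in a real multi-GPU setup).
shards = np.split(W, n_dev, axis=1)          # one column block per device
partials = [x @ w_shard for w_shard in shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_full, y_parallel)
```

Each device stores only 1/n_dev of the weight matrix, which is how tensor parallelism relieves memory pressure independently of the microbatch size.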
Memory Reduction Techniques
See also Parameter-Efficient Tuning.
Memory reduction techniques allow increasing the batch size when training neural networks, which can lead to higher parallelism and faster training.
- Overviews
- Papers
- Gradient Checkpointing aka Activation Checkpointing: Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost
- Implemented in pytorch in torch.utils.checkpoint: Checkpointing
- Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory Gives an optimal checkpointing algorithm; calls the technique “activation checkpointing”.
- Computing the forward gradient instead of using backprop would allow you to reduce the memory cost of the computation graph (don't need to keep nodes that won't be used later). See this paper on forward gradient: Baydin et al 2022 - Gradients without Backpropagation github
- Huang et al 2019 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism “Allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently.” Jeff (2024): I believe this is what HuggingFace uses to train on multiple GPUs
- Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Like LoRA, but can be used for pre-training as well
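The memory/compute tradeoff behind gradient checkpointing (Chen et al 2016 above) can be shown with a toy pure-Python sketch: store only every k-th activation, and recompute the rest from the nearest checkpoint during the backward pass (the `layer` function here is a stand-in, not real training code):

```python
import math

def layer(x):
    """Stand-in for one network layer."""
    return math.tanh(x)

def forward_full(x, n_layers):
    """Standard forward: store every activation for the backward pass."""
    acts = [x]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    return acts  # n_layers + 1 stored values

def forward_checkpointed(x, n_layers, every):
    """Checkpointed forward: store only every `every`-th activation;
    the rest would be recomputed from the nearest checkpoint in backward."""
    checkpoints = {0: x}
    h = x
    for i in range(1, n_layers + 1):
        h = layer(h)
        if i % every == 0:
            checkpoints[i] = h
    return h, checkpoints

n = 32
full = forward_full(0.5, n)
out, ckpts = forward_checkpointed(0.5, n, every=8)
print(len(full), len(ckpts))  # 33 stored activations vs 5 checkpoints
```

Choosing the checkpoint spacing near sqrt(n) gives the O(sqrt(n)) memory of Chen et al's scheme, at the cost of roughly one extra forward pass of recomputation. In PyTorch this is what `torch.utils.checkpoint` does for you.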
Miscellaneous Deep Learning & GPU Papers
Miscellaneous Transformer & GPU Papers
- Overviews
Customized Implementations on GPUs
Performance of a neural network on GPUs can be improved by fusing sequential operations into custom CUDA kernels for the forward and backward passes. For an example, see Ivanov et al 2020 - Data Movement Is All You Need: A Case Study on Optimizing Transformers.
- Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference Fuses everything into one big kernel
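The data-movement argument behind kernel fusion can be sketched in plain Python/numpy: the unfused version below launches three separate "kernels", each round-tripping a full intermediate array through memory, while the fused version makes one pass with no materialized intermediates (this is a conceptual sketch, not an actual CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
bias = 0.3

# Unfused: three separate element-wise "kernels"; t1 and t2 are
# intermediate buffers that a GPU would write to and re-read from memory.
t1 = x + bias
t2 = np.maximum(t1, 0.0)     # ReLU
y_unfused = t2 * 2.0

# Fused: one pass over the data, no intermediate arrays materialized.
y_fused = np.empty_like(x)
for i in range(x.size):
    v = x[i] + bias
    v = v if v > 0.0 else 0.0
    y_fused[i] = v * 2.0

assert np.allclose(y_unfused, y_fused)
```

For memory-bandwidth-bound element-wise ops, the fused version cuts memory traffic from roughly six array-sized reads/writes to two, which is the effect a fused CUDA kernel achieves on real hardware.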
Resources
- NLP 202 - Deep Learning on GPUs (start at slide 11)
- Sasha Rush's GPU Puzzles - Exercises for learning to write GPU kernels in NUMBA
Software
- HuggingFace Accelerate Allows using multiple GPUs for training PyTorch models (blog post introduction)
- NVidia's BERT for PyTorch An optimized version of HuggingFace Transformers
- NVidia's TensorRT-LLM Transformer library for LLMs, current as of 2024