====== GPU Deep Learning ======

===== Overviews =====

  * [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/gpu-dl.pdf|NLP 202 - Deep Learning on GPUs]] (start at slide 11)
  * **LLMs**
    * [[https://arxiv.org/pdf/2404.14294|Zhou et al 2024 - A Survey on Efficient Inference for Large Language Models]]

===== Details of Deep Learning on GPUs =====

  * [[https://docs.nvidia.com/deeplearning/performance/index.html|NVidia Deep Learning Performance Documentation]]
    * **[[https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html|Introduction to Deep Learning on GPUs]]** (key part [[https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf|here]])
      * PDF version: [[https://docs.nvidia.com/deeplearning/performance/pdf/GPU-Performance-Background-User-Guide.pdf|GPU Performance Background User's Guide]]
    * [[https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html|Matrix Multiplication Background]]
    * [[https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html|Feedforward neural networks]]
      * "Choose batch sizes and neuron counts greater than 128 to avoid being limited by memory bandwidth (Tesla V100)". For GPUs with a higher FLOPS/memory-bandwidth ratio, this threshold will need to be higher.
      * "Choose the batch size and the number of inputs and outputs to be divisible by at least 64 and ideally 256"
    * [[https://docs.nvidia.com/deeplearning/performance/dl-performance-recurrent/index.html#recurrent-layer|RNNs]]
    * [[https://docs.nvidia.com/deeplearning/performance/dl-performance-getting-started/index.html|Summary of recommendations]]
  * NVidia GPU Documentation
    * [[https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf|CUDA C++ Programming Guide]] ([[https://docs.nvidia.com/cuda/cuda-c-programming-guide/|html version]]) Has the official list of what each compute capability version means (in the Compute Capabilities appendix). If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|here]]) and then check that appendix to find out.
    * [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]] To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc.
      * (old version: [[https://web.archive.org/web/20220808074234/https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]], see Ch 3)
    * Ampere Architecture Whitepaper: [[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf|NVIDIA A100 Tensor Core GPU Architecture]]
    * Hopper Architecture Whitepaper: [[https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf|NVIDIA H100 Tensor Core GPU Architecture]]
    * [[https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf|PTX Instruction Set]] See also [[https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#|here]]
  * Examples of GPU performance analysis
    * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019]] (by one of the inventors of the Transformer)

===== Parallelism on GPUs =====

Summary of parallelism across devices from [[paper:Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism]]:
There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. By increasing the minibatch size proportionally to the number of available workers (i.e. weak scaling), one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down the training time of large neural networks. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements. However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model. Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates the memory pressure, but also increases the amount of parallelism independently of the microbatch size. Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. 
In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline where a different group of operations are performed. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However, these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.
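The pipeline schedule described above can be sketched in plain Python (a toy illustration only, not the Megatron-LM or GPipe implementation; the two stage functions and the microbatch values are invented, and real systems overlap communication with computation):

```python
# Toy pipeline-parallel schedule: each "device" owns one stage (here just a
# function), and microbatches enter the pipeline one time step apart.

def stage0(x):  # hypothetical first group of layers
    return [v * 2 for v in x]

def stage1(x):  # hypothetical second group of layers
    return [v + 1 for v in x]

STAGES = [stage0, stage1]

def pipeline_forward(microbatches):
    """At time step t, stage s processes microbatch t - s. Returns the
    outputs plus the schedule, which makes the startup/drain 'bubble'
    (idle stages at the start and end) visible."""
    n_mb, n_st = len(microbatches), len(STAGES)
    outputs = list(microbatches)
    schedule = []  # schedule[t] = (stage, microbatch) pairs busy at step t
    for t in range(n_mb + n_st - 1):
        busy = []
        for s in range(n_st):
            m = t - s
            if 0 <= m < n_mb:
                outputs[m] = STAGES[s](outputs[m])
                busy.append((s, m))
        schedule.append(busy)
    return outputs, schedule

outs, sched = pipeline_forward([[1.0, 2.0], [3.0, 4.0]])
# 2 stages x 2 microbatches finish in 3 steps instead of 4, but the first
# and last steps each leave one stage idle -- the pipeline bubble.
```

With S stages and M microbatches, the schedule takes M + S - 1 steps, so stage utilization is roughly M / (M + S - 1); this is why GPipe-style schedules split each minibatch into many microbatches.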
Implementation: [[https://github.com/NVIDIA/Megatron-LM]]

===== Memory Reduction Techniques =====

See also [[ml:fine-tuning#parameter-efficient_tuning_pet|Parameter-Efficient Tuning]]. Memory reduction techniques allow increasing the batch size when training neural networks, which can lead to higher parallelism and faster training.

  * Overviews
    * [[https://arxiv.org/pdf/1904.10631.pdf|Sohoni et al 2019 - Low-Memory Neural Network Training: A Technical Report]]
  * Papers
    * **Gradient Checkpointing aka Activation Checkpointing**: [[https://arxiv.org/pdf/1604.06174.pdf|Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost]]
      * Implemented in PyTorch as torch.utils.checkpoint: [[https://pytorch.org/docs/stable/checkpoint.html|Checkpointing]]
      * [[https://arxiv.org/pdf/1911.13214|Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory]] This paper gives an optimal checkpointing algorithm, which it calls "activation checkpointing".
      * Computing the forward gradient instead of using backprop would reduce the memory cost of the computation graph (nodes that won't be used later need not be kept). See this paper on forward gradients: [[https://arxiv.org/pdf/2202.08587.pdf|Baydin et al 2022 - Gradients without Backpropagation]] [[https://github.com/orobix/fwdgrad|github]]
      * [[https://arxiv.org/pdf/1910.02653.pdf|Jain et al 2019 - Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization]]
    * [[https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf|Huang et al 2019 - GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism]] "Allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently."
      * Jeff (2024): I believe this is what HuggingFace uses to train on multiple GPUs
    * [[https://arxiv.org/pdf/2309.08708.pdf|Williams & Aletras 2023 - Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning]]
    * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Like LoRA, but can also be used for pre-training

===== Miscellaneous Deep Learning & GPU Papers =====

  * [[https://arxiv.org/pdf/1710.03740|Micikevicius et al 2017 - Mixed Precision Training]]
  * [[https://www.tensorflow.org/xla|2019 - XLA: Optimizing Compiler for Machine Learning]]
  * [[https://arxiv.org/pdf/2101.06840.pdf|Ren et al 2021 - ZeRO-Offload: Democratizing Billion-Scale Model Training]]

==== Miscellaneous Transformer & GPU Papers ====

  * **Overviews**
    * [[https://arxiv.org/pdf/2404.14294|Zhou et al 2024 - A Survey on Efficient Inference for Large Language Models]]
  * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
  * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]

===== Customized Implementations on GPUs =====

The performance of a neural network on GPUs can be improved by writing custom CUDA kernels that fuse sequences of operations in the forward and backward passes. For an example, see [[https://arxiv.org/pdf/2007.00072.pdf|Ivanov et al 2020 - Data Movement Is All You Need: A Case Study on Optimizing Transformers]].
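The kernel-fusion idea can be illustrated with a toy Python sketch (a conceptual model only, not actual GPU code; the bias-add + ReLU pair is an invented example): the unfused version materializes an intermediate array between two "kernels", while the fused version makes a single pass and never writes the intermediate, which is the saving a fused CUDA kernel provides for memory-bandwidth-bound elementwise ops.

```python
# Toy model of operator fusion: bias-add followed by ReLU.

def unfused_bias_relu(x, b):
    # "Kernel" 1: bias add. Writes an intermediate array back to memory.
    tmp = [xi + b for xi in x]
    # "Kernel" 2: ReLU. Has to read that intermediate back in.
    return [max(t, 0.0) for t in tmp]

def fused_bias_relu(x, b):
    # One fused "kernel": each element is read once and written once; the
    # intermediate never touches memory. For N elements this is roughly
    # 2N memory accesses instead of 4N, with identical results.
    return [max(xi + b, 0.0) for xi in x]

x = [-2.0, -0.5, 1.0, 3.0]
assert unfused_bias_relu(x, 1.0) == fused_bias_relu(x, 1.0)
```

Since elementwise ops like these are bandwidth-bound rather than compute-bound (see the NVidia performance background guide above), halving the memory traffic roughly halves the runtime, which is what compilers like TVM/XLA and hand-written fused kernels exploit.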
  * [[https://arxiv.org/pdf/1604.01946.pdf|Appleyard et al 2016 - Optimizing Performance of Recurrent Neural Networks on GPUs]] (paper from NVidia)
  * [[https://arxiv.org/pdf/1802.04799.pdf|Chen et al 2018 - TVM: An Automated End-to-End Optimizing Compiler for Deep Learning]]
  * [[https://on-demand.gputechconf.com/gtc-cn/2019/pdf/CN9468/presentation.pdf|Hsueh 2019 - Faster Transformer (slides)]]
  * [[https://arxiv.org/pdf/2006.16578.pdf|Li & Su 2020 - Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs]]
  * **[[https://arxiv.org/pdf/2007.00072.pdf|Ivanov et al 2020 - Data Movement Is All You Need: A Case Study on Optimizing Transformers]]**
  * [[https://arxiv.org/pdf/2010.05680.pdf|Fang et al 2020 - TurboTransformers: An Efficient GPU Serving System For Transformer Models]]
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses everything into one big kernel

===== Resources =====

  * [[https://classes.soe.ucsc.edu/nlp202/Winter22/slides/gpu-dl.pdf|NLP 202 - Deep Learning on GPUs]] (start at slide 11)
  * Sasha Rush's [[https://github.com/srush/GPU-Puzzles|GPU Puzzles]] - Exercises for learning to write GPU kernels in Numba

===== Software =====

  * [[https://github.com/huggingface/accelerate|HuggingFace Accelerate]] Allows using multiple GPUs for training PyTorch models ([[https://huggingface.co/blog/accelerate-library|blog post introduction]])
  * NVidia's [[https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_pytorch|BERT for PyTorch]] An optimized version of HuggingFace Transformers
  * NVidia's [[https://github.com/NVIDIA/TensorRT-LLM|TensorRT-LLM]] Transformer library for LLMs, current as of 2024
  * TVM: [[https://arxiv.org/pdf/1802.04799.pdf|Chen et al 2018 - TVM: An Automated End-to-End Optimizing Compiler for Deep Learning]]

==== Profiling ====

  * [[https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/|Can et al 2020 - NVidia - Profiling and Optimizing Deep Neural Networks with DLProf and PyProf]]
  * [[https://www.tensorflow.org/guide/gpu_performance_analysis|TensorFlow - Optimize TensorFlow GPU Performance with the TensorFlow Profiler]]

===== Conferences and Workshops =====

  * [[https://mlsys.org/|MLSys]]

===== Related Pages =====

  * [[Distributed Training]]
  * [[Efficient NNs]]
  * [[Systems & ML]]