ml:gpu_deep_learning

Last edited 2025/07/17 03:25 by jmflanig (previous revision 2025/03/25 07:48).
      * [[https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf|CUDA C++ Programming Guide]] ([[https://docs.nvidia.com/cuda/cuda-c-programming-guide/|HTML version]]) Has the official list of what each compute capability version means in Appendix K. If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|here]]) and then check Appendix F in this doc to find out.
      * [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]] To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (Old version: [[https://web.archive.org/web/20220808074234/https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]], see Ch 3.)
    * Ampere Architecture Whitepaper: [[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf|NVIDIA A100 Tensor Core GPU Architecture]]
    * Hopper Architecture Whitepaper: [[https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf|NVIDIA H100 Tensor Core GPU Architecture]]
    * [[https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf|PTX Instruction Set]] See also [[https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#|here]]
    * Examples of GPU performance analysis
      * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019]] (By one of the inventors of the Transformer)
    * **Gradient Checkpointing aka Activation Checkpointing**: [[https://arxiv.org/pdf/1604.06174.pdf|Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost]]
      * Implemented in PyTorch in torch.utils.checkpoint: [[https://pytorch.org/docs/stable/checkpoint.html|Checkpointing]]
      * [[https://arxiv.org/pdf/1911.13214|Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory]] This paper gives an optimal checkpointing algorithm and calls the technique "activation checkpointing".
      * Computing the forward gradient instead of using backprop would let you reduce the memory cost of the computation graph (nodes that won't be used later don't need to be kept). See this paper on the forward gradient: [[https://arxiv.org/pdf/2202.08587.pdf|Baydin et al 2022 - Gradients without Backpropagation]] [[https://github.com/orobix/fwdgrad|github]]
    * [[https://arxiv.org/pdf/1910.02653.pdf|Jain et al 2019 - Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization]]
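The memory/compute trade behind Chen et al's sublinear-memory strategy can be sketched without any framework. The pure-Python toy below (all names are made up for illustration) runs a chain of n functions, stores only every k-th activation as a "checkpoint", and recomputes any other activation from the nearest earlier checkpoint; with k ≈ sqrt(n) this stores O(sqrt(n)) activations instead of O(n). A real implementation such as torch.utils.checkpoint does this inside the autograd engine during the backward pass.

```python
import math

def forward_all(x, fs):
    # Naive forward: keep every intermediate activation (O(n) memory).
    acts = [x]
    for f in fs:
        acts.append(f(acts[-1]))
    return acts

def forward_checkpointed(x, fs, k):
    # Keep only every k-th activation; the rest will be recomputed on demand.
    ckpts = {0: x}
    h = x
    for i, f in enumerate(fs, start=1):
        h = f(h)
        if i % k == 0:
            ckpts[i] = h
    return h, ckpts

def recompute_segment(ckpts, fs, k, i):
    # Rebuild activation i by re-running the chain from the nearest checkpoint.
    start = (i // k) * k
    h = ckpts[start]
    for j in range(start, i):
        h = fs[j](h)
    return h

n = 16
fs = [lambda v, a=a: v + a for a in range(n)]  # toy "layers"
k = int(math.isqrt(n))                         # sqrt(n) checkpoint spacing

acts = forward_all(0, fs)
out, ckpts = forward_checkpointed(0, fs, k)
assert out == acts[-1]            # same result
assert len(ckpts) < len(acts)     # but far fewer stored activations
# Every activation is still recoverable via recomputation:
assert all(recompute_segment(ckpts, fs, k, i) == acts[i] for i in range(n + 1))
```

The recomputation in the backward pass costs roughly one extra forward over each segment, which is the "sublinear memory for ~33% extra compute" deal the paper describes.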
  * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
  * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
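As a side note on the forward-gradient idea (Baydin et al 2022, listed under memory reduction above): forward-mode differentiation can be sketched with dual numbers in a few lines. This is a toy illustration only, not the paper's implementation; a single forward pass along a direction v yields the directional derivative ⟨∇f, v⟩ with no computation graph kept in memory.

```python
class Dual:
    # Minimal forward-mode AD value: carries a primal value and its
    # tangent (derivative along the chosen direction v).
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule propagates the tangent forward.
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def f(x, y):
    return x * x + 3.0 * x * y  # df/dx = 2x + 3y, df/dy = 3x

# One forward pass along v = (1, 0) gives df/dx at (x, y) = (2, 5).
x, y, v = 2.0, 5.0, (1.0, 0.0)
out = f(Dual(x, v[0]), Dual(y, v[1]))
assert out.val == f(x, y)        # primal value: 34.0
assert out.dot == 2 * x + 3 * y  # directional derivative: 19.0
```

Baydin et al use such a forward pass along a random direction as an unbiased gradient estimate, which is what lets them drop backprop's stored activations.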
  
===== Customized Implementations on GPUs =====
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses everything into one big kernel
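The trick shared by the FlashAttention papers above is computing softmax attention blockwise with a running max and normalizer, so the full score vector is never materialized. Below is a minimal single-query, pure-Python sketch of that online softmax (illustrative only and with made-up names; the real kernels tile Q/K/V through SRAM and fuse the matmuls):

```python
import math

def naive_attention(q, ks, vs):
    # Reference: materialize all scores, then softmax-weight the values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    d = len(vs[0])
    return [sum(w * v[j] for w, v in zip(ws, vs)) / z for j in range(d)]

def online_attention(q, ks, vs, block=2):
    # FlashAttention-style streaming: process keys/values block by block,
    # keeping only a running max m, running normalizer z, and running output.
    d = len(vs[0])
    m = -math.inf
    z = 0.0
    out = [0.0] * d
    for start in range(0, len(ks), block):
        kb, vb = ks[start:start + block], vs[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescale earlier partial results
        z *= scale
        out = [o * scale for o in out]
        for s, v in zip(scores, vb):
            w = math.exp(s - m_new)
            z += w
            out = [o + w * vj for o, vj in zip(out, v)]
        m = m_new
    return [o / z for o in out]

q = [0.1, 0.2]
ks = [[0.3, 0.1], [0.2, 0.4], [0.5, 0.0], [0.1, 0.9]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
ref = naive_attention(q, ks, vs)
got = online_attention(q, ks, vs, block=2)
assert all(abs(a - b) < 1e-12 for a, b in zip(ref, got))
```

Because each block only touches O(block) scores, memory stays constant in sequence length, which is what lets the kernels keep the working set in on-chip SRAM.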
  
===== Resources =====
===== Related Pages =====
  * [[Distributed Training]]
  * [[Efficient NNs]]
  * [[Systems & ML]]
  
  
ml/gpu_deep_learning.1742888924.txt.gz · Last modified: 2025/03/25 07:48 by jmflanig
