Differences

--- ml:gpu_deep_learning [2025/03/25 07:44] – [Memory Reduction Techniques] jmflanig
+++ ml:gpu_deep_learning [2025/07/17 03:25] (current) – [Miscellaneous Transformer & GPU Papers] jmflanig
@@ Line 19: / Line 19: @@
      * [[https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf|CUDA C++ Programming Guide]] [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/|html version]] (Appendix K has the official list of what each compute capability version means.) To find out how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|here]]) and then check Appendix F in this doc.
      * [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]] To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (old version: [[https://web.archive.org/web/20220808074234/https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]], see Ch 3)
+    * Ampere Architecture Whitepaper: [[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf|NVIDIA A100 Tensor Core GPU Architecture]]
+    * Hopper Architecture Whitepaper: [[https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf|NVIDIA H100 Tensor Core GPU Architecture]]
+    * [[https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf|PTX Instruction Set]] See also [[https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#|here]]
    * Examples of GPU performance analysis
      * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019]] (By one of the inventors of the Transformer)
@@ Line 80: / Line 83: @@
     * [[https://arxiv.org/pdf/1904.10631.pdf|Sohoni et al 2019 - Low-Memory Neural Network Training: A Technical Report]]
   * Papers
-    * **Gradient Checkpointing**: [[https://arxiv.org/pdf/1604.06174.pdf|Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost]]
+    * **Gradient Checkpointing aka Activation Checkpointing**: [[https://arxiv.org/pdf/1604.06174.pdf|Chen et al 2016 - Training Deep Nets with Sublinear Memory Cost]]
       * Implemented in PyTorch in torch.utils.checkpoint: [[https://pytorch.org/docs/stable/checkpoint.html|Checkpointing]]
-      * A paper that seems to have re-invented this as "activation checkpointing" (and often cited): [[https://arxiv.org/pdf/1911.13214|Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory]]
+      * [[https://arxiv.org/pdf/1911.13214|Beaumont et al 2019 - Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory]] Presents an optimal checkpointing algorithm and calls the technique "activation checkpointing".
       * Computing the forward gradient instead of using backprop would allow you to reduce the memory cost of the computation graph (don't need to keep nodes that won't be used later).  See this paper on forward gradient: [[https://arxiv.org/pdf/2202.08587.pdf|Baydin et al 2022 - Gradients without Backpropagation]] [[https://github.com/orobix/fwdgrad|github]]
     * [[https://arxiv.org/pdf/1910.02653.pdf|Jain et al 2019 - Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization]]
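
The gradient (activation) checkpointing idea referenced in this hunk can be illustrated with a toy example. The following is only a minimal pure-Python sketch, assuming a chain of scalar tanh layers; it is not how torch.utils.checkpoint is implemented, and all function names (forward_layer, backward_layer, grad_full, grad_checkpointed) are hypothetical. It keeps only one stored input per segment during the forward pass and recomputes the rest inside each segment during the backward pass, trading extra compute for fewer stored activations.

```python
import math


def forward_layer(x, w):
    # One toy "layer": y = tanh(w * x).
    return math.tanh(w * x)


def backward_layer(x, w, grad_y):
    # Given the layer input x and the upstream gradient dL/dy,
    # return (dL/dx, dL/dw) for y = tanh(w * x).
    y = math.tanh(w * x)
    dpre = grad_y * (1.0 - y * y)
    return dpre * w, dpre * x


def grad_full(x0, weights):
    # Standard backprop: every activation is stored during the forward pass.
    acts = [x0]
    for w in weights:
        acts.append(forward_layer(acts[-1], w))
    grad = 1.0  # the loss is taken to be the final output
    grads_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads_w[i] = backward_layer(acts[i], weights[i], grad)
    return grads_w, len(acts)  # second value: number of stored activations


def grad_checkpointed(x0, weights, segment):
    # Checkpointed backprop: store only the input of each segment,
    # then recompute the activations of one segment at a time.
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(x, w)

    grad = 1.0
    grads_w = [0.0] * n
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Recompute the inputs of layers start .. end-1 from the checkpoint.
        acts = [checkpoints[start]]
        for i in range(start, end - 1):
            acts.append(forward_layer(acts[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(acts))
        for i in reversed(range(start, end)):
            grad, grads_w[i] = backward_layer(acts[i - start], weights[i], grad)
    return grads_w, peak  # second value: peak number of stored activations


if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    g_full, stored_full = grad_full(0.3, weights)
    g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, segment=4)
    print("stored activations (full):", stored_full)
    print("peak stored activations (checkpointed):", stored_ckpt)
```

With a segment length near the square root of the number of layers, the peak number of stored activations grows roughly with the square root of the depth, at the cost of about one extra forward pass. The papers listed in this hunk discuss how to choose checkpoints more carefully than this fixed-segment scheme.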
@@ Line 99: / Line 102: @@
   * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
   * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
+  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
 ===== Customized Implementations on GPUs =====
@@ Line 111: / Line 115: @@
   * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
   * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
+  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
+  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses the whole model into a single large kernel.
 ===== Resources =====
@@ Line 131: / Line 137: @@
 ===== Related Pages =====
   * [[Distributed Training]]
+  * [[Efficient NNs]]
   * [[Systems & ML]]