ml:gpu_deep_learning
      * [[https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf|CUDA C++ Programming Guide]] [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/|html version]] (Has the official list of what each compute capability version means in Appendix K.) If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|here]]) and then check Appendix F in this doc to find out.
      * [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]] To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (old version: [[https://web.archive.org/web/20220808074234/https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]], see Ch 3)
    * Ampere Architecture Whitepaper: [[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf|NVIDIA A100 Tensor Core GPU Architecture]]
    * Hopper Architecture Whitepaper: [[https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf|NVIDIA H100 Tensor Core GPU Architecture]]
    * [[https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf|PTX Instruction Set]] See also [[https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#|here]]
    * Examples of GPU performance analysis
      * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019]] (By one of the inventors of the Transformer)
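As a worked example of what the compute-capability tables in the docs above let you calculate: once you know a GPU's cores per SM, SM count, and clock, peak FP32 throughput is just cores/SM × SMs × 2 ops per FMA × clock. A minimal Python sketch — the function name is mine, and the per-SM core counts are an illustrative subset transcribed from the Programming Guide's tables:

```python
# FP32 CUDA cores per SM by compute capability
# (illustrative subset of the CUDA C++ Programming Guide's tables)
CORES_PER_SM = {
    (3, 5): 192,  # Kepler
    (6, 0): 64,   # Pascal P100
    (7, 0): 64,   # Volta V100
    (7, 5): 64,   # Turing
    (8, 0): 64,   # Ampere A100
    (8, 6): 128,  # Ampere GA10x
    (9, 0): 128,  # Hopper H100
}

def peak_fp32_tflops(cc, num_sms, clock_ghz):
    """Peak FP32 = SMs * cores/SM * 2 ops per FMA * clock (GHz)."""
    return num_sms * CORES_PER_SM[cc] * 2 * clock_ghz / 1e3

# A100: 108 SMs at ~1.41 GHz boost
print(round(peak_fp32_tflops((8, 0), 108, 1.41), 1))  # → 19.5
```

The result matches the 19.5 TFLOPS FP32 figure in the A100 whitepaper linked above, which is a good sanity check that you looked up the right compute capability.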
  * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
  * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
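The core idea in the PagedAttention paper above is to manage the KV cache like virtual memory: carve it into fixed-size blocks and give each sequence a block table mapping logical block indices to physical blocks, so memory is allocated on demand rather than reserved contiguously per sequence. A toy Python sketch of that bookkeeping — class and method names are hypothetical, not vLLM's actual API:

```python
class PagedKVCache:
    """Toy block-table KV cache in the spirit of PagedAttention.
    Storage is a pool of fixed-size blocks; each sequence owns a
    block table (logical block index -> physical block id)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.storage = [[None] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # free-block list
        self.tables = {}                     # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens written

    def append(self, seq_id, kv):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full (or none yet):
            table.append(self.free.pop())  # allocate a block on demand
        blk = table[-1]
        self.storage[blk][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def tokens(self, seq_id):
        """Gather a sequence's KV entries by walking its block table."""
        out = []
        for i in range(self.lengths[seq_id]):
            blk = self.tables[seq_id][i // self.block_size]
            out.append(self.storage[blk][i % self.block_size])
        return out
```

The attention kernel then indexes KV blocks through the table instead of assuming contiguity, which is what lets the real system avoid fragmentation and share blocks between sequences.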
  
===== Customized Implementations on GPUs =====
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses everything into one big kernel.
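What makes the FlashAttention papers' tiling exact is the online-softmax trick: attention can stream over key/value blocks while carrying only a running max, a running normalizer, and a weighted value accumulator, so the full score matrix is never materialized. A NumPy sketch for a single query vector — function names are mine, and the real kernels additionally tile over queries and keep the blocks in SRAM:

```python
import numpy as np

def attention_naive(q, K, V):
    """Reference: materialize all scores, then softmax."""
    s = K @ q
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def attention_streaming(q, K, V, block=4):
    """Online softmax: visit K/V in blocks, rescaling the running
    state whenever a new maximum score appears."""
    m, l = -np.inf, 0.0           # running max and normalizer
    acc = np.zeros(V.shape[1])    # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q    # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale old state to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=(16,))
K, V = rng.normal(size=(10, 16)), rng.normal(size=(10, 8))
print(np.allclose(attention_naive(q, K, V),
                  attention_streaming(q, K, V)))  # → True
```

The rescale-by-`exp(m - m_new)` step is the whole trick: it retroactively corrects everything accumulated under the old max, so the blocked result is bitwise-close to the naive softmax, not an approximation.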
  
===== Resources =====
Last modified: 2025/04/12 01:03 by jmflanig
