ml:gpu_deep_learning
      * [[https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf|CUDA C++ Programming Guide]] [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/|html version]] (Has the official list of what each compute capability version means in Appendix K.) If you want to know how many CUDA cores, warp schedulers, etc. each SM in a GPU has, look up the GPU's compute capability version (Chapter 5 [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|here]]) and then check Appendix F in this doc to find out.
      * [[https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]] To find the compute capability of a GPU, find out its generation (Kepler, Turing, etc.) and look up the compute capability in Chapter 5 of this doc. (old version: [[https://web.archive.org/web/20220808074234/https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf|NVidia - CUDA Compatibility]], see Ch 3)
    * Ampere Architecture Whitepaper: [[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf|NVIDIA A100 Tensor Core GPU Architecture]]
    * Hopper Architecture Whitepaper: [[https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf|NVIDIA H100 Tensor Core GPU Architecture]]
    * [[https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf|PTX Instruction Set]] See also [[https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#|here]]
    * Examples of GPU performance analysis
      * [[https://arxiv.org/pdf/1911.02150.pdf|Shazeer 2019]] (By one of the inventors of the Transformer)
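As a worked example of what the compute-capability tables in the docs above let you calculate: once you know a GPU's cores per SM, SM count, and clock, peak FP32 throughput is just cores/SM × SMs × 2 ops per FMA × clock. A minimal Python sketch — the function name is mine, and the per-SM core counts are an illustrative subset transcribed from the Programming Guide's tables:

```python
# FP32 CUDA cores per SM by compute capability
# (illustrative subset of the CUDA C++ Programming Guide's tables)
CORES_PER_SM = {
    (3, 5): 192,  # Kepler
    (6, 0): 64,   # Pascal P100
    (7, 0): 64,   # Volta V100
    (7, 5): 64,   # Turing
    (8, 0): 64,   # Ampere A100
    (8, 6): 128,  # Ampere GA10x
    (9, 0): 128,  # Hopper H100
}

def peak_fp32_tflops(cc, num_sms, clock_ghz):
    """Peak FP32 = SMs * cores/SM * 2 ops per FMA * clock (GHz)."""
    return num_sms * CORES_PER_SM[cc] * 2 * clock_ghz / 1e3

# A100: 108 SMs at ~1.41 GHz boost
print(round(peak_fp32_tflops((8, 0), 108, 1.41), 1))  # → 19.5
```

The result matches the 19.5 TFLOPS FP32 figure in the A100 whitepaper linked above, which is a good sanity check that you looked up the right compute capability.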
  * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
  * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
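The core idea in the PagedAttention paper above is to manage the KV cache like virtual memory: carve it into fixed-size blocks and give each sequence a block table mapping logical block indices to physical blocks, so memory is allocated on demand rather than reserved contiguously per sequence. A toy Python sketch of that bookkeeping — class and method names are hypothetical, not vLLM's actual API:

```python
class PagedKVCache:
    """Toy block-table KV cache in the spirit of PagedAttention.
    Storage is a pool of fixed-size blocks; each sequence owns a
    block table (logical block index -> physical block id)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.storage = [[None] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # free-block list
        self.tables = {}                     # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens written

    def append(self, seq_id, kv):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full (or none yet):
            table.append(self.free.pop())  # allocate a block on demand
        blk = table[-1]
        self.storage[blk][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def tokens(self, seq_id):
        """Gather a sequence's KV entries by walking its block table."""
        out = []
        for i in range(self.lengths[seq_id]):
            blk = self.tables[seq_id][i // self.block_size]
            out.append(self.storage[blk][i % self.block_size])
        return out
```

The attention kernel then indexes KV blocks through the table instead of assuming contiguity, which is what lets the real system avoid fragmentation and share blocks between sequences.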
  
===== Customized Implementations on GPUs =====
  * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
  * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses everything into one big kernel.
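What makes the FlashAttention papers' tiling exact is the online-softmax trick: attention can stream over key/value blocks while carrying only a running max, a running normalizer, and a weighted value accumulator, so the full score matrix is never materialized. A NumPy sketch for a single query vector — function names are mine, and the real kernels additionally tile over queries and keep the blocks in SRAM:

```python
import numpy as np

def attention_naive(q, K, V):
    """Reference: materialize all scores, then softmax."""
    s = K @ q
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def attention_streaming(q, K, V, block=4):
    """Online softmax: visit K/V in blocks, rescaling the running
    state whenever a new maximum score appears."""
    m, l = -np.inf, 0.0           # running max and normalizer
    acc = np.zeros(V.shape[1])    # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q    # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale old state to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=(16,))
K, V = rng.normal(size=(10, 16)), rng.normal(size=(10, 8))
print(np.allclose(attention_naive(q, K, V),
                  attention_streaming(q, K, V)))  # → True
```

The rescale-by-`exp(m - m_new)` step is the whole trick: it retroactively corrects everything accumulated under the old max, so the blocked result is bitwise-close to the naive softmax, not an approximation.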
  
===== Resources =====
Last modified: 2025/04/12 01:03 by jmflanig
