--- ml:gpu_deep_learning [2025/05/13 18:10] – [Details of Deep Learning on GPUs] jmflanig
+++ ml:gpu_deep_learning [2025/07/17 03:25] (current) – [Miscellaneous Transformer & GPU Papers] jmflanig
@@ Line 102: / Line 102: @@
   * [[https://arxiv.org/pdf/2309.06180|Kwon et al 2023 - Efficient Memory Management for Large Language Model Serving with PagedAttention]]
   * [[https://arxiv.org/pdf/2205.05198|Korthikanti et al 2022 - Reducing Activation Recomputation in Large Transformer Models]]
+  * [[https://arxiv.org/pdf/2503.15798|Jie et al 2025 - Mixture of Lookup Experts]]
 ===== Customized Implementations on GPUs =====
@@ Line 114: / Line 115: @@
   * [[https://arxiv.org/pdf/2205.14135.pdf|Dao et al 2022 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness]]
   * [[https://arxiv.org/pdf/2307.08691.pdf|Dao 2023 - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning]]
+  * [[https://arxiv.org/pdf/2407.08608|Shah et al 2024 - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision]]
+  * [[https://arxiv.org/pdf/2505.22758|Nrusimha et al 2025 - FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference]] Fuses the whole model's forward pass into a single kernel
 ===== Resources =====