📚 Modern CUDA Learn Notes with PyTorch for Beginners: it covers Tensor/CUDA Cores, TF32/F16/BF16/F8, 📖150+ CUDA Kernels🔥🔥 with PyTorch bindings, 📖30+ LLM/VLM🔥, 📖40+ CV/C++...🔥, 📖50+ CUDA/CuTe...🔥 Blogs and the 📖toy-hgemm library⚡️⚡️, which can achieve 98%~100% of the performance of cuBLAS. Check 📖HGEMM Supported Matrix👇 for technical details. You are welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, the HGEMM (WMMA/MMA/CuTe) implemented in this repo (blue 🔵) can achieve 98%~100% of the performance of cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (orange 🟠). Please check toy-hgemm library⚡️⚡️ for more details.
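The 98%~100% figure is a throughput ratio between two kernels run on the same problem size. As a rough sketch of how such a ratio can be computed, assuming each kernel has already been timed: a GEMM of shape M x N x K performs 2*M*N*K floating-point operations (one multiply and one add per term). The function names and the sample timings below are hypothetical and are not measurements from this repo.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one M x N x K GEMM in TFLOPS.

    Counts 2*M*N*K FLOPs: one multiply and one add per inner-product term.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def relative_performance(custom_seconds: float, baseline_seconds: float) -> float:
    """Custom-kernel throughput as a fraction of the baseline (same M, N, K)."""
    if custom_seconds <= 0 or baseline_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / custom_seconds


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    m = n = k = 4096
    baseline_s = 1.00e-3   # assumed time of the baseline (e.g. cuBLAS) call
    custom_s = 1.02e-3     # assumed time of the custom HGEMM kernel
    print(f"baseline: {gemm_tflops(m, n, k, baseline_s):.1f} TFLOPS")
    print(f"custom:   {gemm_tflops(m, n, k, custom_s):.1f} TFLOPS")
    print(f"ratio:    {relative_performance(custom_s, baseline_s):.1%}")
```

Because both kernels solve the same problem, the FLOP count cancels and the ratio reduces to baseline time divided by custom time.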
CUDA Cores | Sliced K (Loop over K) | Tile Block (BMxBK) | Tile Thread (t 8x8) |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
WMMA (m16n16k16) | MMA (m16n8k16) | Pack LDST (128 bits) | SMEM Padding |
✔️ | ✔️ | ✔️ | ✔️ |
Copy Async | Tile MMA (More Threads) | Tile Warp (More Values) | Multi Stages (2/3/4) |
✔️ | ✔️ | ✔️ | ✔️ |
Reg Double Buffers | Block Swizzle | Warp Swizzle | SMEM Swizzle (CuTe) |
✔️ | ✔️ | ✔️ | ✔️ |
Collective Store (Warp Shfl) | Row Major (NN) | Col Major (TN) | SGEMM FP32/TF32 |
✔️ | ✔️ | ✔️ | ✔️ |
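
Two of the entries above, "Sliced K (Loop over K)" and "Tile Block", describe how the work is partitioned rather than any specific hardware instruction, so they can be illustrated on the CPU. The following is a minimal pure-Python sketch of that partitioning, not the repo's kernels: the function name `tiled_gemm` and the tile sizes `bm`, `bn`, `bk` are hypothetical. On a GPU, each (i0, j0) tile of C would typically map to one thread block, and each K slice would be staged through shared memory.

```python
def tiled_gemm(a, b, bm=2, bn=2, bk=2):
    """Compute C = A @ B tile by tile, accumulating over K in slices of bk.

    a is an M x K list of rows, b is a K x N list of rows.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bm):              # tile rows of C
        for j0 in range(0, n, bn):          # tile columns of C
            for k0 in range(0, k, bk):      # sliced K: one partial product per slice
                for i in range(i0, min(i0 + bm, m)):
                    for j in range(j0, min(j0 + bn, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + bk, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_gemm(a, b))
```

The `min(...)` bounds handle matrix sizes that are not multiples of the tile sizes, which a real kernel would handle with padding or boundary checks.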
@misc{CUDA-Learn-Notes@2024,
title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch for Beginners},
url={https://github.com/DefTruth/CUDA-Learn-Notes},
note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes},
author={DefTruth etc},
year={2024}
}
📖 150+ CUDA Kernels 🔥🔥 (frequently asked interview questions)
Workflow: custom CUDA kernel impl -> PyTorch Python bindings -> Run tests. 👉TIPS: `*` = Tensor Cores (WMMA/MMA), otherwise CUDA Cores; `/` = not supported; ✔️ = supported; ❔ = planned.
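📖 LLM | Multimodal | Diffusion | Inference Optimization (by the author)
📖 CV Inference Deployment | C++ | Algorithms | Technical Notes (by the author)
📖 CUTLASS | CuTe | NCCL | CUDA | Recommended Articles (by other authors)
💡Note: This section collects some articles that I particularly like. PRs recommending more excellent articles are welcome!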
©️License
GNU General Public License v3.0
🎉Contribute
How to contribute? Please check 🌤🌤CONTRIBUTE🎉🎉.