Commit
Fused L2 1-NN based on cutlass 3xTF32 / DMMA (#1118)
-- Adds a persistent FusedL2NN kernel built on CUTLASS 3xTF32 and DMMA. It is loosely based on grouped GEMM but customized for a single problem size.
-- The performance benefit of this implementation grows with `k`: up to 1.3x for k == 64, up to 1.53x for k == 128, and up to 1.67x for k == 256.
-- For all sizes of `k`, this kernel outperforms the previous implementation.
-- Attaching the FusedL2NN benchmark results for the previous implementation and for this CUTLASS version.

Authors:
- Mahesh Doijade (https://github.com/mdoijade)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Tamas Bela Feher (https://github.com/tfeher)

URL: #1118
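
For readers unfamiliar with the operation, the following is a minimal CPU sketch of what a fused L2 1-NN computes: for each row of `x`, the index of the nearest row of `y` and the squared L2 distance to it. It is not the CUTLASS kernel from this commit. The function name `fused_l2_nn_reference` and the row-major layout are assumptions made for illustration. The sketch uses the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, which is the usual reason a GEMM library applies here: the dot-product term is a matrix multiply over the `k` dimension, and the minimum is taken without storing the full distance matrix.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical reference helper (not part of the commit).
// x holds m rows of length k, y holds n rows of length k, both row-major.
// Returns, for each row of x, {index of nearest row of y, squared L2 distance}.
std::vector<std::pair<int, float>> fused_l2_nn_reference(
    const std::vector<float>& x, const std::vector<float>& y,
    std::size_t m, std::size_t n, std::size_t k)
{
  // Squared row norms, computed once and reused for every pair.
  std::vector<float> xn(m, 0.0f), yn(n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t d = 0; d < k; ++d) xn[i] += x[i * k + d] * x[i * k + d];
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t d = 0; d < k; ++d) yn[j] += y[j * k + d] * y[j * k + d];

  std::vector<std::pair<int, float>> out(
      m, {-1, std::numeric_limits<float>::max()});

  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      // Dot product over k: this is the GEMM-shaped part of the work.
      float dot = 0.0f;
      for (std::size_t d = 0; d < k; ++d) dot += x[i * k + d] * y[j * k + d];

      // Expanded squared distance; the running minimum is updated in place,
      // so the m x n distance matrix is never materialized.
      float dist = xn[i] + yn[j] - 2.0f * dot;
      if (dist < out[i].second) out[i] = {static_cast<int>(j), dist};
    }
  }
  return out;
}
```

In this sketch the inner loop over `d` does `k` multiply-adds per pair while the norm addition and minimum update cost a fixed amount per pair, so the GEMM-shaped part accounts for a larger share of the work as `k` grows.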