
Forward-merge branch-24.02 to branch-24.04 #2114

Merged 1 commit into branch-24.04 on Jan 23, 2024

Conversation

GPUtester (Contributor)

Forward-merge triggered by a push to branch-24.02, creating a PR to keep branch-24.04 up to date. If this PR cannot be merged immediately due to conflicts, it will remain open for the team to merge manually.

This PR replaces the current cuBLAS `gemm` backend of `raft::linalg::gemm` with cublasLt `matmul`. The latter is more flexible and allows decoupling the heuristic selection of the algorithm from its execution.
Thanks to this change, this PR adds memoization of the matmul heuristics and of the other arguments (the matrix layouts and the matmul descriptor).
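
To illustrate the idea, here is a minimal sketch, not the actual RAFT implementation: it assumes float32, column-major, non-transposed inputs, omits error checking, and the names `MatmulPlan`, `PlanKey`, and `get_plan` are invented for illustration. The shape-dependent cublasLt objects and the heuristic choice are built once and cached, leaving only the cheap `cublasLtMatmul` enqueue on the hot path:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical cache entry: every cublasLt object that depends only on the
// problem shape, plus the heuristic result, created once and reused.
struct MatmulPlan {
  cublasLtMatmulDesc_t op_desc{};
  cublasLtMatrixLayout_t a_desc{}, b_desc{}, c_desc{};
  cublasLtMatmulHeuristicResult_t heuristic{};
};

// Keyed on shape only for brevity; a real cache would also key on dtypes,
// transpose flags, and leading dimensions, and would need thread safety.
using PlanKey = std::tuple<uint64_t, uint64_t, uint64_t>;  // (m, n, k)

MatmulPlan& get_plan(cublasLtHandle_t handle, uint64_t m, uint64_t n, uint64_t k)
{
  static std::map<PlanKey, MatmulPlan> cache;  // not thread-safe as written
  PlanKey key{m, n, k};
  auto [it, is_new] = cache.try_emplace(key);
  MatmulPlan& plan = it->second;
  if (is_new) {
    // Done once per shape (error checking omitted for brevity).
    cublasLtMatmulDescCreate(&plan.op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasLtMatrixLayoutCreate(&plan.a_desc, CUDA_R_32F, m, k, m);  // column-major
    cublasLtMatrixLayoutCreate(&plan.b_desc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&plan.c_desc, CUDA_R_32F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    int n_results = 0;
    // The relatively expensive step that gets memoized: algorithm selection.
    cublasLtMatmulAlgoGetHeuristic(handle, plan.op_desc, plan.a_desc, plan.b_desc,
                                   plan.c_desc, plan.c_desc, pref, /*requested=*/1,
                                   &plan.heuristic, &n_results);
    cublasLtMatmulPreferenceDestroy(pref);
  }
  return plan;
}

// Hot path: on a cache hit, only the cublasLtMatmul enqueue remains.
void gemm(cublasLtHandle_t handle, uint64_t m, uint64_t n, uint64_t k,
          const float* A, const float* B, float* C, cudaStream_t stream)
{
  const float alpha = 1.0f, beta = 0.0f;
  MatmulPlan& plan = get_plan(handle, m, n, k);
  cublasLtMatmul(handle, plan.op_desc, &alpha, A, plan.a_desc, B, plan.b_desc,
                 &beta, C, plan.c_desc, C, plan.c_desc, &plan.heuristic.algo,
                 /*workspace=*/nullptr, /*workspaceSize=*/0, stream);
}
```

With the legacy cuBLAS path, algorithm selection happens inside every call; moving it off the hot path via the cache is presumably where the launch-latency saving described below comes from.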

#### Performance on specific workloads
IVF-PQ performs two gemm operations during preprocessing on small work sizes. The preprocessing consists of a few kernel launches and rather heavy logic on the CPU side (which results in gaps between the kernel launches).
This PR **roughly halves the gemm kernel launch latency** (approx. 10us -> 5us, as measured by NVTX from entering the `matmul` wrapper on the host to the launch of the kernel).
As a motivating example, this PR improves the QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128, n_queries = 1, n_probes = 20 and 200).
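
For context, a launch-latency measurement like the one above can be taken with a plain NVTX range around the host-side wrapper. This is only a generic sketch of the technique, not necessarily how the numbers above were collected, and the function name is invented:

```cpp
#include <nvToolsExt.h>  // NVTX v2 marker API; link with -lnvToolsExt

// Pushes a range on entry to the host-side wrapper and pops it after the
// kernel has been enqueued; a profiler such as Nsight Systems then shows
// this span next to the GPU timeline, exposing the host-to-launch latency.
void gemm_with_nvtx_range(/* gemm arguments elided */)
{
  nvtxRangePushA("matmul_wrapper");
  // ... call the cublasLt matmul wrapper here ...
  nvtxRangePop();
}
```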

#### Synthetic benchmarks: no significant difference
Running all 4K+ benchmarks across RAFT shows no significant difference in CPU/GPU execution time:
  - Overall, an average execution-time reduction of ~0.5%
  - 100+ benchmarks show a 5-10% time reduction
  - 9 benchmarks show a 5-10% time increase (none of them use GEMM)

Only a small fraction of RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. The lack of an overall gain is not surprising either: most benchmarks are designed for somewhat larger work sizes, which hide the gemm latency.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1736
@GPUtester GPUtester requested a review from a team as a code owner January 23, 2024 19:00
@github-actions github-actions bot added the cpp label Jan 23, 2024
@GPUtester GPUtester merged commit e9ba740 into branch-24.04 Jan 23, 2024
18 checks passed
@GPUtester (Contributor, Author)

SUCCESS - forward-merge complete.
