
Forward-merge branch-24.02 to branch-24.04 #2114

Merged 1 commit into branch-24.04 on Jan 23, 2024

Conversation

GPUtester (Contributor)

Forward-merge triggered by a push to branch-24.02, creating a PR to keep branch-24.04 up to date. If this PR cannot be merged immediately due to conflicts, it will remain open for the team to merge manually.

This PR replaces the current cuBLAS `gemm` backend of `raft::linalg::gemm` with cublasLt `matmul`. The latter is more flexible and allows decoupling the heuristic selection of the algorithm from its execution.
Thanks to this change, this PR adds memoization of the matmul heuristics and of the other arguments (the matrix layouts and the matmul descriptor).
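
To illustrate the idea, here is a minimal sketch, not the actual RAFT implementation: it assumes float32, column-major, non-transposed inputs, omits error checking, and the names `MatmulPlan`, `PlanKey`, and `get_plan` are invented for illustration. The shape-dependent cublasLt objects and the heuristic choice are built once and cached, leaving only the cheap `cublasLtMatmul` enqueue on the hot path:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical cache entry: every cublasLt object that depends only on the
// problem shape, plus the heuristic result, created once and reused.
struct MatmulPlan {
  cublasLtMatmulDesc_t op_desc{};
  cublasLtMatrixLayout_t a_desc{}, b_desc{}, c_desc{};
  cublasLtMatmulHeuristicResult_t heuristic{};
};

// Keyed on shape only for brevity; a real cache would also key on dtypes,
// transpose flags, and leading dimensions, and would need thread safety.
using PlanKey = std::tuple<uint64_t, uint64_t, uint64_t>;  // (m, n, k)

MatmulPlan& get_plan(cublasLtHandle_t handle, uint64_t m, uint64_t n, uint64_t k)
{
  static std::map<PlanKey, MatmulPlan> cache;  // not thread-safe as written
  PlanKey key{m, n, k};
  auto [it, is_new] = cache.try_emplace(key);
  MatmulPlan& plan = it->second;
  if (is_new) {
    // Done once per shape (error checking omitted for brevity).
    cublasLtMatmulDescCreate(&plan.op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasLtMatrixLayoutCreate(&plan.a_desc, CUDA_R_32F, m, k, m);  // column-major
    cublasLtMatrixLayoutCreate(&plan.b_desc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&plan.c_desc, CUDA_R_32F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    int n_results = 0;
    // The relatively expensive step that gets memoized: algorithm selection.
    cublasLtMatmulAlgoGetHeuristic(handle, plan.op_desc, plan.a_desc, plan.b_desc,
                                   plan.c_desc, plan.c_desc, pref, /*requested=*/1,
                                   &plan.heuristic, &n_results);
    cublasLtMatmulPreferenceDestroy(pref);
  }
  return plan;
}

// Hot path: on a cache hit, only the cublasLtMatmul enqueue remains.
void gemm(cublasLtHandle_t handle, uint64_t m, uint64_t n, uint64_t k,
          const float* A, const float* B, float* C, cudaStream_t stream)
{
  const float alpha = 1.0f, beta = 0.0f;
  MatmulPlan& plan = get_plan(handle, m, n, k);
  cublasLtMatmul(handle, plan.op_desc, &alpha, A, plan.a_desc, B, plan.b_desc,
                 &beta, C, plan.c_desc, C, plan.c_desc, &plan.heuristic.algo,
                 /*workspace=*/nullptr, /*workspaceSize=*/0, stream);
}
```

With the legacy cuBLAS path, algorithm selection happens inside every call; moving it off the hot path via the cache is presumably where the launch-latency saving described below comes from.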

#### Performance on specific workloads
IVF-PQ performs two gemm operations during preprocessing on small work sizes. The preprocessing consists of a few kernel launches and rather heavy logic on the CPU side (which results in gaps between the kernel launches).
This PR **roughly halves the gemm kernel launch latency** (approx. 10us -> 5us, as measured by NVTX from entering the `matmul` wrapper on the host to the launch of the kernel).
As a motivating example, this PR improves the QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128, n_queries = 1, n_probes = 20 and 200).
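
For context, a launch-latency measurement like the one above can be taken with a plain NVTX range around the host-side wrapper. This is only a generic sketch of the technique, not necessarily how the numbers above were collected, and the function name is invented:

```cpp
#include <nvToolsExt.h>  // NVTX v2 marker API; link with -lnvToolsExt

// Pushes a range on entry to the host-side wrapper and pops it after the
// kernel has been enqueued; a profiler such as Nsight Systems then shows
// this span next to the GPU timeline, exposing the host-to-launch latency.
void gemm_with_nvtx_range(/* gemm arguments elided */)
{
  nvtxRangePushA("matmul_wrapper");
  // ... call the cublasLt matmul wrapper here ...
  nvtxRangePop();
}
```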

#### Synthetic benchmarks: no significant difference
Running all 4K+ benchmarks across RAFT shows no significant difference in CPU/GPU execution time:
  - Overall, an average execution-time reduction of ~0.5%
  - 100+ benchmarks show a 5-10% time reduction
  - 9 benchmarks show a 5-10% time increase (none of them use GEMM)

Only a small fraction of RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. The lack of an overall gain is not surprising either: most benchmarks are designed for somewhat larger work sizes, which hide the gemm latency.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1736
@GPUtester GPUtester requested a review from a team as a code owner January 23, 2024 19:00
@github-actions github-actions bot added the cpp label Jan 23, 2024
@GPUtester GPUtester merged commit e9ba740 into branch-24.04 Jan 23, 2024
18 checks passed
@GPUtester (Contributor, Author)

SUCCESS - forward-merge complete.
