[Performance] Improve segment_matmul by reducing launching overheads #213
CUTLASS grouped GEMM requires copying matrix pointers and layouts to device memory, which incurs significant "launch" overhead: concretely, 7 pageable H2D copies per call. This PR assembles the grouped GEMM arguments manually in a pinned CPU buffer and copies them to device memory in a single transfer to reduce this overhead.
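The idea can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name, buffer layout, and argument lists are assumptions; the real implementation lives in pyg-lib's `segment_matmul` kernel.

```cuda
// Hypothetical sketch: instead of issuing one pageable H2D copy per
// argument array (problem sizes, A/B/C pointers, leading dimensions),
// pack everything into a single pinned host buffer and upload it with
// one cudaMemcpyAsync on the current PyTorch CUDA stream.
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <c10/cuda/CUDAException.h>
#include <c10/cuda/CUDAStream.h>
#include <cutlass/gemm_coord.h>

void upload_grouped_gemm_args(
    const std::vector<cutlass::gemm::GemmCoord>& sizes,  // per-group (m, n, k)
    const std::vector<float*>& ptr_a,
    const std::vector<float*>& ptr_b,
    const std::vector<float*>& ptr_c,
    const std::vector<int64_t>& lda,
    const std::vector<int64_t>& ldb,
    const std::vector<int64_t>& ldc,
    void* device_buf) {  // preallocated device buffer of sufficient size
  const size_t n        = sizes.size();
  const size_t sz_coord = n * sizeof(cutlass::gemm::GemmCoord);
  const size_t sz_ptr   = n * sizeof(float*);
  const size_t sz_ld    = n * sizeof(int64_t);
  const size_t total    = sz_coord + 3 * sz_ptr + 3 * sz_ld;

  // Pinned (page-locked) staging memory: async copies from it go straight
  // to the GPU without an extra driver-side staging copy.
  char* host_buf = nullptr;
  C10_CUDA_CHECK(cudaMallocHost(reinterpret_cast<void**>(&host_buf), total));

  // Lay out all seven argument arrays back-to-back in the staging buffer.
  char* p = host_buf;
  std::memcpy(p, sizes.data(), sz_coord); p += sz_coord;
  std::memcpy(p, ptr_a.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, ptr_b.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, ptr_c.data(), sz_ptr);   p += sz_ptr;
  std::memcpy(p, lda.data(),   sz_ld);    p += sz_ld;
  std::memcpy(p, ldb.data(),   sz_ld);    p += sz_ld;
  std::memcpy(p, ldc.data(),   sz_ld);

  // One async H2D transfer on the stream PyTorch is currently using,
  // replacing the seven separate pageable copies.
  auto stream = at::cuda::getCurrentCUDAStream();
  C10_CUDA_CHECK(cudaMemcpyAsync(device_buf, host_buf, total,
                                 cudaMemcpyHostToDevice, stream));

  // NB: host_buf must outlive the async copy; a real implementation would
  // cache and reuse the pinned buffer rather than allocate per call.
}
```

The grouped-GEMM `Arguments` struct can then point at fixed offsets inside `device_buf` (problem sizes at offset 0, `ptr_A` at `sz_coord`, and so on), since the relative layout is known on the host.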
Other changes include setting the CUDA stream for the grouped GEMM and adding proper `C10_CUDA_CHECK` and `C10_CUDA_KERNEL_LAUNCH_CHECK` calls.

Performance
Benchmarked with the following script, this PR reduces the op time from 0.29 ms to 0.05 ms on my desktop (RTX 3090).
cc @rusty1s @puririshi98