[Opt] Enforce the UT Coverity and add benchmark for transpose (#2421)
Conversation
Do I understand it right that normally we use `cublas<t>geam()` for matrix transpose, and the only reason for the kernel in this file is that `cublas<t>geam()` does not support half-precision floats?

Maybe this is a little beyond the scope of this PR, but I think we should switch to cublasLt here. The corresponding function is `cublasLtMatrixTransform()`, which seems to support the `half` data type. We're going to slowly deprecate the usage of the old cuBLAS handle, because it's not thread-safe (the handle state includes the CUDA stream, which must not be set concurrently from multiple threads).
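For reference, a minimal sketch of what a `half` transpose via `cublasLtMatrixTransform` could look like (error handling elided; the function name `transpose_half_lt`, the leading dimensions, and the `CUDA_R_32F` scale type are assumptions for illustration, not RAFT code):

```cuda
#include <cublasLt.h>
#include <cuda_fp16.h>

// Hypothetical sketch: transpose an m x n column-major half matrix into out.
void transpose_half_lt(cublasLtHandle_t handle, int m, int n,
                       const __half* in, __half* out, cudaStream_t stream)
{
  cublasLtMatrixTransformDesc_t desc;
  // Scale type for alpha/beta; FP32 is the usual choice for half data.
  cublasLtMatrixTransformDescCreate(&desc, CUDA_R_32F);

  cublasOperation_t op = CUBLAS_OP_T;
  cublasLtMatrixTransformDescSetAttribute(
    desc, CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA, &op, sizeof(op));

  // Input layout is m x n; the transposed output layout is n x m.
  cublasLtMatrixLayout_t in_layout, out_layout;
  cublasLtMatrixLayoutCreate(&in_layout, CUDA_R_16F, m, n, m);
  cublasLtMatrixLayoutCreate(&out_layout, CUDA_R_16F, n, m, n);

  float alpha = 1.0f, beta = 0.0f;
  // B is unused when beta == 0. Note the stream is passed per call, which is
  // what makes cublasLt friendlier to multi-stream use than the legacy handle.
  cublasLtMatrixTransform(handle, desc, &alpha, in, in_layout,
                          &beta, nullptr, nullptr, out, out_layout, stream);

  cublasLtMatrixLayoutDestroy(out_layout);
  cublasLtMatrixLayoutDestroy(in_layout);
  cublasLtMatrixTransformDescDestroy(desc);
}
```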
```diff
@@ -49,7 +51,7 @@ RAFT_KERNEL transpose_half_kernel(IndexType n_rows,
   for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
     if (x < n_cols && (y + j) < n_rows) {
-      tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * n_cols + x]);
+      tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
```
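For context, the surrounding kernel is a standard shared-memory tiled transpose; the change above replaces the hard-coded `n_cols` pitch with a `stride_in` parameter so strided (sub-matrix) inputs read correctly. A self-contained sketch of the pattern (the `TILE_DIM`/`BLOCK_ROWS` values and the `stride_out` handling are typical choices assumed here, not necessarily RAFT's exact code):

```cuda
// Sketch of a shared-memory tiled transpose supporting strided sub-matrices.
constexpr int TILE_DIM   = 32;
constexpr int BLOCK_ROWS = 8;

template <typename T, typename IndexType>
__global__ void transpose_tiled(IndexType n_rows, IndexType n_cols,
                                const T* in, T* out,
                                IndexType stride_in, IndexType stride_out)
{
  // +1 padding avoids shared-memory bank conflicts on the transposed read.
  __shared__ T tile[TILE_DIM][TILE_DIM + 1];

  IndexType x = blockIdx.x * TILE_DIM + threadIdx.x;
  IndexType y = blockIdx.y * TILE_DIM + threadIdx.y;

  // Coalesced load: each thread walks down the tile in BLOCK_ROWS steps.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < n_cols && (y + j) < n_rows) {
      tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
    }
  }
  __syncthreads();

  // Swap block coordinates for the transposed, still-coalesced write.
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < n_rows && (y + j) < n_cols) {
      out[(y + j) * stride_out + x] = tile[threadIdx.x][threadIdx.y + j];
    }
  }
}
```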
I know it's not part of the change, but it's advisable to use raft's helpers instead of `__xxx` functions unless you need a specific cache behavior (for which, maybe, we should add more helpers?..)

```diff
-      tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
+      tile[threadIdx.y + j][threadIdx.x] = raft::ldg(&in[(y + j) * stride_in + x]);
```
Hi @achirkin, thank you for your suggestion! Yeah, you're right: it's because `cublas<t>geam()` does not support `half`. About `cublasLtMatrixTransform`, let me see how much I can change. (Would you suggest changing it in this PR, or in a separate follow-up that changes all of `transpose`?)
Hi @achirkin, I just benchmarked `cublasLtMatrixTransform` and found its performance is around 10–20% lower than the current implementation. So, could I keep the current implementation for now?
benchmark                                                    transpose_half_kernel  cublasLtMatrixTransform  rows#cols
TransposeBench<half, int, raft::row_major>/0/manual_time 0.006 ms 0.009 ms 10#10000
TransposeBench<half, int, raft::row_major>/1/manual_time 0.016 ms 0.017 ms 10#100000
TransposeBench<half, int, raft::row_major>/2/manual_time 0.122 ms 0.108 ms 10#1000000
TransposeBench<half, int, raft::row_major>/3/manual_time 0.011 ms 0.014 ms 128#10000
TransposeBench<half, int, raft::row_major>/4/manual_time 0.084 ms 0.091 ms 128#100000
TransposeBench<half, int, raft::row_major>/5/manual_time 0.762 ms 0.845 ms 128#1000000
TransposeBench<half, int, raft::row_major>/6/manual_time 0.022 ms 0.023 ms 256#10000
TransposeBench<half, int, raft::row_major>/7/manual_time 0.156 ms 0.186 ms 256#100000
TransposeBench<half, int, raft::row_major>/8/manual_time 1.53 ms 1.80 ms 256#1000000
TransposeBench<half, int, raft::row_major>/9/manual_time 0.035 ms 0.041 ms 512#10000
TransposeBench<half, int, raft::row_major>/10/manual_time 0.310 ms 0.395 ms 512#100000
TransposeBench<half, int, raft::row_major>/11/manual_time 3.09 ms 3.91 ms 512#1000000
TransposeBench<half, int, raft::row_major>/12/manual_time 0.073 ms 0.076 ms 1024#10000
TransposeBench<half, int, raft::row_major>/13/manual_time 0.642 ms 0.796 ms 1024#100000
TransposeBench<half, int, raft::row_major>/14/manual_time 6.29 ms 7.94 ms 1024#1000000
TransposeBench<half, int, raft::col_major>/0/manual_time 0.006 ms 0.009 ms 10#10000
TransposeBench<half, int, raft::col_major>/1/manual_time 0.017 ms 0.017 ms 10#100000
TransposeBench<half, int, raft::col_major>/2/manual_time 0.125 ms 0.109 ms 10#1000000
TransposeBench<half, int, raft::col_major>/3/manual_time 0.011 ms 0.014 ms 128#10000
TransposeBench<half, int, raft::col_major>/4/manual_time 0.084 ms 0.091 ms 128#100000
TransposeBench<half, int, raft::col_major>/5/manual_time 0.762 ms 0.847 ms 128#1000000
TransposeBench<half, int, raft::col_major>/6/manual_time 0.022 ms 0.023 ms 256#10000
TransposeBench<half, int, raft::col_major>/7/manual_time 0.156 ms 0.186 ms 256#100000
TransposeBench<half, int, raft::col_major>/8/manual_time 1.53 ms 1.80 ms 256#1000000
TransposeBench<half, int, raft::col_major>/9/manual_time 0.035 ms 0.041 ms 512#10000
TransposeBench<half, int, raft::col_major>/10/manual_time 0.310 ms 0.396 ms 512#100000
TransposeBench<half, int, raft::col_major>/11/manual_time 3.09 ms 3.91 ms 512#1000000
TransposeBench<half, int, raft::col_major>/12/manual_time 0.073 ms 0.076 ms 1024#10000
TransposeBench<half, int, raft::col_major>/13/manual_time 0.643 ms 0.796 ms 1024#100000
TransposeBench<half, int, raft::col_major>/14/manual_time 6.29 ms 7.95 ms 1024#1000000
@rhdong let's create an issue for Artem's suggested change and reference it in a TODO comment in the corresponding kernel in the code. I think we should definitely investigate this so that we utilize math libs wherever possible (and don't have to maintain both the math libs and our own custom impls), but I don't think the further investigation should hold up this PR.
> @rhdong let's create an issue for Artem's suggested change and reference it in a TODO comment in the corresponding kernel in the code. I think we should definitely investigate this so that we utilize math libs wherever possible (and don't have to maintain both the math libs and our own custom impls), but I don't think the further investigation should hold up this PR.

Yeah, here it is: #2436
I agree, let's move the `cublasLtMatrixTransform` discussion to a follow-up issue/PR; this one LGTM!
- Fix: `transpose_half` is not compatible with the sub-matrix cases.

Force-pushed from 1c07d60 to 8501d62.