Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opt] Enforce the UT Coverity and add benchmark for transpose #2421

Closed
wants to merge 1 commit into from

Conversation

rhdong
Copy link
Member

@rhdong rhdong commented Aug 27, 2024

  • Fix the transpose_half is not compatible with the sub-matrix cases.
  • Benchmark on A100 with 80GB PCIE:
Running ./cpp/build/bench/prims/LINALG_BENCH
Run on (16 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x4)
Load Average: 4.09, 8.38, 6.61
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
TransposeBench<float, int, raft::row_major>/0/manual_time       0.009 ms        0.096 ms        78786 10#10000
TransposeBench<float, int, raft::row_major>/1/manual_time       0.021 ms        0.107 ms        32933 10#100000
TransposeBench<float, int, raft::row_major>/2/manual_time       0.169 ms        0.252 ms         4094 10#1000000
TransposeBench<float, int, raft::row_major>/3/manual_time       0.013 ms        0.101 ms        52695 128#10000
TransposeBench<float, int, raft::row_major>/4/manual_time       0.085 ms        0.173 ms         8084 128#100000
TransposeBench<float, int, raft::row_major>/5/manual_time       0.810 ms        0.898 ms          840 128#1000000
TransposeBench<float, int, raft::row_major>/6/manual_time       0.020 ms        0.108 ms        35381 256#10000
TransposeBench<float, int, raft::row_major>/7/manual_time       0.151 ms        0.239 ms         4559 256#100000
TransposeBench<float, int, raft::row_major>/8/manual_time        1.50 ms         1.59 ms          456 256#1000000
TransposeBench<float, int, raft::row_major>/9/manual_time       0.035 ms        0.124 ms        19852 512#10000
TransposeBench<float, int, raft::row_major>/10/manual_time      0.287 ms        0.375 ms         2430 512#100000
TransposeBench<float, int, raft::row_major>/11/manual_time       2.92 ms         3.02 ms          235 512#1000000
TransposeBench<float, int, raft::row_major>/12/manual_time      0.061 ms        0.150 ms        10898 1024#10000
TransposeBench<float, int, raft::row_major>/13/manual_time      0.556 ms        0.644 ms         1208 1024#100000
TransposeBench<float, int, raft::row_major>/14/manual_time       5.81 ms         5.92 ms          115 1024#1000000
TransposeBench<half, int, raft::row_major>/0/manual_time        0.009 ms        0.096 ms        78054 10#10000
TransposeBench<half, int, raft::row_major>/1/manual_time        0.018 ms        0.104 ms        37908 10#100000
TransposeBench<half, int, raft::row_major>/2/manual_time        0.093 ms        0.176 ms         7538 10#1000000
TransposeBench<half, int, raft::row_major>/3/manual_time        0.011 ms        0.098 ms        59993 128#10000
TransposeBench<half, int, raft::row_major>/4/manual_time        0.052 ms        0.138 ms        13555 128#100000
TransposeBench<half, int, raft::row_major>/5/manual_time        0.364 ms        0.451 ms         1931 128#1000000
TransposeBench<half, int, raft::row_major>/6/manual_time        0.015 ms        0.102 ms        47278 256#10000
TransposeBench<half, int, raft::row_major>/7/manual_time        0.089 ms        0.175 ms         7926 256#100000
TransposeBench<half, int, raft::row_major>/8/manual_time        0.755 ms        0.840 ms          918 256#1000000
TransposeBench<half, int, raft::row_major>/9/manual_time        0.026 ms        0.113 ms        26919 512#10000
TransposeBench<half, int, raft::row_major>/10/manual_time       0.178 ms        0.263 ms         3959 512#100000
TransposeBench<half, int, raft::row_major>/11/manual_time        1.51 ms         1.59 ms          457 512#1000000
TransposeBench<half, int, raft::row_major>/12/manual_time       0.039 ms        0.126 ms        18103 1024#10000
TransposeBench<half, int, raft::row_major>/13/manual_time       0.324 ms        0.410 ms         2178 1024#100000
TransposeBench<half, int, raft::row_major>/14/manual_time        3.49 ms         3.59 ms          196 1024#1000000
TransposeBench<float, int, raft::col_major>/0/manual_time       0.009 ms        0.096 ms        80619 10#10000
TransposeBench<float, int, raft::col_major>/1/manual_time       0.021 ms        0.107 ms        33000 10#100000
TransposeBench<float, int, raft::col_major>/2/manual_time       0.169 ms        0.252 ms         4101 10#1000000
TransposeBench<float, int, raft::col_major>/3/manual_time       0.013 ms        0.101 ms        53351 128#10000
TransposeBench<float, int, raft::col_major>/4/manual_time       0.085 ms        0.173 ms         8087 128#100000
TransposeBench<float, int, raft::col_major>/5/manual_time       0.810 ms        0.898 ms          839 128#1000000
TransposeBench<float, int, raft::col_major>/6/manual_time       0.020 ms        0.108 ms        35044 256#10000
TransposeBench<float, int, raft::col_major>/7/manual_time       0.151 ms        0.239 ms         4560 256#100000
TransposeBench<float, int, raft::col_major>/8/manual_time        1.51 ms         1.60 ms          455 256#1000000
TransposeBench<float, int, raft::col_major>/9/manual_time       0.035 ms        0.124 ms        19759 512#10000
TransposeBench<float, int, raft::col_major>/10/manual_time      0.287 ms        0.376 ms         2429 512#100000
TransposeBench<float, int, raft::col_major>/11/manual_time       2.92 ms         3.02 ms          234 512#1000000
TransposeBench<float, int, raft::col_major>/12/manual_time      0.061 ms        0.150 ms        10891 1024#10000
TransposeBench<float, int, raft::col_major>/13/manual_time      0.556 ms        0.644 ms         1207 1024#100000
TransposeBench<float, int, raft::col_major>/14/manual_time       5.79 ms         5.91 ms          115 1024#1000000
TransposeBench<half, int, raft::col_major>/0/manual_time        0.009 ms        0.096 ms        79522 10#10000
TransposeBench<half, int, raft::col_major>/1/manual_time        0.018 ms        0.104 ms        37535 10#100000
TransposeBench<half, int, raft::col_major>/2/manual_time        0.093 ms        0.176 ms         7536 10#1000000
TransposeBench<half, int, raft::col_major>/3/manual_time        0.011 ms        0.098 ms        61023 128#10000
TransposeBench<half, int, raft::col_major>/4/manual_time        0.052 ms        0.138 ms        13527 128#100000
TransposeBench<half, int, raft::col_major>/5/manual_time        0.364 ms        0.451 ms         1929 128#1000000
TransposeBench<half, int, raft::col_major>/6/manual_time        0.015 ms        0.102 ms        47299 256#10000
TransposeBench<half, int, raft::col_major>/7/manual_time        0.089 ms        0.175 ms         7927 256#100000
TransposeBench<half, int, raft::col_major>/8/manual_time        0.755 ms        0.841 ms          919 256#1000000
TransposeBench<half, int, raft::col_major>/9/manual_time        0.026 ms        0.113 ms        26910 512#10000
TransposeBench<half, int, raft::col_major>/10/manual_time       0.178 ms        0.263 ms         3950 512#100000
TransposeBench<half, int, raft::col_major>/11/manual_time        1.51 ms         1.59 ms          458 512#1000000
TransposeBench<half, int, raft::col_major>/12/manual_time       0.039 ms        0.126 ms        18098 1024#10000
TransposeBench<half, int, raft::col_major>/13/manual_time       0.323 ms        0.409 ms         2166 1024#100000
TransposeBench<half, int, raft::col_major>/14/manual_time        3.48 ms         3.58 ms          197 1024#1000000

@rhdong rhdong requested review from a team as code owners August 27, 2024 23:08
@rhdong rhdong requested a review from cjnolet August 27, 2024 23:08
@rhdong rhdong added 3 - Ready for Review non-breaking Non-breaking change enhancement New feature or request and removed cpp CMake labels Aug 27, 2024
@rhdong rhdong added Benchmarks feature request New feature or request labels Aug 27, 2024
Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it right, that normally we use cublasgeam() for matrix transpose, and the only reason for the kernel in this file is that cublas<t>geam() does not support the half-precision float?
Maybe this is a little beyond the scope of this PR, but I think we should switch to cublasLt here. The corresponding function is cublasLtMatrixTransform(), which seems to support half data type. We're going to slowly deprecate the usage of old cuBLAS handle, because it's not thread-safe (the handle state includes the CUDA stream, which must not be set concurrently in multiple streams).

@@ -49,7 +51,7 @@ RAFT_KERNEL transpose_half_kernel(IndexType n_rows,

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (x < n_cols && (y + j) < n_rows) {
tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * n_cols + x]);
tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
Copy link
Contributor

@achirkin achirkin Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not the part of the change, but it's advisable to use raft's helpers instread of __xxx functions unless you need a specific cache behavior (for which, maybe, we should add more helpers?..)

Suggested change
tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
tile[threadIdx.y + j][threadIdx.x] = raft::ldg(&in[(y + j) * stride_in + x]);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @achirkin , thank you for your suggestion! Yeah, you're right! Because cublas<t>geam()does not support the half. About the cublasLtMatrixTransform. Let me see how many I can change. (would you like to suggest changing this PR or the next separate one to change all of transpose ? )

Copy link
Member Author

@rhdong rhdong Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @achirkin, I just tested the cublasLtMatrixTransform benchmark and found that performance would be lower than the current implementation by around 10~20%. So, could I keep the current implementation temporarily:

                                                            transpose_half_kernel   cublasLtMatrixTransform    rows* cols
TransposeBench<half, int, raft::row_major>/0/manual_time    0.006 ms                0.009 ms                   10#10000
TransposeBench<half, int, raft::row_major>/1/manual_time    0.016 ms                0.017 ms                   10#100000
TransposeBench<half, int, raft::row_major>/2/manual_time    0.122 ms                0.108 ms                   10#1000000
TransposeBench<half, int, raft::row_major>/3/manual_time    0.011 ms                0.014 ms                   128#10000
TransposeBench<half, int, raft::row_major>/4/manual_time    0.084 ms                0.091 ms                   128#100000
TransposeBench<half, int, raft::row_major>/5/manual_time    0.762 ms                0.845 ms                   128#1000000
TransposeBench<half, int, raft::row_major>/6/manual_time    0.022 ms                0.023 ms                   256#10000
TransposeBench<half, int, raft::row_major>/7/manual_time    0.156 ms                0.186 ms                   256#100000
TransposeBench<half, int, raft::row_major>/8/manual_time     1.53 ms                 1.80 ms                   256#1000000
TransposeBench<half, int, raft::row_major>/9/manual_time    0.035 ms                0.041 ms                   512#10000
TransposeBench<half, int, raft::row_major>/10/manual_time   0.310 ms                0.395 ms                   512#100000
TransposeBench<half, int, raft::row_major>/11/manual_time    3.09 ms                 3.91 ms                   512#1000000
TransposeBench<half, int, raft::row_major>/12/manual_time   0.073 ms                0.076 ms                   1024#10000
TransposeBench<half, int, raft::row_major>/13/manual_time   0.642 ms                0.796 ms                   1024#100000
TransposeBench<half, int, raft::row_major>/14/manual_time    6.29 ms                 7.94 ms                   1024#1000000

TransposeBench<half, int, raft::col_major>/0/manual_time    0.006 ms                0.009 ms                   10#10000
TransposeBench<half, int, raft::col_major>/1/manual_time    0.017 ms                0.017 ms                   10#100000
TransposeBench<half, int, raft::col_major>/2/manual_time    0.125 ms                0.109 ms                   10#1000000
TransposeBench<half, int, raft::col_major>/3/manual_time    0.011 ms                0.014 ms                   128#10000
TransposeBench<half, int, raft::col_major>/4/manual_time    0.084 ms                0.091 ms                   128#100000
TransposeBench<half, int, raft::col_major>/5/manual_time    0.762 ms                0.847 ms                   128#1000000
TransposeBench<half, int, raft::col_major>/6/manual_time    0.022 ms                0.023 ms                   256#10000
TransposeBench<half, int, raft::col_major>/7/manual_time    0.156 ms                0.186 ms                   256#100000
TransposeBench<half, int, raft::col_major>/8/manual_time     1.53 ms                 1.80 ms                   256#1000000
TransposeBench<half, int, raft::col_major>/9/manual_time    0.035 ms                0.041 ms                   512#10000
TransposeBench<half, int, raft::col_major>/10/manual_time   0.310 ms                0.396 ms                   512#100000
TransposeBench<half, int, raft::col_major>/11/manual_time    3.09 ms                 3.91 ms                   512#1000000
TransposeBench<half, int, raft::col_major>/12/manual_time   0.073 ms                0.076 ms                   1024#10000
TransposeBench<half, int, raft::col_major>/13/manual_time   0.643 ms                0.796 ms                   1024#100000
TransposeBench<half, int, raft::col_major>/14/manual_time    6.29 ms                 7.95 ms                   1024#1000000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhdong lets create an issue for Artem’s suggested change and reference it in a todo comment in the corresponding kernel in the code. I think we should investigate this for sure so that we are utilizing math libs where at all possible (and not having to maintain both math libs and our own custom impls) but I do not think the further investigation should hold up this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhdong lets create an issue for Artem’s suggested change and reference it in a todo comment in the corresponding kernel in the code. I think we should investigate this for sure so that we are utilizing math libs where at all possible (and not having to maintain both math libs and our own custom impls) but I do not think the further investigation should hold up this PR.

Yeah, here it is: #2436

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's move the cublasLtMatrixTransform discussion to a follow up issue/pr; this one LGTM!

- Fix the `transpose_half` is not compatible with the sub-matrix cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Benchmarks CMake cpp enhancement New feature or request feature request New feature or request non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

3 participants