[Opt] Enforce the UT Coverity and add benchmark for `transpose` #2421

rhdong · 2024-08-27T23:08:47Z

Fix the transpose_half is not compatible with the sub-matrix cases.
Benchmark on A100 with 80GB PCIE:

Running ./cpp/build/bench/prims/LINALG_BENCH
Run on (16 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x4)
Load Average: 4.09, 8.38, 6.61
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
TransposeBench<float, int, raft::row_major>/0/manual_time       0.009 ms        0.096 ms        78786 10#10000
TransposeBench<float, int, raft::row_major>/1/manual_time       0.021 ms        0.107 ms        32933 10#100000
TransposeBench<float, int, raft::row_major>/2/manual_time       0.169 ms        0.252 ms         4094 10#1000000
TransposeBench<float, int, raft::row_major>/3/manual_time       0.013 ms        0.101 ms        52695 128#10000
TransposeBench<float, int, raft::row_major>/4/manual_time       0.085 ms        0.173 ms         8084 128#100000
TransposeBench<float, int, raft::row_major>/5/manual_time       0.810 ms        0.898 ms          840 128#1000000
TransposeBench<float, int, raft::row_major>/6/manual_time       0.020 ms        0.108 ms        35381 256#10000
TransposeBench<float, int, raft::row_major>/7/manual_time       0.151 ms        0.239 ms         4559 256#100000
TransposeBench<float, int, raft::row_major>/8/manual_time        1.50 ms         1.59 ms          456 256#1000000
TransposeBench<float, int, raft::row_major>/9/manual_time       0.035 ms        0.124 ms        19852 512#10000
TransposeBench<float, int, raft::row_major>/10/manual_time      0.287 ms        0.375 ms         2430 512#100000
TransposeBench<float, int, raft::row_major>/11/manual_time       2.92 ms         3.02 ms          235 512#1000000
TransposeBench<float, int, raft::row_major>/12/manual_time      0.061 ms        0.150 ms        10898 1024#10000
TransposeBench<float, int, raft::row_major>/13/manual_time      0.556 ms        0.644 ms         1208 1024#100000
TransposeBench<float, int, raft::row_major>/14/manual_time       5.81 ms         5.92 ms          115 1024#1000000
TransposeBench<half, int, raft::row_major>/0/manual_time        0.009 ms        0.096 ms        78054 10#10000
TransposeBench<half, int, raft::row_major>/1/manual_time        0.018 ms        0.104 ms        37908 10#100000
TransposeBench<half, int, raft::row_major>/2/manual_time        0.093 ms        0.176 ms         7538 10#1000000
TransposeBench<half, int, raft::row_major>/3/manual_time        0.011 ms        0.098 ms        59993 128#10000
TransposeBench<half, int, raft::row_major>/4/manual_time        0.052 ms        0.138 ms        13555 128#100000
TransposeBench<half, int, raft::row_major>/5/manual_time        0.364 ms        0.451 ms         1931 128#1000000
TransposeBench<half, int, raft::row_major>/6/manual_time        0.015 ms        0.102 ms        47278 256#10000
TransposeBench<half, int, raft::row_major>/7/manual_time        0.089 ms        0.175 ms         7926 256#100000
TransposeBench<half, int, raft::row_major>/8/manual_time        0.755 ms        0.840 ms          918 256#1000000
TransposeBench<half, int, raft::row_major>/9/manual_time        0.026 ms        0.113 ms        26919 512#10000
TransposeBench<half, int, raft::row_major>/10/manual_time       0.178 ms        0.263 ms         3959 512#100000
TransposeBench<half, int, raft::row_major>/11/manual_time        1.51 ms         1.59 ms          457 512#1000000
TransposeBench<half, int, raft::row_major>/12/manual_time       0.039 ms        0.126 ms        18103 1024#10000
TransposeBench<half, int, raft::row_major>/13/manual_time       0.324 ms        0.410 ms         2178 1024#100000
TransposeBench<half, int, raft::row_major>/14/manual_time        3.49 ms         3.59 ms          196 1024#1000000
TransposeBench<float, int, raft::col_major>/0/manual_time       0.009 ms        0.096 ms        80619 10#10000
TransposeBench<float, int, raft::col_major>/1/manual_time       0.021 ms        0.107 ms        33000 10#100000
TransposeBench<float, int, raft::col_major>/2/manual_time       0.169 ms        0.252 ms         4101 10#1000000
TransposeBench<float, int, raft::col_major>/3/manual_time       0.013 ms        0.101 ms        53351 128#10000
TransposeBench<float, int, raft::col_major>/4/manual_time       0.085 ms        0.173 ms         8087 128#100000
TransposeBench<float, int, raft::col_major>/5/manual_time       0.810 ms        0.898 ms          839 128#1000000
TransposeBench<float, int, raft::col_major>/6/manual_time       0.020 ms        0.108 ms        35044 256#10000
TransposeBench<float, int, raft::col_major>/7/manual_time       0.151 ms        0.239 ms         4560 256#100000
TransposeBench<float, int, raft::col_major>/8/manual_time        1.51 ms         1.60 ms          455 256#1000000
TransposeBench<float, int, raft::col_major>/9/manual_time       0.035 ms        0.124 ms        19759 512#10000
TransposeBench<float, int, raft::col_major>/10/manual_time      0.287 ms        0.376 ms         2429 512#100000
TransposeBench<float, int, raft::col_major>/11/manual_time       2.92 ms         3.02 ms          234 512#1000000
TransposeBench<float, int, raft::col_major>/12/manual_time      0.061 ms        0.150 ms        10891 1024#10000
TransposeBench<float, int, raft::col_major>/13/manual_time      0.556 ms        0.644 ms         1207 1024#100000
TransposeBench<float, int, raft::col_major>/14/manual_time       5.79 ms         5.91 ms          115 1024#1000000
TransposeBench<half, int, raft::col_major>/0/manual_time        0.009 ms        0.096 ms        79522 10#10000
TransposeBench<half, int, raft::col_major>/1/manual_time        0.018 ms        0.104 ms        37535 10#100000
TransposeBench<half, int, raft::col_major>/2/manual_time        0.093 ms        0.176 ms         7536 10#1000000
TransposeBench<half, int, raft::col_major>/3/manual_time        0.011 ms        0.098 ms        61023 128#10000
TransposeBench<half, int, raft::col_major>/4/manual_time        0.052 ms        0.138 ms        13527 128#100000
TransposeBench<half, int, raft::col_major>/5/manual_time        0.364 ms        0.451 ms         1929 128#1000000
TransposeBench<half, int, raft::col_major>/6/manual_time        0.015 ms        0.102 ms        47299 256#10000
TransposeBench<half, int, raft::col_major>/7/manual_time        0.089 ms        0.175 ms         7927 256#100000
TransposeBench<half, int, raft::col_major>/8/manual_time        0.755 ms        0.841 ms          919 256#1000000
TransposeBench<half, int, raft::col_major>/9/manual_time        0.026 ms        0.113 ms        26910 512#10000
TransposeBench<half, int, raft::col_major>/10/manual_time       0.178 ms        0.263 ms         3950 512#100000
TransposeBench<half, int, raft::col_major>/11/manual_time        1.51 ms         1.59 ms          458 512#1000000
TransposeBench<half, int, raft::col_major>/12/manual_time       0.039 ms        0.126 ms        18098 1024#10000
TransposeBench<half, int, raft::col_major>/13/manual_time       0.323 ms        0.409 ms         2166 1024#100000
TransposeBench<half, int, raft::col_major>/14/manual_time        3.48 ms         3.58 ms          197 1024#1000000

achirkin

Do I understand it right, that normally we use cublasgeam() for matrix transpose, and the only reason for the kernel in this file is that cublas<t>geam() does not support the half-precision float?
Maybe this is a little beyond the scope of this PR, but I think we should switch to cublasLt here. The corresponding function is cublasLtMatrixTransform(), which seems to support half data type. We're going to slowly deprecate the usage of old cuBLAS handle, because it's not thread-safe (the handle state includes the CUDA stream, which must not be set concurrently in multiple streams).

achirkin · 2024-09-16T06:24:05Z

cpp/include/raft/linalg/detail/transpose.cuh

@@ -49,7 +51,7 @@ RAFT_KERNEL transpose_half_kernel(IndexType n_rows,

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < n_cols && (y + j) < n_rows) {
-          tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * n_cols + x]);
+          tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);


I know it's not the part of the change, but it's advisable to use raft's helpers instread of __xxx functions unless you need a specific cache behavior (for which, maybe, we should add more helpers?..)

Suggested change

tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);

tile[threadIdx.y + j][threadIdx.x] = raft::ldg(&in[(y + j) * stride_in + x]);

Hi @achirkin , thank you for your suggestion! Yeah, you're right! Because cublas<t>geam()does not support the half. About the cublasLtMatrixTransform. Let me see how many I can change. (would you like to suggest changing this PR or the next separate one to change all of transpose ? )

Hi @achirkin, I just tested the cublasLtMatrixTransform benchmark and found that performance would be lower than the current implementation by around 10~20%. So, could I keep the current implementation temporarily:

transpose_half_kernel cublasLtMatrixTransform rows* cols TransposeBench<half, int, raft::row_major>/0/manual_time 0.006 ms 0.009 ms 10#10000 TransposeBench<half, int, raft::row_major>/1/manual_time 0.016 ms 0.017 ms 10#100000 TransposeBench<half, int, raft::row_major>/2/manual_time 0.122 ms 0.108 ms 10#1000000 TransposeBench<half, int, raft::row_major>/3/manual_time 0.011 ms 0.014 ms 128#10000 TransposeBench<half, int, raft::row_major>/4/manual_time 0.084 ms 0.091 ms 128#100000 TransposeBench<half, int, raft::row_major>/5/manual_time 0.762 ms 0.845 ms 128#1000000 TransposeBench<half, int, raft::row_major>/6/manual_time 0.022 ms 0.023 ms 256#10000 TransposeBench<half, int, raft::row_major>/7/manual_time 0.156 ms 0.186 ms 256#100000 TransposeBench<half, int, raft::row_major>/8/manual_time 1.53 ms 1.80 ms 256#1000000 TransposeBench<half, int, raft::row_major>/9/manual_time 0.035 ms 0.041 ms 512#10000 TransposeBench<half, int, raft::row_major>/10/manual_time 0.310 ms 0.395 ms 512#100000 TransposeBench<half, int, raft::row_major>/11/manual_time 3.09 ms 3.91 ms 512#1000000 TransposeBench<half, int, raft::row_major>/12/manual_time 0.073 ms 0.076 ms 1024#10000 TransposeBench<half, int, raft::row_major>/13/manual_time 0.642 ms 0.796 ms 1024#100000 TransposeBench<half, int, raft::row_major>/14/manual_time 6.29 ms 7.94 ms 1024#1000000 TransposeBench<half, int, raft::col_major>/0/manual_time 0.006 ms 0.009 ms 10#10000 TransposeBench<half, int, raft::col_major>/1/manual_time 0.017 ms 0.017 ms 10#100000 TransposeBench<half, int, raft::col_major>/2/manual_time 0.125 ms 0.109 ms 10#1000000 TransposeBench<half, int, raft::col_major>/3/manual_time 0.011 ms 0.014 ms 128#10000 TransposeBench<half, int, raft::col_major>/4/manual_time 0.084 ms 0.091 ms 128#100000 TransposeBench<half, int, raft::col_major>/5/manual_time 0.762 ms 0.847 ms 128#1000000 TransposeBench<half, int, raft::col_major>/6/manual_time 0.022 ms 0.023 ms 256#10000 TransposeBench<half, int, raft::col_major>/7/manual_time 0.156 ms 0.186 ms 256#100000 TransposeBench<half, int, raft::col_major>/8/manual_time 1.53 ms 1.80 ms 256#1000000 TransposeBench<half, int, raft::col_major>/9/manual_time 0.035 ms 0.041 ms 512#10000 TransposeBench<half, int, raft::col_major>/10/manual_time 0.310 ms 0.396 ms 512#100000 TransposeBench<half, int, raft::col_major>/11/manual_time 3.09 ms 3.91 ms 512#1000000 TransposeBench<half, int, raft::col_major>/12/manual_time 0.073 ms 0.076 ms 1024#10000 TransposeBench<half, int, raft::col_major>/13/manual_time 0.643 ms 0.796 ms 1024#100000 TransposeBench<half, int, raft::col_major>/14/manual_time 6.29 ms 7.95 ms 1024#1000000

@rhdong lets create an issue for Artem’s suggested change and reference it in a todo comment in the corresponding kernel in the code. I think we should investigate this for sure so that we are utilizing math libs where at all possible (and not having to maintain both math libs and our own custom impls) but I do not think the further investigation should hold up this PR.

@rhdong lets create an issue for Artem’s suggested change and reference it in a todo comment in the corresponding kernel in the code. I think we should investigate this for sure so that we are utilizing math libs where at all possible (and not having to maintain both math libs and our own custom impls) but I do not think the further investigation should hold up this PR.

Yeah, here it is: #2436

achirkin

I agree, let's move the cublasLtMatrixTransform discussion to a follow up issue/pr; this one LGTM!

- Fix the `transpose_half` is not compatible with the sub-matrix cases.

rhdong requested review from a team as code owners August 27, 2024 23:08

rhdong requested a review from cjnolet August 27, 2024 23:08

github-actions bot added cpp CMake labels Aug 27, 2024

rhdong added 3 - Ready for Review non-breaking Non-breaking change enhancement New feature or request and removed cpp CMake labels Aug 27, 2024

github-actions bot added cpp CMake labels Aug 27, 2024

rhdong added Benchmarks feature request New feature or request labels Aug 27, 2024

achirkin requested changes Sep 16, 2024

View reviewed changes

achirkin approved these changes Sep 17, 2024

View reviewed changes

rhdong mentioned this pull request Sep 17, 2024

[FEA] Investigate the performance issue of cublasLtMatrixTransform and switch the transpose to it ASAP. #2436

Open

[Opt] Enforce the UT Coverity and add benchmark for transpose

8501d62

- Fix the `transpose_half` is not compatible with the sub-matrix cases.

rhdong force-pushed the rhdong/transpose-ut branch from 1c07d60 to 8501d62 Compare September 18, 2024 20:49

rhdong closed this Sep 18, 2024

rhdong mentioned this pull request Sep 18, 2024

[Opt] Enforce the UT Coverity and add benchmark for transpose #2438

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Opt] Enforce the UT Coverity and add benchmark for `transpose` #2421

[Opt] Enforce the UT Coverity and add benchmark for `transpose` #2421

rhdong commented Aug 27, 2024

achirkin left a comment

achirkin Sep 16, 2024 •

edited

Loading

rhdong Sep 16, 2024

rhdong Sep 16, 2024 •

edited

Loading

cjnolet Sep 16, 2024

rhdong Sep 17, 2024

achirkin left a comment

	tile[threadIdx.y + j][threadIdx.x] = __ldg(&in[(y + j) * stride_in + x]);
	tile[threadIdx.y + j][threadIdx.x] = raft::ldg(&in[(y + j) * stride_in + x]);

[Opt] Enforce the UT Coverity and add benchmark for transpose #2421

[Opt] Enforce the UT Coverity and add benchmark for transpose #2421

Conversation

rhdong commented Aug 27, 2024

achirkin left a comment

Choose a reason for hiding this comment

achirkin Sep 16, 2024 • edited Loading

Choose a reason for hiding this comment

rhdong Sep 16, 2024

Choose a reason for hiding this comment

rhdong Sep 16, 2024 • edited Loading

Choose a reason for hiding this comment

cjnolet Sep 16, 2024

Choose a reason for hiding this comment

rhdong Sep 17, 2024

Choose a reason for hiding this comment

achirkin left a comment

Choose a reason for hiding this comment

[Opt] Enforce the UT Coverity and add benchmark for `transpose` #2421

[Opt] Enforce the UT Coverity and add benchmark for `transpose` #2421

achirkin Sep 16, 2024 •

edited

Loading

rhdong Sep 16, 2024 •

edited

Loading