column to row refactor for performance #11063
Conversation
Updated the original benchmarks to show the differences. The downside on the row-to-column side is due to changes here that didn't play nicely with the other side; the other side is the next PR and will fix that.
A couple of minor questions, and a possible catch.
```cpp
#ifdef ASYNC_MEMCPY_SUPPORTED
  auto &processing_barrier = tile_barrier[processing_index % NUM_TILES_PER_KERNEL_LOADED];
  processing_barrier.arrive_and_wait();
  tile_barrier.arrive_and_wait();
#else
  group.sync();
  group.sync();
#endif  // ASYNC_MEMCPY_SUPPORTED
```
Calling out what we observed here: we'll need to unconditionally `group.sync()`, and not wait on the `tile_barrier`.
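A minimal sketch of what that suggestion might look like, reusing the names from the snippet above (illustrative only, not the final code):

```cpp
#ifdef ASYNC_MEMCPY_SUPPORTED
  // still wait for the asynchronous copy of this tile to complete
  auto &processing_barrier = tile_barrier[processing_index % NUM_TILES_PER_KERNEL_LOADED];
  processing_barrier.arrive_and_wait();
#endif  // ASYNC_MEMCPY_SUPPORTED
  // synchronize the whole thread block unconditionally instead of waiting on tile_barrier
  group.sync();
```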
This is AMAZING! How did you find this? You are really taking great care to understand the underlying code here and I really appreciate it. This is absolutely an issue that needed to be fixed as it would be a nasty bug to find later!
```cpp
// each warp takes a column with each thread of a warp taking a row
for (int relative_col = warp.meta_group_rank(); relative_col < num_tile_cols;
     relative_col += warp.meta_group_size()) {
  for (int relative_row = warp.thread_rank(); relative_row < num_tile_rows;
       relative_row += warp.size()) {
```
Question for those more knowledgeable than myself: in spite of `warp.size()` being a `constexpr`, we would not use `#pragma unroll` here because the body of the inner `for` loop is long, right?
With the conditionals inside the loop and the size of the loop, I wouldn't expect attempting to unroll it to improve things, but I honestly don't know if it would. It may depend on how large `num_tile_rows / warp.size()` is, as that is how many iterations each thread will do.
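For illustration only, a standalone sketch of the trade-off: the stride is the compile-time warp size, but the trip count is only known at run time, so `#pragma unroll` can at best request a partial unroll (the kernel name, the factor of 4, and the trivial body are all hypothetical):

```cpp
// Hypothetical sketch: a warp-strided loop with a partial unroll hint. The real loop
// body is much larger and contains conditionals, which is why unrolling may not help.
__global__ void warp_strided_copy(float const *in, float *out, int num_tile_rows)
{
  int lane = threadIdx.x % 32;  // stand-in for warp.thread_rank()
#pragma unroll 4                // arbitrary partial-unroll factor, purely illustrative
  for (int relative_row = lane; relative_row < num_tile_rows; relative_row += 32) {
    out[relative_row] = in[relative_row];
  }
}
```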
One more question: why is this still here, and not in rapids-spark-jni?
An excellent question. I plan on moving this to `spark-rapids-jni`.
@gpucibot merge
This is the second half of the performance changes for row to column conversions. The first half is in PR #11063 and this half completes the adjustments. Good wins here from changing the way data is read and written. ~Leaving this as draft until #11063 is merged so this cleans up to just what is changed for this portion.~

```
Comparing branch-22.08/ROW_CONVERSION_BENCH to mwilson/row_to-column_optimization/ROW_CONVERSION_BENCH

Benchmark                                                                Time       CPU   Time Old   Time New   CPU Old   CPU New
----------------------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/1048576/manual_time               +0.0020   +0.0020          5          5         5         5
RowConversion/new_to_row_conversion/1048576/manual_time               -0.5679   -0.5657         16          7        16         7
RowConversion/new_to_row_extended_conversion/1048576/manual_time      -0.5642   -0.5636         16          7        16         7
RowConversion/string_to_row_extended_conversion/1048576/manual_time   -0.5734   -0.5743         27         12        27        12
RowConversion/old_from_row_conversion/1048576/manual_time             -0.0062   -0.0057          4          4         4         4
RowConversion/new_from_row_conversion/1048576/manual_time             -0.7678   -0.7674         31          7        32         7
RowConversion/new_from_row_extended_conversion/1048576/manual_time    -0.7680   -0.7675         31          7        32         7
RowConversion/string_from_row_extended_conversion/1048576/manual_time -0.5559   -0.5667         39         17        40        17
```

The major changes for this performance work revolve around data coalescing. By changing which thread reads which data and keeping the data a warp accesses linear in memory, coalescing was improved significantly. Specifically:

- All double-buffering was removed. It added complexity to the code and also proved to be a slower approach than dedicating a single block to a tile.
- `copy_from_rows` changed from using a thread per row to using a warp per row, allowing more threads to participate in the memcpy of the data.
- `copy_validity_from_rows` changed from a thread copying 8 bytes of validity data and striding to using a warp to read a column of data at a time.

closes #10055
closes #10054

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
- https://github.com/nvdbaranec

URL: #11075
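For readers unfamiliar with the warp-per-row idea described above, here is a simplified, self-contained sketch of the access pattern (hypothetical kernel and names; not the actual `copy_from_rows` implementation, which copies through shared-memory tiles and handles validity separately):

```cpp
#include <cooperative_groups.h>
#include <cstdint>

namespace cg = cooperative_groups;

// Hypothetical sketch: each warp is assigned one fixed-width row and its lanes copy
// consecutive bytes of that row, so a warp's loads and stores stay contiguous (coalesced).
__global__ void copy_rows_warp_per_row(uint8_t const *rows_in,
                                       uint8_t *rows_out,
                                       int num_rows,
                                       int row_size_bytes)
{
  auto block = cg::this_thread_block();
  auto warp  = cg::tiled_partition<32>(block);

  int warps_per_block = static_cast<int>(block.size() / warp.size());
  int warp_id         = blockIdx.x * warps_per_block + warp.meta_group_rank();
  int total_warps     = gridDim.x * warps_per_block;

  // grid-stride loop over rows, one warp per row
  for (int row = warp_id; row < num_rows; row += total_warps) {
    uint8_t const *src = rows_in + static_cast<size_t>(row) * row_size_bytes;
    uint8_t *dst       = rows_out + static_cast<size_t>(row) * row_size_bytes;
    // adjacent lanes touch adjacent bytes of the same row
    for (int b = warp.thread_rank(); b < row_size_bytes; b += warp.size()) {
      dst[b] = src[b];
    }
  }
}
```

Compared with a thread-per-row assignment, this keeps each warp's memory transactions contiguous, which is the coalescing improvement the description above refers to.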
Spent some time investigating performance on column to row and row to column conversion code. Lots of wins here. Split the PR up, so this is column to row work only. Row to column to follow. Also planning to add tests and benchmarks into `spark-rapids-jni` once this is all updated.