Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

column to row refactor for performance #11063

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Jun 7, 2022

Spent some time investigating performance on column to row and row to column conversion code. Lots of wins here. Split PR up, so this is column to row work only. Row to column to follow. Also, planning to add tests and benchmarks into spark-rapids-jni once this is all updated.

Comparing 22.08/ROW_CONVERSION_BENCH to mwilson/column_to_row_optimization/ROW_CONVERSION_BENCH
Benchmark                                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/1048576/manual_time                              +0.0012         +0.0026             5             5             5             5
RowConversion/new_to_row_conversion/1048576/manual_time                              -0.5661         -0.5652            16             7            16             7
RowConversion/new_to_row_extended_conversion/1048576/manual_time                     -0.5680         -0.5655            16             7            16             7
RowConversion/string_to_row_extended_conversion/1048576/manual_time                  -0.5768         -0.5764            27            12            27            12
RowConversion/old_from_row_conversion/1048576/manual_time                            -0.0017         -0.0018             4             4             4             4
RowConversion/new_from_row_conversion/1048576/manual_time                            +1.1629         +1.1619            31            68            32            68
RowConversion/new_from_row_extended_conversion/1048576/manual_time                   +1.1540         +1.1549            31            68            32            68
RowConversion/string_from_row_extended_conversion/1048576/manual_time                +0.7185         +0.6831            39            67            40            68

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner June 7, 2022 02:15
@github-actions github-actions bot added the Java Affects Java cuDF API. label Jun 7, 2022
@ttnghia
Copy link
Contributor

ttnghia commented Jun 7, 2022

It is recommended to use compare.py to show the benchmark changes. Here is what I do:

/usr/bin/python3 cpp/build/_deps/benchmark-src/tools/compare.py benchmarks ./0 ./1

Where ./0 and ./1 are the binaries of different test builds. cudf must be built into static lib (BUILD_SHARE_LIB=OFF).

@hyperbolic2346 hyperbolic2346 added 3 - Ready for Review Ready for review by team code quality Performance Performance related issue non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jun 7, 2022
@codecov

This comment was marked as off-topic.

@hyperbolic2346
Copy link
Contributor Author

Updated original benchmarks to show differences. The downside on the row to column is due to changes related to this that didn't play nicely with the other side. The other side is the next PR and will fix that though.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor questions, and a possible catch.

java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Show resolved Hide resolved
Comment on lines 673 to 677
#ifdef ASYNC_MEMCPY_SUPPORTED
auto &processing_barrier = tile_barrier[processing_index % NUM_TILES_PER_KERNEL_LOADED];
processing_barrier.arrive_and_wait();
tile_barrier.arrive_and_wait();
#else
group.sync();
group.sync();
#endif // ASYNC_MEMCPY_SUPPORTED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out what we observed here: that we'll need to unconditionally group.sync(), and not wait on the tile_barrier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is AMAZING! How did you find this? You are really taking great care to understand the underlying code here and I really appreciate it. This is absolutely an issue that needed to be fixed as it would be a nasty bug to find later!

java/src/main/native/src/row_conversion.cu Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
// each warp takes a column with each thread of a warp taking a row
for (int relative_col = warp.meta_group_rank(); relative_col < num_tile_cols;
relative_col += warp.meta_group_size()) {
for (int relative_row = warp.thread_rank(); relative_row < num_tile_rows;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for those more knowledgeable that myself: In spite of warp.size() being a constexpr, we would not use #pragma unroll here because the body of the inner for loop is long, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the conditionals inside the loop and the size of the loop I wouldn't expect it to improve things to attempt and unroll it, but I honestly don't know if it would. It comes may depend on how large num_tile_rows/warp.size() is as that is how many iterations each thread will do.

@ttnghia
Copy link
Contributor

ttnghia commented Jun 15, 2022

One more question: Why this is still here, not in rapids-spark-jni?

@hyperbolic2346
Copy link
Contributor Author

One more question: Why this is still here, not in rapids-spark-jni?

An excellent question. I plan on moving this to spark-rapids-jni shortly. I'm working on adding tests and benchmarks there now and once these two performance PRs are in place I will add a PR to move it out into that repo for good.

@hyperbolic2346 hyperbolic2346 requested a review from mythrocks June 16, 2022 03:55
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e3d3df3 into rapidsai:branch-22.08 Jun 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 14, 2022
This is the second-half of the performance changes for row to column conversions. The first half is in PR #11063 and this half completes the adjustments. Good wins here from changing the way data is read and written.

~Leaving this as draft until #11063 is merged so this cleans up to just what is changed for this portion.~

```Comparing branch-22.08/ROW_CONVERSION_BENCH to mwilson/row_to-column_optimization/ROW_CONVERSION_BENCH
Benchmark                                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/1048576/manual_time                              +0.0020         +0.0020             5             5             5             5
RowConversion/new_to_row_conversion/1048576/manual_time                              -0.5679         -0.5657            16             7            16             7
RowConversion/new_to_row_extended_conversion/1048576/manual_time                     -0.5642         -0.5636            16             7            16             7
RowConversion/string_to_row_extended_conversion/1048576/manual_time                  -0.5734         -0.5743            27            12            27            12
RowConversion/old_from_row_conversion/1048576/manual_time                            -0.0062         -0.0057             4             4             4             4
RowConversion/new_from_row_conversion/1048576/manual_time                            -0.7678         -0.7674            31             7            32             7
RowConversion/new_from_row_extended_conversion/1048576/manual_time                   -0.7680         -0.7675            31             7            32             7
RowConversion/string_from_row_extended_conversion/1048576/manual_time                -0.5559         -0.5667            39            17            40            17
```

The major changes for this performance change revolve around data coalescing. By changing what thread was reading what data and keeping the data a warp is accessing linear in memory, the coalescing was improved significantly here. Specifically:

- All double-buffering was removed. This added complexity to the code and also proved to be a slower approach to dedicating a single block to a tile.
- `copy_from_rows` Changed from using a thread per row to using a warp per row, allowing more threads to participate in the memcpy of the data.
- `copy_validity_from_rows` Changed from a thread copying 8 bytes of validity data and striding to using a warp to read a column of data at a time.

closes #10055 
closes #10054

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec

URL: #11075
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants