column to row refactor for performance #11063
Conversation
Updated the original benchmarks to show the differences. The downside on the row-to-column side is due to changes here that didn't play nicely with the other side; the other side is the next PR and will fix that.
A couple of minor questions, and a possible catch.
```cpp
#ifdef ASYNC_MEMCPY_SUPPORTED
  auto &processing_barrier = tile_barrier[processing_index % NUM_TILES_PER_KERNEL_LOADED];
  processing_barrier.arrive_and_wait();
  tile_barrier.arrive_and_wait();
#else
  group.sync();
  group.sync();
#endif  // ASYNC_MEMCPY_SUPPORTED
```
Calling out what we observed here: we'll need to unconditionally `group.sync()`, and not wait on the `tile_barrier`.
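A minimal sketch of what that suggestion might look like, reusing the names from the snippet above (illustrative only, not the final code):

```cpp
#ifdef ASYNC_MEMCPY_SUPPORTED
  // still wait for the asynchronous copy of this tile to complete
  auto &processing_barrier = tile_barrier[processing_index % NUM_TILES_PER_KERNEL_LOADED];
  processing_barrier.arrive_and_wait();
#endif  // ASYNC_MEMCPY_SUPPORTED
  // synchronize the whole thread block unconditionally instead of waiting on tile_barrier
  group.sync();
```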
This is AMAZING! How did you find this? You are really taking great care to understand the underlying code here and I really appreciate it. This is absolutely an issue that needed to be fixed as it would be a nasty bug to find later!
```cpp
// each warp takes a column with each thread of a warp taking a row
for (int relative_col = warp.meta_group_rank(); relative_col < num_tile_cols;
     relative_col += warp.meta_group_size()) {
  for (int relative_row = warp.thread_rank(); relative_row < num_tile_rows;
       relative_row += warp.size()) {
```
Question for those more knowledgeable than myself: in spite of `warp.size()` being a `constexpr`, we would not use `#pragma unroll` here because the body of the inner `for` loop is long, right?
With the conditionals inside the loop and the size of the loop, I wouldn't expect attempting to unroll it to improve things, but I honestly don't know if it would. It may depend on how large `num_tile_rows / warp.size()` is, as that is how many iterations each thread will do.
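For illustration only, a standalone sketch of the trade-off: the stride is the compile-time warp size, but the trip count is only known at run time, so `#pragma unroll` can at best request a partial unroll (the kernel name, the factor of 4, and the trivial body are all hypothetical):

```cpp
// Hypothetical sketch: a warp-strided loop with a partial unroll hint. The real loop
// body is much larger and contains conditionals, which is why unrolling may not help.
__global__ void warp_strided_copy(float const *in, float *out, int num_tile_rows)
{
  int lane = threadIdx.x % 32;  // stand-in for warp.thread_rank()
#pragma unroll 4                // arbitrary partial-unroll factor, purely illustrative
  for (int relative_row = lane; relative_row < num_tile_rows; relative_row += 32) {
    out[relative_row] = in[relative_row];
  }
}
```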
One more question: why is this still here, and not in rapids-spark-jni?
An excellent question. I plan on moving this to `spark-rapids-jni`.
@gpucibot merge
This is the second half of the performance changes for row to column conversions. The first half is in PR #11063 and this half completes the adjustments. Good wins here from changing the way data is read and written. ~Leaving this as draft until #11063 is merged so this cleans up to just what is changed for this portion.~

```
Comparing branch-22.08/ROW_CONVERSION_BENCH to mwilson/row_to-column_optimization/ROW_CONVERSION_BENCH

Benchmark                                                                Time       CPU   Time Old   Time New   CPU Old   CPU New
----------------------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/1048576/manual_time               +0.0020   +0.0020          5          5         5         5
RowConversion/new_to_row_conversion/1048576/manual_time               -0.5679   -0.5657         16          7        16         7
RowConversion/new_to_row_extended_conversion/1048576/manual_time      -0.5642   -0.5636         16          7        16         7
RowConversion/string_to_row_extended_conversion/1048576/manual_time   -0.5734   -0.5743         27         12        27        12
RowConversion/old_from_row_conversion/1048576/manual_time             -0.0062   -0.0057          4          4         4         4
RowConversion/new_from_row_conversion/1048576/manual_time             -0.7678   -0.7674         31          7        32         7
RowConversion/new_from_row_extended_conversion/1048576/manual_time    -0.7680   -0.7675         31          7        32         7
RowConversion/string_from_row_extended_conversion/1048576/manual_time -0.5559   -0.5667         39         17        40        17
```

The major changes for this performance work revolve around data coalescing. By changing which thread reads which data and keeping the data a warp accesses linear in memory, coalescing was improved significantly. Specifically:

- All double-buffering was removed. It added complexity to the code and also proved to be a slower approach than dedicating a single block to a tile.
- `copy_from_rows` changed from using a thread per row to using a warp per row, allowing more threads to participate in the memcpy of the data.
- `copy_validity_from_rows` changed from a thread copying 8 bytes of validity data and striding to using a warp to read a column of data at a time.

closes #10055
closes #10054

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
- https://github.com/nvdbaranec

URL: #11075
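For readers unfamiliar with the warp-per-row idea described above, here is a simplified, self-contained sketch of the access pattern (hypothetical kernel and names; not the actual `copy_from_rows` implementation, which copies through shared-memory tiles and handles validity separately):

```cpp
#include <cooperative_groups.h>
#include <cstdint>

namespace cg = cooperative_groups;

// Hypothetical sketch: each warp is assigned one fixed-width row and its lanes copy
// consecutive bytes of that row, so a warp's loads and stores stay contiguous (coalesced).
__global__ void copy_rows_warp_per_row(uint8_t const *rows_in,
                                       uint8_t *rows_out,
                                       int num_rows,
                                       int row_size_bytes)
{
  auto block = cg::this_thread_block();
  auto warp  = cg::tiled_partition<32>(block);

  int warps_per_block = static_cast<int>(block.size() / warp.size());
  int warp_id         = blockIdx.x * warps_per_block + warp.meta_group_rank();
  int total_warps     = gridDim.x * warps_per_block;

  // grid-stride loop over rows, one warp per row
  for (int row = warp_id; row < num_rows; row += total_warps) {
    uint8_t const *src = rows_in + static_cast<size_t>(row) * row_size_bytes;
    uint8_t *dst       = rows_out + static_cast<size_t>(row) * row_size_bytes;
    // adjacent lanes touch adjacent bytes of the same row
    for (int b = warp.thread_rank(); b < row_size_bytes; b += warp.size()) {
      dst[b] = src[b];
    }
  }
}
```

Compared with a thread-per-row assignment, this keeps each warp's memory transactions contiguous, which is the coalescing improvement the description above refers to.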
Spent some time investigating performance on column to row and row to column conversion code. Lots of wins here. Split the PR up, so this is column to row work only. Row to column to follow. Also planning to add tests and benchmarks into `spark-rapids-jni` once this is all updated.