-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelize gpuInitStringDescriptors
for fixed length byte array data
#16109
Parallelize gpuInitStringDescriptors
for fixed length byte array data
#16109
Conversation
Performance ImprovementMarginal only. Comparison ran several times to and similar improvements seen. Measured by running Nsight systems on the Gtest Testbed:
Old:
New:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, few optional suggestions.
@@ -277,6 +279,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size) | |||
} | |||
// this needs to be here to prevent warp 3 modifying src_pos before all threads have read it | |||
__syncthreads(); | |||
auto const tile32 = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block()); | |||
if (t < 32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to use tile32
here since we already have it, but I'm not convinced it can be done in a simple way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right that we could replace this t < 32
with tile_warp.meta_group_rank() == 0
and it should be good but the logic at L289 is messier to replace since it may be tile_warp.meta_group_rank() == 0 or 1
depending on out_threads0 == 32 or 64
so I left this as is for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could/should probably port the whole logic to thread_group
s and avoid the magic 32 multiples. I'd expect that it would not be more complex than the current logic.
Not something for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One clarifying question but looks good to me otherwise!
Co-authored-by: Vukasin Milovanovic <[email protected]> Co-authored-by: Shruti Shivakumar <[email protected]>
Co-authored-by: Yunsong Wang <[email protected]>
/merge |
Description
Closes #14113
This PR parallelizes the
gpuInitStringDescriptors
function for the fixed length byte array (FLBA) data at either warp or thread block level via cooperative groups. The function continues to execute serially (thread rank 0 in the group) for variable length arrays.CC: @etseidl
Checklist