Clean up and simplify gpuDecideCompression
#13202
Conversation
compressed_data_size += comp_res->bytes_written;
if (comp_res->status != compression_status::SUCCESS) { atomicAdd(&error_count, 1); }
}
__syncwarp();
Can we write this kernel in a block-size agnostic way? Unlike __syncthreads(), using __syncwarp() assumes that block_size == warp_size == 32.
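For context, a minimal sketch of the difference between the two barriers (illustrative only, not the PR's code; it assumes a block of 64 threads):
// Illustrative only: __syncwarp() is a barrier across the 32 lanes of the
// calling warp; __syncthreads() is a barrier across the whole block.
// With block_size > 32 they are no longer interchangeable.
__global__ void barrier_demo(int* out)
{
  __shared__ int scratch[64];
  scratch[threadIdx.x] = threadIdx.x;
  __syncwarp();    // only the calling warp is synchronized here
  __syncthreads(); // required before reading another warp's scratch entries
  out[threadIdx.x] = scratch[(threadIdx.x + 32) % 64];
}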
That depends on how we would scale the parallelism with multiple warps. If multiple warps worked on a single chunks element, then, yes, we would need to sync all threads in the block. But with multiple warps, IMO this kernel should have each warp work on a separate chunks element. In that case we don't need to synchronize different warps, and __syncwarp is still the right option.
I understand that my change left this ambiguous, as warp size is used interchangeably for block and warp size. I'll try to make this clearer.
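A sketch of that mapping (hypothetical names, just to pin down the idea):
// Hypothetical sketch: one warp per chunks element, so only intra-warp
// synchronization is needed regardless of how many warps the block holds.
auto const global_warp_id =
  (blockIdx.x * blockDim.x + threadIdx.x) / cudf::detail::warp_size;
if (global_warp_id >= chunks.size()) { return; }
auto& ck = chunks[global_warp_id];  // all 32 lanes cooperate on this chunk
// ... per-lane work on ck ...
__syncwarp();  // safe: no other warp ever touches this chunk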
Modified the kernel to work with any number of warps in a block. The size can be adjusted via the constexpr decide_compression_warps_in_block. Used warp_size as well, so we should be magic-number-free now :)
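Roughly, the launch configuration this implies (a sketch; only decide_compression_warps_in_block and warp_size are taken from the comment above, everything else is an assumption):
// Sketch of the block/grid shape implied above; the value 4 and the helper
// names other than warp_size are assumptions, not the PR's exact code.
constexpr int decide_compression_warps_in_block = 4;
constexpr int decide_compression_block_size =
  decide_compression_warps_in_block * cudf::detail::warp_size;
auto const num_blocks =
  cudf::util::div_rounding_up_safe<int>(num_chunks, decide_compression_warps_in_block);
gpuDecideCompression<<<num_blocks, decide_compression_block_size, 0, stream.value()>>>(chunks);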
auto const lane_id = threadIdx.x % cudf::detail::warp_size;
auto const warp_id = threadIdx.x / cudf::detail::warp_size;
Question for the reviewers: Are there maybe helper functions for this? Looks very generic.
Not that I am aware of.
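If such helpers were ever added, they would presumably be thin wrappers like the following (hypothetical, not existing cudf API):
// Hypothetical helpers, shown only to illustrate the suggestion above; no
// such utilities exist in cudf at the time of this discussion.
__device__ inline unsigned int lane_id() { return threadIdx.x % cudf::detail::warp_size; }
__device__ inline unsigned int warp_id() { return threadIdx.x / cudf::detail::warp_size; }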
One more fix for warpSize. Otherwise I think this is better!
Co-authored-by: Bradley Dice <[email protected]>
__shared__ __align__(8) EncColumnChunk ck_g[decide_compression_warps_in_block];
__shared__ __align__(4) unsigned int compression_error[decide_compression_warps_in_block];
Why do we align them manually? And why do we need to align them?
It allows more efficient access, at least in theory. I'm not the one who added the alignment, and I also haven't tested how this alignment impacts performance in practice.
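A toy illustration of the intent (stand-in type, not EncColumnChunk):
// Toy example: an explicit __align__(8) pins the shared array to an 8-byte
// boundary, letting the compiler combine the two 32-bit members into a single
// 64-bit load/store. Whether this measurably helps here is untested, per the
// comment above.
struct Pair { unsigned int lo; unsigned int hi; };
__global__ void aligned_demo()
{
  __shared__ __align__(8) Pair buf[4];
  if (threadIdx.x < 4) { buf[threadIdx.x] = Pair{threadIdx.x, 0}; }
}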
/merge
Description
Changed the block size to a single warp, since only 32 threads are used in the kernel.
Simplified the kernel logic a bit and removed unnecessary atomic operations.
FWIW, the kernel is faster now; not that it matters much, as it is a tiny part of the E2E time.
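One common way to drop such atomics (a sketch of the general technique, not necessarily this PR's exact code) is a warp ballot:
// Sketch: counting failed lanes without atomicAdd. Each of the 32 lanes votes
// via __ballot_sync; __popc counts the set bits of the resulting mask.
__device__ unsigned int warp_error_count(bool failed_on_this_lane)
{
  unsigned int const mask = __ballot_sync(0xffffffffu, failed_on_this_lane);
  return __popc(mask);
}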
Checklist