-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid overflow in fused_concatenate_kernel output_index #10344
Avoid overflow in fused_concatenate_kernel output_index #10344
Conversation
It looks like the python builds are failing with:
|
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10344 +/- ##
================================================
+ Coverage 10.42% 10.62% +0.19%
================================================
Files 119 122 +3
Lines 20603 20973 +370
================================================
+ Hits 2148 2228 +80
- Misses 18455 18745 +290
Continue to review full report at Codecov.
|
cpp/src/copying/concatenate.cu
Outdated
@@ -166,7 +166,7 @@ __global__ void fused_concatenate_kernel(column_device_view const* input_views, | |||
auto const output_size = output_view.size(); | |||
auto* output_data = output_view.data<T>(); | |||
|
|||
size_type output_index = threadIdx.x + blockIdx.x * blockDim.x; | |||
std::size_t output_index = threadIdx.x + blockIdx.x * blockDim.x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using output_index
in size_t
may lead to some casting problem later, since it will be compared with many other indices in int32_t
. We just need to detect if output_index
here is overflow. So let's try:
std::size_t output_index = threadIdx.x + blockIdx.x * blockDim.x; | |
std::size_t const tmp_index = threadIdx.x + blockIdx.x * blockDim.x; | |
if(tmp_index > std::numeric_limit<size_type>::max()) { return; } | |
std::size_t output_index = static_cast<size_type>(tmp_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just keep with using a 64 bit integer type, but let's use int64_t
instead of size_t
to avoid unsigned comparison warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia I moved to int64_t
given @jrhemstad's comment. I looked at the comparisons with output_index
, and I believe types are upgraded correctly so we shouldn't see an issue at least as the code is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agree :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't decltype(threadIdx.x + blockIdx.x * blockDim.x) == int32_t
, because the operands are that type?
Would it be preferable to cast an operand to int64_t
?
auto const tmp_index = threadIdx.x + static_cast<int64_t>(blockIdx.x)*blockDim.x;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the bug is actually later, and I don't think we can overflow the usual threadIdx.x + blockIdx.x * blockDim.x
math given the 256 threads per block, the way we compute the number of blocks, and row limits in cuDF, if I understand your comment @mythrocks.
The real bug is later on at:
output_index += blockDim.x * gridDim.x;
as mentioned in the PR description, that is because output_index + (blockDim.x * gridDim.x)
can be larger than int32_t
.
But this brings up a point. fused_concatenate
checks here that output_size
needs to be less than size_type::max (2^31-1
). That seems like a bug, but I am probably mistaken. Should it be output_size - 1 <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()
? I.e. if my type supports numbers from 0 to 10, and I have 11 rows, that should be OK (I can address with index 0..10). The reason I ask for this is because I wanted to test I can materialize the maximum number of rows from the concatenate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's an implicit limitation to INT_MAX - 1
elements in a column for several functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's an implicit limitation to INT_MAX - 1 elements in a column for several functions.
Ok, so it is a cuDF limitation or bug? I would have expected the number of elements in the column to be up to INT_MAX
, so that we can index using 0..INT_MAX - 1
.
Also I think the change should be to make the assert:
output_size <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()
But I am confused by your comment @jrhemstad. Should I make this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I am confused by your comment @jrhemstad. Should I make this change?
Feel free to change it. I'm just saying, don't expect a column with INT_MAX
elements to work everywhere 😉
@jrhemstad @mythrocks are you OK if we merge this? I think it is ready if you are OK with it. |
@gpucibot merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. I thought I'd approved this already.
Fixes #10333.
The repro case in the issue showed an illegal access error where the
output_index
of the strided loop infused_concatenate_kernel
can overflow for a large number of rows.For example, given 5 tables of exactly 250M rows each we would expect a result with 1,250,000,000 rows.
The kernel is launched with 4,882,813 blocks (# of rows / 256 threads rounded up) with a stride of 1,250,000,128 (256 * 4,882,813). When
output_index
reaches 897,483,520, it overflowsoutput_index
on the first iteration.The change below prevents the overflow by making
output_index
anint64_t
and adds a test that shows that we can now concatenate up tosize_type::max - 1
rows.