Avoid overflow in fused_concatenate_kernel output_index #10344

abellina · 2022-02-22T21:26:20Z

The repro case in the issue showed an illegal access error where the output_index of the strided loop in fused_concatenate_kernel can overflow for a large number of rows.

For example, given 5 tables of exactly 250M rows each we would expect a result with 1,250,000,000 rows.

The kernel is launched with 4,882,813 blocks (# of rows / 256 threads rounded up) with a stride of 1,250,000,128 (256 * 4,882,813). When output_index reaches 897,483,520, it overflows output_index on the first iteration.

The change below prevents the overflow by making output_index an int64_t and adds a test that shows that we can now concatenate up to size_type::max - 1 rows.

abellina · 2022-02-22T22:33:59Z

It looks like the python builds are failing with:

[](https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-python-build-arm64/CUDA=11.5,PYTHON=3.8/2670/console#L901)16:21:00 conda.CondaMultiError: Downloaded bytes did not match Content-Length
16:21:00   url: https://conda.anaconda.org/nvidia/linux-aarch64/cudatoolkit-11.5.1-h506b3e9_9.tar.bz2[](https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-python-build-arm64/CUDA=11.5,PYTHON=3.8/2670/console#L902)
16:21:00   target_path: /opt/conda/envs/rapids/pkgs/cudatoolkit-11.5.1-h506b3e9_9.tar.bz2

codecov · 2022-02-22T22:50:18Z

Codecov Report

Merging #10344 (3adaa31) into branch-22.04 (a7d88cd) will increase coverage by 0.19%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10344      +/-   ##
================================================
+ Coverage         10.42%   10.62%   +0.19%     
================================================
  Files               119      122       +3     
  Lines             20603    20973     +370     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18745     +290

Impacted Files	Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <ø> (ø)`
python/cudf/cudf/_fuzz_testing/io.py	`0.00% <ø> (ø)`
python/cudf/cudf/_fuzz_testing/main.py	`0.00% <ø> (ø)`
python/cudf/cudf/_lib/strings/__init__.py	`0.00% <ø> (ø)`
python/cudf/cudf/_version.py	`0.00% <ø> (ø)`
python/cudf/cudf/comm/gpuarrow.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/_base_index.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/categorical.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/column.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/datetime.py	`0.00% <ø> (ø)`
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf65ac3...3adaa31. Read the comment docs.

ttnghia · 2022-02-23T04:04:31Z

cpp/src/copying/concatenate.cu

@@ -166,7 +166,7 @@ __global__ void fused_concatenate_kernel(column_device_view const* input_views,
  auto const output_size = output_view.size();
  auto* output_data      = output_view.data<T>();

-  size_type output_index     = threadIdx.x + blockIdx.x * blockDim.x;
+  std::size_t output_index   = threadIdx.x + blockIdx.x * blockDim.x;


Using output_index in size_t may lead to some casting problem later, since it will be compared with many other indices in int32_t. We just need to detect if output_index here is overflow. So let's try:

Suggested change

std::size_t output_index = threadIdx.x + blockIdx.x * blockDim.x;

std::size_t const tmp_index = threadIdx.x + blockIdx.x * blockDim.x;

if(tmp_index > std::numeric_limit<size_type>::max()) { return; }

std::size_t output_index = static_cast<size_type>(tmp_index);

Let's just keep with using a 64 bit integer type, but let's use int64_t instead of size_t to avoid unsigned comparison warnings.

@ttnghia I moved to int64_t given @jrhemstad's comment. I looked at the comparisons with output_index, and I believe types are upgraded correctly so we shouldn't see an issue at least as the code is written.

Yeah, agree :)

Isn't decltype(threadIdx.x + blockIdx.x * blockDim.x) == int32_t, because the operands are that type?
Would it be preferable to cast an operand to int64_t?

auto const tmp_index = threadIdx.x + static_cast<int64_t>(blockIdx.x)*blockDim.x;

So the bug is actually later, and I don't think we can overflow the usual threadIdx.x + blockIdx.x * blockDim.x math given the 256 threads per block, the way we compute the number of blocks, and row limits in cuDF, if I understand your comment @mythrocks.

The real bug is later on at:

output_index += blockDim.x * gridDim.x;

as mentioned in the PR description, that is because output_index + (blockDim.x * gridDim.x) can be larger than int32_t.

But this brings up a point. fused_concatenate checks here that output_size needs to be less than size_type::max (2^31-1). That seems like a bug, but I am probably mistaken. Should it be output_size - 1 <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()? I.e. if my type supports numbers from 0 to 10, and I have 11 rows, that should be OK (I can address with index 0..10). The reason I ask for this is because I wanted to test I can materialize the maximum number of rows from the concatenate.

I think there's an implicit limitation to INT_MAX - 1 elements in a column for several functions.

I think there's an implicit limitation to INT_MAX - 1 elements in a column for several functions.

Ok, so it is a cuDF limitation or bug? I would have expected the number of elements in the column to be up to INT_MAX, so that we can index using 0..INT_MAX - 1.

Also I think the change should be to make the assert:

output_size <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()

But I am confused by your comment @jrhemstad. Should I make this change?

But I am confused by your comment @jrhemstad. Should I make this change?

Feel free to change it. I'm just saying, don't expect a column with INT_MAX elements to work everywhere 😉

abellina · 2022-02-25T19:15:35Z

@jrhemstad @mythrocks are you OK if we merge this? I think it is ready if you are OK with it.

jrhemstad · 2022-02-28T14:54:51Z

@gpucibot merge

mythrocks

Yikes. I thought I'd approved this already.

Avoid overflow in fused_concatenate_kernel output_index

bdd0317

abellina requested a review from a team as a code owner February 22, 2022 21:26

abellina requested review from mythrocks and ttnghia February 22, 2022 21:26

abellina added the non-breaking label Feb 22, 2022

github-actions bot added the libcudf label Feb 22, 2022

abellina added bug and removed libcudf labels Feb 22, 2022

abellina mentioned this pull request Feb 22, 2022

[BUG] fused_concatenate_kernel can cause illegal memory access #10333

Closed

Fix styles

94d5808

github-actions bot added the libcudf label Feb 22, 2022

Fix copyrights

b3f9c7a

ttnghia reviewed Feb 23, 2022

View reviewed changes

Use int64_t instead of std::size_t

2527ab1

ttnghia approved these changes Feb 23, 2022

View reviewed changes

abellina added 2 commits February 24, 2022 14:37

Allow up to INT_MAX elements as a result of fused_concatenate

f135b2b

Fix comment in test

3adaa31

jrhemstad approved these changes Feb 28, 2022

View reviewed changes

mythrocks approved these changes Feb 28, 2022

View reviewed changes

rapids-bot bot merged commit 64ee514 into rapidsai:branch-22.04 Feb 28, 2022

abellina deleted the fused_concatenate_illegal_access branch February 28, 2022 18:14

nvdbaranec mentioned this pull request Mar 15, 2022

Prevent grid stride loop overflow in libcudf kernels #10368

Closed

pentschev mentioned this pull request Aug 9, 2023

[BUG] Errors converting tables from arrow to cuDF #13771

Closed

bdice mentioned this pull request Aug 9, 2023

Fix read out of bounds in string concatenate #13838

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid overflow in fused_concatenate_kernel output_index #10344

Avoid overflow in fused_concatenate_kernel output_index #10344

abellina commented Feb 22, 2022 •

edited

Loading

abellina commented Feb 22, 2022 •

edited

Loading

codecov bot commented Feb 22, 2022 •

edited

Loading

ttnghia Feb 23, 2022 •

edited

Loading

jrhemstad Feb 23, 2022

abellina Feb 23, 2022

ttnghia Feb 23, 2022

mythrocks Feb 23, 2022

abellina Feb 23, 2022

jrhemstad Feb 23, 2022

abellina Feb 24, 2022 •

edited

Loading

jrhemstad Feb 24, 2022

abellina commented Feb 25, 2022

jrhemstad commented Feb 28, 2022

mythrocks left a comment

-  std::size_t output_index   = threadIdx.x + blockIdx.x * blockDim.x;
+  std::size_t const tmp_index = threadIdx.x + blockIdx.x * blockDim.x;
+  if(tmp_index > std::numeric_limit<size_type>::max()) { return; }
+  std::size_t output_index   = static_cast<size_type>(tmp_index);

Avoid overflow in fused_concatenate_kernel output_index #10344

Avoid overflow in fused_concatenate_kernel output_index #10344

Conversation

abellina commented Feb 22, 2022 • edited Loading

abellina commented Feb 22, 2022 • edited Loading

codecov bot commented Feb 22, 2022 • edited Loading

Codecov Report

ttnghia Feb 23, 2022 • edited Loading

Choose a reason for hiding this comment

jrhemstad Feb 23, 2022

Choose a reason for hiding this comment

abellina Feb 23, 2022

Choose a reason for hiding this comment

ttnghia Feb 23, 2022

Choose a reason for hiding this comment

mythrocks Feb 23, 2022

Choose a reason for hiding this comment

abellina Feb 23, 2022

Choose a reason for hiding this comment

jrhemstad Feb 23, 2022

Choose a reason for hiding this comment

abellina Feb 24, 2022 • edited Loading

Choose a reason for hiding this comment

jrhemstad Feb 24, 2022

Choose a reason for hiding this comment

abellina commented Feb 25, 2022

jrhemstad commented Feb 28, 2022

mythrocks left a comment

Choose a reason for hiding this comment

abellina commented Feb 22, 2022 •

edited

Loading

abellina commented Feb 22, 2022 •

edited

Loading

codecov bot commented Feb 22, 2022 •

edited

Loading

ttnghia Feb 23, 2022 •

edited

Loading

abellina Feb 24, 2022 •

edited

Loading