[BUG] `fused_concatenate_kernel` can cause illegal memory access #10333

abellina · 2022-02-18T22:55:24Z

When concatenating via `fused_concatenate_kernel`, we see an issue where the index of its strided loop can overflow, even if the number of output rows is < 2B. Here is a small Scala repro case:

import ai.rapids.cudf.{ColumnVector, Cuda, Scalar, Table}
val s = Scalar.fromByte(1.toByte)
val tbls = (0 until 5).map { _ => new Table(ColumnVector.fromScalar(s, 250 * 1000 * 1000)) }
Table.concatenate(tbls:_*)
Cuda.DEFAULT_STREAM.sync

This produces:

ai.rapids.cudf.CudaException: an illegal memory access was encountered
  at ai.rapids.cudf.Cuda.streamSynchronize(Native Method)
  at ai.rapids.cudf.Cuda$Stream.sync(Cuda.java:111)
  ... 47 elided

After adding printfs, we found that the grid size is very large and that the stride update (https://github.com/rapidsai/cudf/blob/branch-22.04/cpp/src/copying/concatenate.cu#L200) can cause the `size_type` index to become negative.

It seems the simplest fix is to change output_index to be std::size_t (https://github.com/rapidsai/cudf/blob/branch-22.04/cpp/src/copying/concatenate.cu#L200).
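As a rough host-side illustration of why a wider index helps, the sketch below emulates a single thread's first stride update with a 32-bit and a 64-bit index. It is not the libcudf kernel: `size_type` here is a local stand-in for `cudf::size_type`, and the helper names (`wrapping_add`, `next_index_32`, `next_index_64`, `loop_continues_32`, `loop_continues_64`) are hypothetical. The wrap-around is emulated with unsigned arithmetic, since signed overflow is undefined behavior in C++.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for cudf::size_type (a 32-bit signed integer). Assumption for this sketch.
using size_type = std::int32_t;

// Emulates two's-complement wrap-around of a 32-bit signed addition
// without invoking signed-overflow undefined behavior.
inline size_type wrapping_add(size_type a, size_type b)
{
  return static_cast<size_type>(static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b));
}

// Index a thread would hold after one stride update if the index is 32-bit.
inline size_type next_index_32(size_type start, size_type stride)
{
  return wrapping_add(start, stride);
}

// Same update with a 64-bit index: the sum is representable, so nothing wraps.
inline std::int64_t next_index_64(size_type start, size_type stride)
{
  return static_cast<std::int64_t>(start) + static_cast<std::int64_t>(stride);
}

// Loop condition `index < output_size` evaluated after the update.
// A wrapped (negative) 32-bit index still passes this check.
inline bool loop_continues_32(size_type start, size_type stride, size_type output_size)
{
  return next_index_32(start, stride) < output_size;
}

// With a 64-bit index the condition fails and the loop exits as intended.
inline bool loop_continues_64(size_type start, size_type stride, size_type output_size)
{
  return next_index_64(start, stride) < static_cast<std::int64_t>(output_size);
}
```

For a thread starting at index 1,000,000,000 with the stride from the repro (1,250,000,128), the 32-bit update wraps to a negative value that still satisfies `index < output_size`, so the loop body would run with an invalid index. The 64-bit update yields 2,250,000,128 and the loop terminates.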

Thanks to @nvdbaranec and @jlowe for debugging this with me.


abellina · 2022-02-22T17:24:54Z

I'll have this up for review today. I added a test that reproduces it, and I am building cuDF with the fix.

abellina · 2022-02-22T21:29:19Z

I put up PR #10344 to address the `fused_concatenate_kernel` specifically.

I've seen the same bug in other kernels, and I am looking for input on how that should be handled. Do we want one PR that handles the issue as a whole?

For example:

fused_concatenate_string_offset_kernel, fused_concatenate_string_chars_kernel
get_json_object_kernel

Fixes #10333. The repro case in the issue showed an illegal memory access caused by the `output_index` of the strided loop in `fused_concatenate_kernel` overflowing for a large number of rows. For example, given 5 tables of exactly 250M rows each, we would expect a result with 1,250,000,000 rows. The kernel is launched with 4,882,813 blocks (# of rows / 256 threads, rounded up), which gives a stride of 1,250,000,128 (256 * 4,882,813). For any thread whose `output_index` starts at 897,483,520 or higher, adding the stride exceeds the `size_type` maximum of 2,147,483,647, so `output_index` overflows on the first stride update. The change below prevents the overflow by making `output_index` an `int64_t` and adds a test that shows that we can now concatenate up to `size_type::max - 1` rows.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Jake Hemstad (https://github.com/jrhemstad)
- MithunR (https://github.com/mythrocks)

URL: #10344
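The numbers in that description can be re-derived with a few lines of host-side C++. This is only a sketch of the arithmetic: the block size of 256 comes from the description, while the function names (`num_blocks`, `grid_stride`, `first_overflowing_index`) are hypothetical and do not appear in libcudf.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Threads per block, as stated in the PR description.
constexpr std::int64_t block_size = 256;

// Number of blocks: rows divided by the block size, rounded up.
constexpr std::int64_t num_blocks(std::int64_t rows)
{
  return (rows + block_size - 1) / block_size;
}

// Grid stride: total number of threads launched.
constexpr std::int64_t grid_stride(std::int64_t rows)
{
  return num_blocks(rows) * block_size;
}

// Smallest starting index for which `index + stride` no longer fits
// in a 32-bit signed integer.
constexpr std::int64_t first_overflowing_index(std::int64_t rows)
{
  return static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::max()) - grid_stride(rows) + 1;
}
```

With `rows = 1250000000`, these give 4,882,813 blocks, a stride of 1,250,000,128, and a first overflowing index of 897,483,520, matching the figures quoted above.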

If data is sufficiently large, `fused_concatenate_string_chars_kernel` will attempt to read out of bounds and ultimately cause CUDA to raise `cudaErrorIllegalAddress`. Details on how the issue was encountered are in #13771, although this was an [already known problem](#10333 (comment)). Fixes #13771.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: #13838

abellina added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Feb 18, 2022

abellina mentioned this issue Feb 18, 2022

[BUG] cudaErrorIllegalAddress for q95 (3TB) on GCP with ASYNC allocator NVIDIA/spark-rapids#4710

Closed

abellina changed the title ~~[BUG] fused_concatenate_kernel can overflow~~ [BUG] fused_concatenate_kernel can cause illegal memory access Feb 18, 2022

abellina self-assigned this Feb 18, 2022

abellina mentioned this issue Feb 22, 2022

Avoid overflow in fused_concatenate_kernel output_index #10344

Merged

rapids-bot bot closed this as completed in #10344 Feb 28, 2022

nvdbaranec mentioned this issue Feb 28, 2022

Prevent grid stride loop overflow in libcudf kernels #10368

Open

This was referenced Aug 9, 2023

Fix read out of bounds in string concatenate #13838

Merged

[BUG] Errors converting tables from arrow to cuDF #13771

Closed

bdice removed the Needs Triage (Need team to review and classify) label Mar 4, 2024
