Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] fused_concatenate_kernel can cause illegal memory access #10333

Closed
abellina opened this issue Feb 18, 2022 · 2 comments · Fixed by #10344
Closed

[BUG] fused_concatenate_kernel can cause illegal memory access #10333

abellina opened this issue Feb 18, 2022 · 2 comments · Fixed by #10344
Assignees
Labels
bug Something isn't working

Comments

@abellina
Copy link
Contributor

abellina commented Feb 18, 2022

When concatenating via the fused_concatenate_kernel we see an issue where its strided loop can overflow, even if the output number of rows is < 2B. Here is a small scala repro case:

import ai.rapids.cudf.{ColumnVector, Cuda, Scalar, Table}
val s = Scalar.fromByte(1.toByte)
val tbls = (0 until 5).map { _ => new Table(ColumnVector.fromScalar(s, 250 * 1000 * 1000)) }
Table.concatenate(tbls:_*)
Cuda.DEFAULT_STREAM.sync

Will produce:

ai.rapids.cudf.CudaException: an illegal memory access was encountered
  at ai.rapids.cudf.Cuda.streamSynchronize(Native Method)
  at ai.rapids.cudf.Cuda$Stream.sync(Cuda.java:111)
  ... 47 elided

After adding printfs, the grid size is very large and the stride update: https://github.com/rapidsai/cudf/blob/branch-22.04/cpp/src/copying/concatenate.cu#L200 can cause the size_type index to become negative.

It seems the simplest fix is to change output_index to be std::size_t (https://github.com/rapidsai/cudf/blob/branch-22.04/cpp/src/copying/concatenate.cu#L200).

Thanks to @nvdbaranec and @jlowe for debugging this with me.

@abellina abellina added bug Something isn't working Needs Triage Need team to review and classify labels Feb 18, 2022
@abellina abellina changed the title [BUG] fused_concatenate_kernel can overflow [BUG] fused_concatenate_kernel can cause illegal memory access Feb 18, 2022
@abellina abellina self-assigned this Feb 18, 2022
@abellina
Copy link
Contributor Author

I'll have this up for review today. I added a test that reproduces it and building cuDF with the solution.

@abellina
Copy link
Contributor Author

I put up the PR #10344, to address the fused_concatenate_kernel specifically.

I've seen the same bug in other kernels. I am looking for input on how that should be handled. Does it seem we want 1 PR that handles the issue as a whole?

For example:

  • fused_concatenate_string_offset_kernel, fused_concatenate_string_chars_kernel
  • get_json_object_kernel

rapids-bot bot pushed a commit that referenced this issue Feb 28, 2022
Fixes #10333.

The repro case in the issue showed an illegal access error where the `output_index` of the strided loop in `fused_concatenate_kernel` can overflow for a large number of rows. 

For example, given 5 tables of exactly 250M rows each we would expect a result with 1,250,000,000 rows. 

The kernel is launched with 4,882,813 blocks (# of rows / 256 threads rounded up) with a stride of 1,250,000,128 (256 * 4,882,813). When `output_index` reaches 897,483,520, it overflows `output_index` on the first iteration.

The change below prevents the overflow by making `output_index` an `int64_t` and adds a test that shows that we can now concatenate up to `size_type::max - 1` rows.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jake Hemstad (https://github.com/jrhemstad)
  - MithunR (https://github.com/mythrocks)

URL: #10344
rapids-bot bot pushed a commit that referenced this issue Aug 16, 2023
If data is sufficiently large, `fused_concatenate_string_chars_kernel` will attempt to read out of bounds and ultimately cause CUDA to raise `cudaErrorIllegalAddress`. Details on how the issue was encountered are in #13771, although this was an [already known problem](#10333 (comment)).

Fixes #13771 .

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13838
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants