Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] IllegalMemoryAccess sometimes on sorted group by min of strings #9156

Closed
revans2 opened this issue Aug 31, 2021 · 2 comments · Fixed by #9263
Closed

[BUG] IllegalMemoryAccess sometimes on sorted group by min of strings #9156

revans2 opened this issue Aug 31, 2021 · 2 comments · Fixed by #9263
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 31, 2021

Describe the bug
So I have been trying to implement some list aggregations using the existing sorted group by cudf code. At least until #9135 is implemented. I see failures when doing this for strings about half the time with IllegalMemoryAccess errors. I also see similar errors doing max on strings too.

Steps/Code to reproduce bug

I know it is a little convoluted to make it happen, but I finally got a reproducible case in C++.

You need to apply patch.txt and unzip data.zip placing the parquet file in whatever directory you are going to run gtests/GROUPBY_TEST from. The patch disables most of the tests, but also adds a test for min that will read in the parquet file and then do an explode and finally a min aggregation. The exact steps to make this happen are a bit difficult, which is why the test is written the way it is so that it is close to how the java code was releasing things as it went.

About half to a third of the time I see it crash with an error like.

[ RUN      ] groupby_min_string_test.min_sorted_after_explode
unknown file: Failure
C++ exception with description "reduce_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] groupby_min_string_test.min_sorted_after_explode (42 ms)

With other failures after it because the illegal memory access was not cleared. When I run with cuda-memcheck I get log.txt.

Not sure if this is an error in the aggregation code or if something in explode_outer is not working properly. I cannot reproduce it if I save the data after the explode_outer and do the aggregation just from the raw data.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 31, 2021
@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 7, 2021
@davidwendt davidwendt self-assigned this Sep 15, 2021
@davidwendt
Copy link
Contributor

This appears to be a bug in thrust::reduce_by_key where invalid data is passed to the BinaryFunction operator by it's internal code for certain input vector lengths. The test case provided in the description happened to hit the bug. The random data is sometimes null or 0 but intermittently an invalid device memory pointer is passed causing the crash here.

I've opened a but with thrust at NVIDIA/cccl#789

Meanwhile, I believe I can code a workaround to this.

@davidwendt
Copy link
Contributor

@revans2 I apologize if you already know this but to execute a single test case you can use the gtest_filter parameter like the following:

gtests/GROUPBY_TEST --gtest_filter=groupby_min_string_test.min_sorted_after_explode

This way, you do not have to comment out all the other tests to execute just one.
This parameter also accepts wildcards for the test name.

rapids-bot bot pushed a commit that referenced this issue Sep 22, 2021
…9263)

Closes #9156

This PR simplifies the parameters when calling thrust::reduce_by_key for the argmin/argmax aggregations in cudf::groupby. The illegalMemoryAccess found in #9156 was due to invalid data being passed from thrust::reduce_by_key through to the BinaryPredicate function as documented in NVIDIA/thrust#1525

The invalid data being passed is only a real issue for strings columns where the device pointer was neither nullptr nor a valid address. The new logic provides only size_type values to thrust::reduce_by_key so invalid values can only be out-of-bounds for the input column which is easily checked before retrieving the string_view objects within the ArgMin and ArgMax operators.

This the same as #9244 but based on 21.10

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #9263
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
3 participants