Describe the bug
I have been trying to implement some list aggregations using the existing sorted groupby code in cudf, at least until #9135 is implemented. When doing this for strings, I see failures with IllegalMemoryAccess errors about half the time. I also see similar errors doing max on strings.
Steps/Code to reproduce bug
I know it is a little convoluted to make it happen, but I finally got a reproducible case in C++.
You need to apply patch.txt and unzip data.zip, placing the parquet file in whatever directory you are going to run gtests/GROUPBY_TEST from. The patch disables most of the tests, but also adds a test for min that reads in the parquet file, does an explode, and finally does a min aggregation. The exact steps to make this happen are a bit difficult, which is why the test is written the way it is: it stays close to how the Java code was releasing things as it went.
About half to a third of the time I see it crash with an error like:
[ RUN ] groupby_min_string_test.min_sorted_after_explode
unknown file: Failure
C++ exception with description "reduce_by_key: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered" thrown in the test body.
[ FAILED ] groupby_min_string_test.min_sorted_after_explode (42 ms)
Other failures follow it because the illegal memory access was not cleared. When I run with cuda-memcheck I get log.txt.
Not sure if this is an error in the aggregation code or if something in explode_outer is not working properly. I cannot reproduce it if I save the data after the explode_outer and do the aggregation just from the raw data.
This appears to be a bug in thrust::reduce_by_key where invalid data is passed to the BinaryFunction operator by its internal code for certain input vector lengths. The test case provided in the description happened to hit the bug. The invalid data is sometimes null or 0, but intermittently an invalid device memory pointer is passed, causing the crash here.
…9263)
Closes #9156
This PR simplifies the parameters when calling thrust::reduce_by_key for the argmin/argmax aggregations in cudf::groupby. The illegal memory access found in #9156 was due to invalid data being passed from thrust::reduce_by_key through to the BinaryFunction operator, as documented in NVIDIA/thrust#1525.
The invalid data being passed is only a real issue for strings columns, where the device pointer was neither nullptr nor a valid address. The new logic provides only size_type values to thrust::reduce_by_key, so an invalid value can only be an out-of-bounds index for the input column. That is easily checked before retrieving the string_view objects within the ArgMin and ArgMax operators.
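To illustrate the idea, here is a minimal host-only C++ sketch, not the actual cudf code. It assumes a strings column can be modeled as a `std::vector<std::string_view>`, and the names `ArgMinByIndex` and `reduce_by_key_host` are hypothetical. The operator receives only row indices, so a garbage value can be rejected with a range check before any string is touched.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>
#include <vector>

using size_type = std::int32_t;  // assumption: 32-bit row index, as in cudf

// Hypothetical ArgMin operator working on row indices instead of string_views.
struct ArgMinByIndex {
  const std::vector<std::string_view>* rows;  // host stand-in for a strings column

  bool in_bounds(size_type i) const {
    return i >= 0 && i < static_cast<size_type>(rows->size());
  }

  size_type operator()(size_type lhs, size_type rhs) const {
    // An out-of-range index is treated as garbage: keep the other operand.
    if (!in_bounds(lhs)) return rhs;
    if (!in_bounds(rhs)) return lhs;
    // Both indices are valid, so it is safe to read the strings now.
    return (*rows)[rhs] < (*rows)[lhs] ? rhs : lhs;
  }
};

// Sequential stand-in for a reduce-by-key over runs of equal, adjacent keys.
template <typename Op>
std::vector<std::pair<int, size_type>> reduce_by_key_host(
    const std::vector<int>& keys, const std::vector<size_type>& values, Op op) {
  std::vector<std::pair<int, size_type>> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (!out.empty() && out.back().first == keys[i]) {
      out.back().second = op(out.back().second, values[i]);  // same group: combine
    } else {
      out.emplace_back(keys[i], values[i]);  // new group starts here
    }
  }
  return out;
}
```

If both operands are out of range, the sketch still returns an invalid index, so a caller would need to validate the final result as well.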
This is the same as #9244 but based on 21.10.
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Devavret Makkar (https://github.com/devavret)
- Nghia Truong (https://github.com/ttnghia)
- Robert Maynard (https://github.com/robertmaynard)
URL: #9263