Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby #9263

Merged

Conversation

davidwendt
Copy link
Contributor

Closes #9156

This PR simplifies the parameters when calling thrust::reduce_by_key for the argmin/argmax aggregations in cudf::groupby. The illegalMemoryAccess found in #9156 was due to invalid data being passed from thrust::reduce_by_key through to the BinaryPredicate function as documented in NVIDIA/cccl#789

The invalid data being passed is only a real issue for strings columns where the device pointer was neither nullptr nor a valid address. The new logic provides only size_type values to thrust::reduce_by_key so invalid values can only be out-of-bounds for the input column which is easily checked before retrieving the string_view objects within the ArgMin and ArgMax operators.

This the same as #9244 but based on 21.10

@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Sep 21, 2021
@davidwendt davidwendt self-assigned this Sep 21, 2021
@davidwendt davidwendt requested a review from a team as a code owner September 21, 2021 13:10
@codecov
Copy link

codecov bot commented Sep 21, 2021

Codecov Report

Merging #9263 (979d3ec) into branch-21.10 (3ee3ecf) will decrease coverage by 0.02%.
The diff coverage is 0.00%.

❗ Current head 979d3ec differs from pull request most recent head 50161dd. Consider uploading reports for the commit 50161dd to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.10    #9263      +/-   ##
================================================
- Coverage         10.85%   10.83%   -0.03%     
================================================
  Files               115      116       +1     
  Lines             19158    18781     -377     
================================================
- Hits               2080     2035      -45     
+ Misses            17078    16746     -332     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/text.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/custreamz/custreamz/tests/conftest.py 71.42% <0.00%> (-7.15%) ⬇️
python/custreamz/custreamz/tests/test_kafka.py 35.71% <0.00%> (-7.15%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 96.97% <0.00%> (-2.42%) ⬇️
python/dask_cudf/dask_cudf/backends.py 82.31% <0.00%> (-0.43%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba2cbd9...50161dd. Read the comment docs.

// The extra bounds checking is due to issue github.com/rapidsai/cudf/9156 and
// github.com/NVIDIA/thrust/issues/1525
// where invalid random values may be passed here by thrust::reduce_by_key
if (lhs < 0 || lhs >= d_col.size() || d_col.is_null(lhs)) { return rhs; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if rhs here is also out of bound or null? Is returning out of bound or null desired?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the bounds checking is necessary seeing as this gets its values from a counting iterator which is always less than the size of d_col.

For null, I do think it's ok to return the index of a null element. If both are null then either can be returned, the winning idx will later be removed when compared against an idx corresponding to valid value. And if the entire group contains nulls then it'll be nullified in the group mask generation step.

Copy link
Contributor Author

@davidwendt davidwendt Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bounds checking is necessary because the two issues mentioned in the comment above.The thrust::reduce_by_key may actually pass an invalid (out-of-bounds) value because in certain cases it does not call through the iterator to retrieve the value it should pass. In these cases, it ignores the output but the damage is done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. So bound check here is only to ensure the line 55 below works correctly, not to ensure the output to be used correctly later (since it is already handled).

// The extra bounds checking is due to issue github.com/rapidsai/cudf/9156 and
// github.com/NVIDIA/thrust/issues/1525
// where invalid random values may be passed here by thrust::reduce_by_key
if (lhs < 0 || lhs >= d_col.size() || d_col.is_null(lhs)) { return rhs; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the bounds checking is necessary seeing as this gets its values from a counting iterator which is always less than the size of d_col.

For null, I do think it's ok to return the index of a null element. If both are null then either can be returned, the winning idx will later be removed when compared against an idx corresponding to valid value. And if the entire group contains nulls then it'll be nullified in the group mask generation step.

return thrust::get<1>(rhs);
// The extra bounds checking is due to issue github.com/rapidsai/cudf/9156 and
// github.com/NVIDIA/thrust/issues/1525
// where invalid random values may be passed here by thrust::reduce_by_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// where invalid random values

Is there ever valid random values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could happen. The data could be randomly valid. Either way, the thrust::reduce_by_key ignores the result -- it is just trying to fill a block/warp with minimal divergence.

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9da7c01 into rapidsai:branch-21.10 Sep 22, 2021
@davidwendt davidwendt deleted the bug-argmin-groupby-string branch September 22, 2021 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] IllegalMemoryAccess sometimes on sorted group by min of strings
4 participants