-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix libcudf memory errors #8884
Fix libcudf memory errors #8884
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #8884 +/- ##
===============================================
Coverage ? 10.60%
===============================================
Files ? 116
Lines ? 19003
Branches ? 0
===============================================
Hits ? 2015
Misses ? 16988
Partials ? 0 Continue to review full report at Codecov.
|
@nvdbaranec I am debugging an out of memory access failure. cudf/cpp/tests/copying/concatenate_tests.cu Line 485 in 7de2dc8
cudf/cpp/tests/copying/concatenate_tests.cu Line 516 in 7de2dc8
Should the chars column size be |
Correct, that looks like the right fix. |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
rerun tests |
rerun tests |
Dave Baranec on PTO. Already addressed his review comments.
@gpucibot merge |
Reference issue #8883 and depends on PR #8884 The `PARQUET_TEST` fails with cuda-memcheck when called with the `rmm_mode=cuda` parameter. The 4-byte read error is caused by this code logic: ``` // Scan to get distance by which each offset value is shifted due to the insertion of empties auto scan_it = cudf::detail::make_counting_transform_iterator( column_offsets[level], [off = lcv.offsets().data<size_type>()] __device__(auto i) -> int { return off[i] == off[i + 1]; }); rmm::device_uvector<size_type> scan_out(offset_size_at_level, stream); thrust::exclusive_scan(rmm::exec_policy(stream), scan_it, scan_it + offset_size_at_level, scan_out.begin()); ``` The `scan_it` lambda will read one offset value passed the end due to the `off[i + 1]` statement. The `exclusive_scan` does not actually use the last element so the code was modified to just return `false` if the index, `i` is greater than or equal to the size of the offsets column. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Devavret Makkar (https://github.com/devavret) URL: #8995
…8994) Reference issue #8883 and depends on fixes in PR #8884 The `get_list_child_to_list_row_mapping` builds a map for rolling operation on a lists column. In the `thrust::scatter` call a map value includes the last offset which will always be out-of-bounds to given output vector. This output vector is used to build the resultant output map by calling `thrust::inclusive_scan` but the out-of-bounds offset value is not used -- which is why the utility does not fail. The fix in this PR simply allocates an extra row in the intermediate vector so the `thrust::scatter` will not write to out-of-bounds memory. Since the value is eventually ignored, it does not effect the result. The code in this function was creating many temporary columns incorrectly using the passed in `device_resource_manager` variable `mr`. The code was corrected by changing these to be just `device_uvector's` instead making it more clear that these are internal temporary memory buffers. Further the code calling `get_list_child_to_list_row_mapping` utility is using the output as a temporary column and so this PR fixes the logic to correct the memory resource usage. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - MithunR (https://github.com/mythrocks) - Karthikeyan (https://github.com/karthikeyann) URL: #8994
Fixes #8883
All memory errors caught in libcudf unit tests are fixed in this PR.
These unit tests are checked with
cuda-memcheck
before and after the fix.The following tests FAILED: