Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix libcudf memory errors #8884

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Jul 28, 2021

Fixes #8883

All memory errors caught in libcudf unit tests are fixed in this PR.
These unit tests are checked with cuda-memcheck before and after the fix.

The following tests FAILED:

  • 2 - SCALAR_TEST (Failed) (Fixed)
  • 32 - COPYING_TEST (Failed) (Fixed)
  • 39 - MERGE_TEST (Failed) (Fixed)
  • 46 - FACTORIES_TEST (Failed) (Fixed by scalar test fix)

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 28, 2021
@karthikeyann karthikeyann added 2 - In Progress Currently a work in progress bug Something isn't working non-breaking Non-breaking change labels Jul 28, 2021
@codecov
Copy link

codecov bot commented Jul 28, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@115f3b6). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 2770901 differs from pull request most recent head 571bb5a. Consider uploading reports for the commit 571bb5a to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8884   +/-   ##
===============================================
  Coverage                ?   10.60%           
===============================================
  Files                   ?      116           
  Lines                   ?    19003           
  Branches                ?        0           
===============================================
  Hits                    ?     2015           
  Misses                  ?    16988           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 115f3b6...571bb5a. Read the comment docs.

@karthikeyann
Copy link
Contributor Author

@nvdbaranec I am debugging an out of memory access failure.
In the following 2 lines in unit tests,

auto many_chars = cudf::make_fixed_width_column(data_type{type_id::INT8}, num_rows);

auto many_chars = cudf::make_fixed_width_column(data_type{type_id::INT8}, num_rows);

Should the chars column size be total_chars_size ?

@karthikeyann karthikeyann marked this pull request as ready for review August 1, 2021 19:28
@karthikeyann karthikeyann requested a review from a team as a code owner August 1, 2021 19:28
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 1, 2021
cpp/tests/utilities/column_utilities.cu Outdated Show resolved Hide resolved
@nvdbaranec
Copy link
Contributor

@nvdbaranec I am debugging an out of memory access failure.
In the following 2 lines in unit tests,

auto many_chars = cudf::make_fixed_width_column(data_type{type_id::INT8}, num_rows);

auto many_chars = cudf::make_fixed_width_column(data_type{type_id::INT8}, num_rows);

Should the chars column size be total_chars_size ?

Correct, that looks like the right fix.

@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann karthikeyann requested a review from nvdbaranec August 3, 2021 14:30
@karthikeyann karthikeyann requested a review from nvdbaranec August 6, 2021 07:20
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

rerun tests

@davidwendt
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d22fdcf into rapidsai:branch-21.10 Aug 12, 2021
rapids-bot bot pushed a commit that referenced this pull request Aug 13, 2021
Reference issue #8883 and depends on PR #8884

The `PARQUET_TEST` fails with cuda-memcheck when called with the `rmm_mode=cuda` parameter. The 4-byte read error is caused by this code logic:

```
// Scan to get distance by which each offset value is shifted due to the insertion of empties
auto scan_it = cudf::detail::make_counting_transform_iterator( column_offsets[level],
  [off = lcv.offsets().data<size_type>()] __device__(auto i) -> int { return off[i] == off[i + 1]; });
rmm::device_uvector<size_type> scan_out(offset_size_at_level, stream);
thrust::exclusive_scan(rmm::exec_policy(stream), scan_it, scan_it + offset_size_at_level, scan_out.begin());
```

The `scan_it` lambda will read one offset value passed the end due to the `off[i + 1]` statement. The `exclusive_scan` does not actually use the last element so the code was modified to just return `false` if the index, `i` is greater than or equal to the size of the offsets column.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Devavret Makkar (https://github.com/devavret)

URL: #8995
rapids-bot bot pushed a commit that referenced this pull request Aug 23, 2021
…8994)

Reference issue #8883 and depends on fixes in PR #8884

The `get_list_child_to_list_row_mapping` builds a map for rolling operation on a lists column. In the `thrust::scatter` call a map value includes the last offset which will always be out-of-bounds to given output vector. This output vector is used to build the resultant output map by calling `thrust::inclusive_scan` but the out-of-bounds offset value is not used -- which is why the utility does not fail. The fix in this PR simply allocates an extra row in the intermediate vector so the `thrust::scatter` will not write to out-of-bounds memory. Since the value is eventually ignored, it does not effect the result.

The code in this function was creating many temporary columns incorrectly using the passed in `device_resource_manager` variable `mr`. The code was corrected by changing these to be just `device_uvector's` instead making it more clear that these are internal temporary memory buffers. Further the code calling `get_list_child_to_list_row_mapping` utility is using the output as a temporary column and so this PR fixes the logic to correct the memory resource usage.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - MithunR (https://github.com/mythrocks)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #8994
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Memory errors when libcudf tests are run with rmm_mode=cuda
5 participants