Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run compute-sanitizer in nightly build #9641

Merged
merged 16 commits into from
Nov 30, 2021

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Nov 9, 2021

Addresses part of #904

  • This PR enables run of compute-sanitizer --tool memcheck on libcudf unit tests when env COMPUTE_SANITIZER_ENABLE=true
    This env COMPUTE_SANITIZER_ENABLE will be enabled only in nightly builds of cudf. (To be Enabled in PR https://github.com/rapidsai/gpuci-scripts/pull/675)
  • This PR also adds script to parse compute-sanitizer log to junit xml file which can be processed by Jenkins.
    Reports only failures. If no errors, no tests are reported under memcheck results.

Note: Only memcheck is enabled now. when required, other checks of compute-sanitizer could be enabled later.

@karthikeyann karthikeyann added feature request New feature or request 3 - Ready for Review Ready for review by team gpuCI non-breaking Non-breaking change labels Nov 9, 2021
@karthikeyann karthikeyann requested review from a team as code owners November 9, 2021 22:01
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2021
@davidwendt
Copy link
Contributor

Seems like this should not be going into 21.12 since we are well into burndown.

@codecov
Copy link

codecov bot commented Nov 9, 2021

Codecov Report

Merging #9641 (2a7f20d) into branch-22.02 (967a333) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9641      +/-   ##
================================================
- Coverage         10.49%   10.47%   -0.02%     
================================================
  Files               119      119              
  Lines             20305    20343      +38     
================================================
  Hits               2130     2130              
- Misses            18175    18213      +38     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1811b5...2a7f20d. Read the comment docs.

@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

karthikeyann commented Nov 11, 2021

Preferably merge to 21.12 so that memchecks could run in nightly builds for previous branches as well.
Not enabled in gpuci nightly build yet. So, it's not be enabled in PR CI runs.

@davidwendt
Copy link
Contributor

Preferably merge to 21.12 so that memchecks could run in nightly builds for previous branches as well. Not enabled in gpuci nightly build yet. So, it's not be enabled in PR CI runs.

The concern is even the faint possibly of breaking the build at a time when the build is critical (and build team over engaged) in the release cycle.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nitpicks.

Bigger picture: I was wondering why we're converting to the junit format.

ci/gpu/build.sh Outdated Show resolved Hide resolved
cpp/scripts/compute-sanitizer-to-junit-xml.py Outdated Show resolved Hide resolved
testcase_name = ""
else:
pass
# raise Exception('unexpected line in compute-sanitizer log: '+line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. If none of the above, should the tool pass?
I think I can see why we'd leave this comment in. Might be useful to uncomment for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The log may contain stdout from googletest or pytest which tool will ignore (pytest is not memchecked yet, but will also be too large to add, probably not nightly, may be weekly).
This is added for debug for analyzing only memcheck log. (using --log-file path)

@karthikeyann
Copy link
Contributor Author

karthikeyann commented Nov 15, 2021

Bigger picture: I was wondering why we're converting to the junit format.

so that Jenkins can display it similar to how gtest results are shown. Jenkins Junit plugin expects the results in XML.
https://docs.rapids.ai/maintainers/gpuci/#cigpubuildsh

Besides, compute-sanitizer 11.6 has support for xml which could be used directly for reporting memcheck in future. (still need translation at Jenkins Junit plugin)

Whichever is most convenient method to review memcheck results can be used. Is there any preferred way of report for compute-sanitizer results? (should not limited to memcheck but also leakcheck, racecheck, initcheck too).

@karthikeyann karthikeyann changed the base branch from branch-21.12 to branch-22.02 November 15, 2021 03:36
@karthikeyann
Copy link
Contributor Author

Thank you for the reviews @mythrocks and @davidwendt

@karthikeyann karthikeyann added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 16, 2021
@karthikeyann karthikeyann removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 16, 2021
@karthikeyann
Copy link
Contributor Author

Tested these environment variables in local docker image after simplifying the PR https://github.com/rapidsai/gpuci-scripts/pull/675

@karthikeyann karthikeyann added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 16, 2021
@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

rerun tests

1 similar comment
@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

rerun tests

1 similar comment
@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

@github-actions github-actions bot removed the libcudf Affects libcudf (C++/CUDA) code. label Nov 24, 2021
@karthikeyann karthikeyann removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 24, 2021
Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

@karthikeyann karthikeyann marked this pull request as ready for review November 30, 2021 04:40
@karthikeyann
Copy link
Contributor Author

rerun tests

@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1697f63 into rapidsai:branch-22.02 Nov 30, 2021
rapids-bot bot pushed a commit that referenced this pull request Dec 1, 2021
While working on #9641 I noticed that building the iterator gtests takes alot of time in CI. Here is a link to the individual build times for libcudf including the gtests:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.8/5173/testReport/(root)/BuildTime/
(you can sort by Duration by clicking on table colum header).

Here is a table of the top 20 compile time offenders as recorded on my local machine. Note that like the CI build output, 6 of the top 20 are just building the `ITERATOR_TEST`

| rank | time (ms) | file |
| ---:| ---:|:--- |
|  1 | 814334 | /cudf.dir/src/search/search.cu.o
|  2 | 755375 | /cudf.dir/src/sort/sort_column.cu.o
|  3 | 686235 | /ITERATOR_TEST.dir/iterator/optional_iterator_test_numeric.cu.o
|  4 | 670587 | /cudf.dir/src/groupby/sort/group_nunique.cu.o
|  5 | 585524 | /cudf.dir/src/reductions/scan/scan_inclusive.cu.o
|  6 | 582677 | /ITERATOR_TEST.dir/iterator/pair_iterator_test_numeric.cu.o
|  7 | 568418 | /ITERATOR_TEST.dir/iterator/scalar_iterator_test.cu.o
|  8 | 563196 | /cudf.dir/src/sort/sort.cu.o
|  9 | 548816 | /ITERATOR_TEST.dir/iterator/value_iterator_test_numeric.cu.o
| 10 | 535315 | /cudf.dir/src/groupby/sort/sort_helper.cu.o
| 11 | 531384 | /cudf.dir/src/sort/is_sorted.cu.o
| 12 | 530382 | /ITERATOR_TEST.dir/iterator/value_iterator_test_chrono.cu.o
| 13 | 525187 | /cudf.dir/src/join/semi_join.cu.o
| 14 | 523726 | /cudf.dir/src/rolling/rolling.cu.o
| 15 | 517909 | /cudf.dir/src/reductions/product.cu.o
| 16 | 513119 | /cudf.dir/src/stream_compaction/distinct_count.cu.o
| 17 | 512569 | /ITERATOR_TEST.dir/iterator/optional_iterator_test_chrono.cu.o
| 18 | 508978 | /cudf.dir/src/reductions/sum_of_squares.cu.o
| 19 | 508460 | /cudf.dir/src/lists/drop_list_duplicates.cu.o
| 20 | 505247 | /cudf.dir/src/reductions/sum.cu.o

I made some simple changes to the iterator code logic to use different thrust functions along with a temporary device vector. This approach improved the compile time of the `ITERATOR_TEST` by about 3x. Here are the results of compiling the above 6 files with the changes in this PR.

| new rank | new time (ms) | file |
| ---:| ---:|:--- |
| 59 | 232691 (2.9x) | optional_iterator_test_numeric.cu.o |
| 26 | 416951 (1.4x) | pair_iterator_test_numeric.cu.o |
| 92 | 165947 (3.4x) | scalar_iterator_test.cu.o |
| 65 | 216364 (2.5x) | value_iterator_test_numeric.cu.o |
| 77 | 186583 (2.8x) | value_iterator_test_chrono.cu.o |
| 111 | 137789 (3.7x) | optional_iterator_test_chrono.cu.o |

Total overall build time improved locally by ~3m (10%) using `ninja -j48 install` on a Dell 5820.

Here are the build time results of a CI build with these changes.
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.8/5190/testReport/(root)/BuildTime/

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)

URL: #9788
continue
fi
echo "Running GoogleTest $test_name"
${COMPUTE_SANITIZER_CMD} ${gt} | tee "$WORKSPACE/test-results/${test_name}.cs.log"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this have been the following:

 ${COMPUTE_SANITIZER_CMD} ${gt} --rmm_mode=cuda | tee "$WORKSPACE/test-results/${test_name}.cs.log"

And then you would not need to set and unset the GTEST_CUDF_RMM_MODE environment variable?

Copy link
Contributor Author

@karthikeyann karthikeyann Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use either one for google tests. Initially we wanted to use ctest and cdash to run and report memcheck results. It was not possible to add arguments only for memcheck in ctest. So, we went with environmental variable. Since we want to report to Jenkins, we didn't use ctest/cdash
Hoping to use the environmental variable for both gtests and pytests. (variable name might change)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the cuda async mode for RMM. That will help the tests run faster by using the CUDA pool but still have memcheck support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants