Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable memcheck for jni unit tests #1321

Merged
merged 9 commits into from
Aug 16, 2023

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Aug 8, 2023

closes #1221

compute-saniziter is a tool can detect some GPU memory relevant issues, this PR is to enable it in memory check mode for the JNI unit tests.

Some explanation for the parameters of the Compute Sanitizer.
--log-file should be used to avoid a corrupting output issue from the surefire plugin.
--error-exitcode is used to fail the build process if any error is caught by the Compute Sanitizer.
--launch-timeout is set to 10 minutes, and it should be enough since we monitor only the forked test processes.

Please note it will take much more time to run the tests with the compute sanitizer. There are some numbers for some tests that the elapsed time became quite long.

Test No Sanitizer With Sanitizer
Aggregation128UtilsTest 2.54 s 8.281 s
BinaryOpTest 0.168 s 3.36 s
ColumnVectorTest 0.912 s 21.192 s
ReductionTest 0.104 s 4.757 s
TableTest 2.385 s 124.59 s
ColumnViewNonEmptyNullsTest 2.615 s 8.385 s

ci/nightly-build.sh Outdated Show resolved Hide resolved
@firestarman firestarman requested a review from jlowe August 8, 2023 06:21
@firestarman firestarman changed the title Enable memcheck for jni unit tests Enable memcheck for jni unit tests [skip ci] Aug 8, 2023
@pxLi
Copy link
Collaborator

pxLi commented Aug 8, 2023

some questions about the tooling,
How should we expose the detected issues? Will the script error out if detect any issue like a memory leak?
If yes, do we need to fix them all first before merging to make sure our regular CI won't be blocked instantly after enabling this tool?
If not, how should we notify developers about the issues?

thanks

@firestarman
Copy link
Collaborator Author

firestarman commented Aug 8, 2023

How should we expose the detected issues? Will the script error out if detect any issue like a memory leak?

You can specify where Compute Sanitizer will write all of its text output to by setting the --log-file option. (e.g. export SANITIZER_OPTS="--log-file /tmp/memcheck.log")

By default, Compute Sanitizer will print all output to stdout.

Per my local tests, This tool only records the errors into log, without breaking the running process.

If yes, do we need to fix them all first before merging to make sure our regular CI won't be blocked instantly after enabling this tool?
If not, how should we notify developers about the issues?

@GaryShen2008 may know what we should do.

@jlowe
Copy link
Member

jlowe commented Aug 8, 2023

Per my local tests, This tool only records the errors into log, without breaking the running process.

Don't we want the premerge to fail if it detects an error? If it just records into a log on the side without failing the permerge or nightly build, nobody is going to think to check the log.

ci/nightly-build.sh Outdated Show resolved Hide resolved
@abellina
Copy link
Collaborator

abellina commented Aug 8, 2023

Per my local tests, This tool only records the errors into log, without breaking the running process.

Don't we want the premerge to fail if it detects an error? If it just records into a log on the side without failing the permerge or nightly build, nobody is going to think to check the log.

We could try error-exitcode set to 1 or something that isn't 0. It looks like if the application fails, compute-sanitizer echoes its error code. The error code in this parameter is for the case where the app succeeds (exit code 0), but compute-sanitizer detected an issue. Unfortunately, I do not see a "kill on first failure" option. I am a little afraid about that here since the CI tests run for a long time.

@firestarman firestarman marked this pull request as draft August 9, 2023 02:51
@firestarman
Copy link
Collaborator Author

firestarman commented Aug 9, 2023

Move to draft, since the requirement is quite different than I thought.

@firestarman
Copy link
Collaborator Author

Per my local tests, This tool only records the errors into log, without breaking the running process.

Don't we want the premerge to fail if it detects an error? If it just records into a log on the side without failing the permerge or nightly build, nobody is going to think to check the log.

We could try error-exitcode set to 1 or something that isn't 0. It looks like if the application fails, compute-sanitizer echoes its error code. The error code in this parameter is for the case where the app succeeds (exit code 0), but compute-sanitizer detected an issue. Unfortunately, I do not see a "kill on first failure" option. I am a little afraid about that here since the CI tests run for a long time.

Thx and yes, this option works, setting it to -1 can let sanitizer fail if it detects any error even the application succeeds.

Signed-off-by: Liangcai Li <[email protected]>
Signed-off-by: Liangcai Li <[email protected]>
@firestarman
Copy link
Collaborator Author

Depends on rapidsai/cudf#13872

@firestarman firestarman changed the title Enable memcheck for jni unit tests [skip ci] Enable memcheck for jni unit tests Aug 14, 2023
@firestarman firestarman marked this pull request as ready for review August 14, 2023 07:36
@firestarman firestarman requested review from jlowe, pxLi and abellina August 14, 2023 07:36
@firestarman
Copy link
Collaborator Author

Two follow-ups:
#1336
#1337

@firestarman
Copy link
Collaborator Author

firestarman commented Aug 14, 2023

Please do not triggre premerge until rapidsai/cudf#13872 is merged.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
ci/premerge-build.sh Outdated Show resolved Hide resolved
ci/premerge-build.sh Show resolved Hide resolved
build/sanitizer-java/bin/java Show resolved Hide resolved
build/sanitizer-java/bin/java Show resolved Hide resolved
build/sanitizer-java/bin/java Outdated Show resolved Hide resolved
pom.xml Show resolved Hide resolved
@pxLi
Copy link
Collaborator

pxLi commented Aug 15, 2023

@jlowe @abellina Liangcai helps record the test duration w/ sanitizer in PR desc
Currently, JNI does not have too many cases so the total time may not increase too much. But in the future after we add more cases, the significantly increased duration may affect regular CI and increase costs, do you think we are okay to add it to nightly only (or a dedicated sanitizer pipeline) but not pre-merge CI?

Signed-off-by: Liangcai Li <[email protected]>
@firestarman firestarman requested a review from jlowe August 15, 2023 03:19
@firestarman
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we are okay to add it to nightly only (or a dedicated sanitizer pipeline) but not pre-merge CI?

Yes, that's totally reasonable to me. I'm fine with this going in as-is or we can disable the sanitizer option in the premerge-build.sh before it goes in.

@jlowe
Copy link
Member

jlowe commented Aug 15, 2023

@pxLi note that we should also enable the sanitizer builds on the cudf submodule sync job as well, since we don't want to pull in a cudf change that the sanitizer thinks is broken.

@firestarman this is running the sanitizer on the Java unit test, but I assume it is not running it on the spark-rapids-jni C++ unit tests, correct? If so that can be a followup, since we should be running the sanitizer on those as well.

@firestarman
Copy link
Collaborator Author

firestarman commented Aug 16, 2023

@firestarman this is running the sanitizer on the Java unit test, but I assume it is not running it on the spark-rapids-jni C++ unit tests, correct? If so that can be a followup, since we should be running the sanitizer on those as well.

@jlowe Yeah, this is only for the Java unit tests. BTW, Where can I find the the spark-rapids-jni C++ unit tests ? There is only a json file under https://github.com/NVIDIA/spark-rapids-jni/tree/branch-23.10/src/test/cpp/ .

Copy link
Collaborator

@pxLi pxLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow up on the update to submodule syncs-up CI

@firestarman
Copy link
Collaborator Author

firestarman commented Aug 16, 2023

So I am going to merge this as-is and file a new PR.

@firestarman firestarman merged commit b99cf35 into NVIDIA:branch-23.10 Aug 16, 2023
@firestarman
Copy link
Collaborator Author

I will follow up on the update to submodule syncs-up CI

Here it is #1347

@firestarman firestarman deleted the tests-with-memcheck branch August 16, 2023 03:25
@jlowe
Copy link
Member

jlowe commented Aug 16, 2023

Where can I find the the spark-rapids-jni C++ unit tests ?

Under src/main/cpp/tests.

@firestarman
Copy link
Collaborator Author

Thx for filing #1353

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run compute-sanitizer memcheck on tests
5 participants