Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unit tests failed on arm host when we set "USE_SANITIZER=OFF" #2022

Closed
NvTimLiu opened this issue May 8, 2024 · 7 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@NvTimLiu
Copy link
Collaborator

NvTimLiu commented May 8, 2024

Describe the bug

Unit tests failed on arm host due to out-of-memory error with rmm, when we set the 'USE_SANITIZER' flag to 'OFF'. Below are the details of the failure.

Steps/Code to reproduce bug
1, set: -DUSE_SANITIZER=OFF
2, On the arm host, run below script to compile

scl enable gcc-toolset-11 'mvn -DUSE_GDS=OFF -DBUILD_FAULTINJ=OFF -Dcuda.version=cuda11 -DUSE_SANITIZER=OFF clean package'

or

bash build/build-in-docker clean package

Expected behavior
All Unit tests can PASS

Environment details (please complete the following information)

  • run maven compile on the arm64 host

Additional context
To follow up : #2010 (comment)

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running ai.rapids.cudf.Aggregation128UtilsTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.802 s - in ai.rapids.cudf.Aggregation128UtilsTest
[INFO] Running ai.rapids.cudf.ArrowColumnVectorTest
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.19 s - in ai.rapids.cudf.ArrowColumnVectorTest
[INFO] Running ai.rapids.cudf.BinaryOpTest
[INFO] Tests run: 45, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 s - in ai.rapids.cudf.BinaryOpTest
[INFO] Running ai.rapids.cudf.ByteColumnVectorTest
[INFO] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.192 s - in ai.rapids.cudf.ByteColumnVectorTest
[INFO] Running ai.rapids.cudf.ColumnVectorTest
[WARNING] Tests run: 309, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 0.762 s - in ai.rapids.cudf.ColumnVectorTest
[INFO] Running ai.rapids.cudf.CudaTest
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.054 s <<< FAILURE! - in ai.rapids.cudf.CudaTest
[ERROR] testCudaException  Time elapsed: 0.05 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: Expected ai.rapids.cudf.CudaException to be thrown, but nothing was thrown.
        at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:71)
        at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:37)
        at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3082)
        at ai.rapids.cudf.CudaTest.testCudaException(CudaTest.java:40)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
        at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
        at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
        at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
        at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
        at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
        at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
        at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
        at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
        at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
        at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:214)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:210)
        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:135)
        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:66)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
        at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
        at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
        at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
        at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
        at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
        at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
        at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
        at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
        at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
        at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
        at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
        at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
        at org.junit.platform.surefire.provider.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:155)
        at org.junit.platform.surefire.provider.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:134)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:383)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:344)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:417)

[INFO] Running ai.rapids.cudf.Date32ColumnVectorTest
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   CudaTest.testCudaException:40 Expected ai.rapids.cudf.CudaException to be thrown, but nothing was thrown.
[INFO] 
[ERROR] Tests run: 381, Failures: 1, Errors: 0, Skipped: 2
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 28.515 s
[INFO] Finished at: 2024-05-08T05:13:31Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on project spark-rapids-jni: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/nvidia/timl/spark-rapids-jni/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR]         at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:658)
[ERROR]         at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:533)
[ERROR]         at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:278)
[ERROR]         at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:244)
[ERROR]         at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1194)
[ERROR]         at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1022)
[ERROR]         at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:868)
[ERROR]         at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]         at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR]         at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR]         at org.apache.maven.cli.MavenCli.execute(MavenCli.java:954)
[ERROR]         at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
[ERROR]         at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR]         at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
[ERROR] 
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
@NvTimLiu NvTimLiu added bug Something isn't working ? - Needs Triage labels May 8, 2024
@jlowe
Copy link
Member

jlowe commented May 8, 2024

@NvTimLiu we've been doing arm64 builds on rocky8 for a while now, before the changes proposed in #2010. How were we getting past test failures then? Were they not failing? If so, I wonder what's different about the changes in #2010 vs. what's in Dockerfile.multi that's significant. Or have the arm64 tests always been failing without the sanitizer?

@NvTimLiu
Copy link
Collaborator Author

NvTimLiu commented May 9, 2024

@NvTimLiu we've been doing arm64 builds on rocky8 for a while now, before the changes proposed in #2010. How were we getting past test failures then? Were they not failing? If so, I wonder what's different about the changes in #2010 vs. what's in Dockerfile.multi that's significant. Or have the arm64 tests always been failing without the sanitizer?

How were we getting past test failures then? Were they not failing?
--> We get tests PASS as nightly build for arm64 : https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.06/ci/nightly-build.sh#L36-L38

So we get the nightly build/tests PASS for arm64 with

  USE_GDS="OFF"
  USE_SANITIZER="ON"
  BUILD_FAULTINJ="OFF"

@res-life
Copy link
Collaborator

res-life commented May 9, 2024

USE_SANITIZER="ON" will skip this test testCudaException because this test has annotation @Tag("noSanitizer")

@res-life res-life closed this as completed May 9, 2024
@res-life res-life reopened this May 9, 2024
@GaryShen2008
Copy link
Collaborator

GaryShen2008 commented May 9, 2024

Related to rapidsai/cudf#15705. @sperlingxx can you explain more details about the purpose of the test case? So that we can understand better about this issue and its fixing.

@sperlingxx
Copy link
Collaborator

sperlingxx commented May 9, 2024

This case is used to test if cuDF is able to capture CUDA error information and wrap them with corresponding Java Exception wrapper (CudaFatalException or CudaException; and whether CudaErrorCode is matched or not).

Currently, we use cudaMemsetAsync with an extreme large memory address (Cuda.memset(Long.MAX_VALUE, (byte) 0, 1024)) to trigger nonFatal CUDA error cudaErrorInvalidValue(1), which should be captured as CudaException instead of CudaFatalException. However, cudaMemsetAsync with an extreme large memory address might trigger Fatal CUDA error cudaErrorIllegalAddress(700) under Aarch64.

The PR rapidsai/cudf#15706 tries to fix this problem via replacing with cudaFree a negative memory address, which has been verified under grace-hopper environment.

@sperlingxx sperlingxx self-assigned this May 9, 2024
@sameerz
Copy link
Collaborator

sameerz commented May 13, 2024

The PR rapidsai/cudf#15706 tries to fix this problem via replacing with cudaFree a negative memory address, which has been verified under grace-hopper environment.

Can this issue be closed if the above PR is merged?

@GaryShen2008
Copy link
Collaborator

Yes, we can close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants