
[BUG] nightly ai.rapids.cudf.ReductionTest failed in cuda12 ENV after enable sanitizer #1349

Open
pxLi opened this issue Aug 16, 2023 · 12 comments
Labels: bug, test


pxLi commented Aug 16, 2023

Describe the bug
The same tests pass with the sanitizer enabled on CUDA 11, but fail on CUDA 12.

pipeline: spark-rapids-jni_nightly-dev, build ID:512

attached sanitizer log: sanitizer_for_pid_20785.log

[2023-08-16T04:25:35.253Z] [INFO] -------------------------------------------------------
[2023-08-16T04:25:35.253Z] [INFO]  T E S T S
[2023-08-16T04:25:35.253Z] [INFO] -------------------------------------------------------
[2023-08-16T04:25:37.150Z] [INFO] Running ai.rapids.cudf.Aggregation128UtilsTest
[2023-08-16T04:25:49.342Z] [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.72 s - in ai.rapids.cudf.Aggregation128UtilsTest
[2023-08-16T04:25:49.342Z] [INFO] Running ai.rapids.cudf.ArrowColumnVectorTest
[2023-08-16T04:25:49.342Z] [INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.245 s - in ai.rapids.cudf.ArrowColumnVectorTest
[2023-08-16T04:25:49.342Z] [INFO] Running ai.rapids.cudf.BinaryOpTest
[2023-08-16T04:25:54.601Z] [INFO] Tests run: 45, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.746 s - in ai.rapids.cudf.BinaryOpTest
[2023-08-16T04:25:54.601Z] [INFO] Running ai.rapids.cudf.ByteColumnVectorTest
[2023-08-16T04:25:54.860Z] [INFO] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.882 s - in ai.rapids.cudf.ByteColumnVectorTest
[2023-08-16T04:25:54.860Z] [INFO] Running ai.rapids.cudf.ColumnVectorTest
[2023-08-16T04:26:41.506Z] [WARNING] Tests run: 316, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 41.466 s - in ai.rapids.cudf.ColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.CudaTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in ai.rapids.cudf.CudaTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.Date32ColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 s - in ai.rapids.cudf.Date32ColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.Date64ColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 s - in ai.rapids.cudf.Date64ColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.DecimalColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.127 s - in ai.rapids.cudf.DecimalColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.DoubleColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.62 s - in ai.rapids.cudf.DoubleColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.FloatColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.602 s - in ai.rapids.cudf.FloatColumnVectorTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.GatherMapTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.238 s - in ai.rapids.cudf.GatherMapTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.HashJoinTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.037 s - in ai.rapids.cudf.HashJoinTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.HostMemoryBufferTest
[2023-08-16T04:26:41.507Z] [WARNING] Tests run: 14, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.144 s - in ai.rapids.cudf.HostMemoryBufferTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.IfElseTest
[2023-08-16T04:26:41.507Z] [INFO] Tests run: 110, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.336 s - in ai.rapids.cudf.IfElseTest
[2023-08-16T04:26:41.507Z] [INFO] Running ai.rapids.cudf.IntColumnVectorTest
[2023-08-16T04:26:42.440Z] [INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.615 s - in ai.rapids.cudf.IntColumnVectorTest
[2023-08-16T04:26:42.440Z] [INFO] Running ai.rapids.cudf.LongColumnVectorTest
[2023-08-16T04:26:43.006Z] [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.613 s - in ai.rapids.cudf.LongColumnVectorTest
[2023-08-16T04:26:43.006Z] [INFO] Running ai.rapids.cudf.MemoryBufferTest
[2023-08-16T04:26:43.006Z] [INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in ai.rapids.cudf.MemoryBufferTest
[2023-08-16T04:26:43.006Z] [INFO] Running ai.rapids.cudf.NvtxTest
[2023-08-16T04:26:43.006Z] [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in ai.rapids.cudf.NvtxTest
[2023-08-16T04:26:43.006Z] [INFO] Running ai.rapids.cudf.PinnedMemoryPoolTest
[2023-08-16T04:26:43.265Z] [INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.342 s - in ai.rapids.cudf.PinnedMemoryPoolTest
[2023-08-16T04:26:43.265Z] [INFO] Running ai.rapids.cudf.ReductionTest
[2023-08-16T04:26:51.378Z] [ERROR] Tests run: 130, Failures: 0, Errors: 90, Skipped: 0, Time elapsed: 8.092 s <<< FAILURE! - in ai.rapids.cudf.ReductionTest
[2023-08-16T04:26:51.378Z] [ERROR] testShort{ReductionAggregation, Short[], DataType, Object, Double}[4]  Time elapsed: 3.761 s  <<< ERROR!
[2023-08-16T04:26:51.378Z] ai.rapids.cudf.CudaFatalException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/device_uvector.hpp:316: cudaErrorLaunchFailure unspecified launch failure
[2023-08-16T04:26:51.378Z] ai.rapids.cudf.CudaFatalException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/device_uvector.hpp:316: cudaErrorLaunchFailure unspecified launch failure
[2023-08-16T04:26:51.378Z] 	at ai.rapids.cudf.Scalar.isScalarValid(Native Method)
[2023-08-16T04:26:51.378Z] 	at ai.rapids.cudf.Scalar.isValid(Scalar.java:568)
[2023-08-16T04:26:51.378Z] 	at ai.rapids.cudf.Scalar.equals(Scalar.java:707)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.api.AssertionUtils.objectsAreEqual(AssertionUtils.java:193)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:181)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
[2023-08-16T04:26:51.378Z] 	at ai.rapids.cudf.ReductionTest.assertEqualsDelta(ReductionTest.java:413)
[2023-08-16T04:26:51.378Z] 	at ai.rapids.cudf.ReductionTest.testShort(ReductionTest.java:462)
[2023-08-16T04:26:51.378Z] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2023-08-16T04:26:51.378Z] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[2023-08-16T04:26:51.378Z] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2023-08-16T04:26:51.378Z] 	at java.lang.reflect.Method.invoke(Method.java:498)
[2023-08-16T04:26:51.378Z] 	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:214)
[2023-08-16T04:26:51.378Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:210)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:135)
[2023-08-16T04:26:51.378Z] 	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:66)
[2023-08-16T04:26:51.378Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
[2023-08-16T04:26:51.378Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask$DefaultDynamicTestExecutor.execute(NodeTestTask.java:226)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask$DefaultDynamicTestExecutor.execute(NodeTestTask.java:204)
[2023-08-16T04:26:51.379Z] 	at org.junit.jupiter.engine.descriptor.TestTemplateTestDescriptor.execute(TestTemplateTestDescriptor.java:139)
[2023-08-16T04:26:51.379Z] 	at org.junit.jupiter.engine.descriptor.TestTemplateTestDescriptor.lambda$execute$2(TestTemplateTestDescriptor.java:107)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2023-08-16T04:26:51.379Z] 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
[2023-08-16T04:26:51.379Z] 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
[2023-08-16T04:26:51.379Z] 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
[2023-08-16T04:26:51.379Z] 	at org.junit.jupiter.engine.descriptor.TestTemplateTestDescriptor.execute(TestTemplateTestDescriptor.java:107)
[2023-08-16T04:26:51.379Z] 	at org.junit.jupiter.engine.descriptor.TestTemplateTestDescriptor.execute(TestTemplateTestDescriptor.java:42)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
[2023-08-16T04:26:51.379Z] 	at java.util.ArrayList.forEach(ArrayList.java:1259)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
[2023-08-16T04:26:51.379Z] 	at java.util.ArrayList.forEach(ArrayList.java:1259)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
[2023-08-16T04:26:51.379Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:155)
[2023-08-16T04:26:51.380Z] 	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:134)
[2023-08-16T04:26:51.380Z] 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:383)
[2023-08-16T04:26:51.380Z] 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:344)
[2023-08-16T04:26:51.380Z] 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
[2023-08-16T04:26:51.380Z] 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:417)


pxLi commented Aug 16, 2023

Summary:

[2023-08-16T04:26:52.487Z] [ERROR] Tests run: 729, Failures: 0, Errors: 90, Skipped: 3
[2023-08-16T04:26:52.487Z] [INFO] 
[2023-08-16T04:26:52.487Z] [INFO] ------------------------------------------------------------------------
[2023-08-16T04:26:52.487Z] [INFO] BUILD FAILURE
[2023-08-16T04:26:52.487Z] [INFO] ------------------------------------------------------------------------
[2023-08-16T04:26:52.487Z] [INFO] Total time: 54:55.118s
[2023-08-16T04:26:52.487Z] [INFO] Finished at: Wed Aug 16 04:26:52 UTC 2023
[2023-08-16T04:26:52.487Z] [INFO] Final Memory: 43M/1324M
[2023-08-16T04:26:52.487Z] [INFO] ------------------------------------------------------------------------
[2023-08-16T04:26:52.487Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on project spark-rapids-jni: There are test failures.
[2023-08-16T04:26:52.487Z] [ERROR] 
[2023-08-16T04:26:52.487Z] [ERROR] Please refer to /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/target/surefire-reports for the individual test results.
[2023-08-16T04:26:52.487Z] [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.
[2023-08-16T04:26:52.487Z] [ERROR] There was an error in the forked process
[2023-08-16T04:26:52.487Z] [ERROR] Error occurred in starting fork, check output in log
[2023-08-16T04:26:52.487Z] [ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded
[2023-08-16T04:26:52.487Z] [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[2023-08-16T04:26:52.487Z] [ERROR] Error occurred in starting fork, check output in log
[2023-08-16T04:26:52.487Z] [ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded

And the error dump shows `Maximum pool size exceeded`:

# Created at 2023-08-16T04:26:51.225
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded
	at ai.rapids.cudf.Rmm.newPoolMemoryResource(Native Method)
	at ai.rapids.cudf.RmmPoolMemoryResource.<init>(RmmPoolMemoryResource.java:39)
	at ai.rapids.cudf.Rmm.initialize(Rmm.java:238)
	at ai.rapids.cudf.CudfTestBase.beforeEach(CudfTestBase.java:46)
	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
	at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptLifecycleMethod(TimeoutExtension.java:126)


firestarman commented Aug 16, 2023

A soft reminder: if it is not easy to fix in a short time, you can tag the failing tests with `noSanitizer` as a workaround (WAR) to exclude them from running with the sanitizer.

@Tag("noSanitizer")
class ReductionTest extends CudfTestBase {
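At the launcher level, a tag filter simply drops test classes carrying the excluded tag. Below is a self-contained plain-Java sketch of that mechanism; note that the `Tag` annotation here is defined locally for illustration only (the real one is `org.junit.jupiter.api.Tag`, and the real exclusion is typically wired up via the JUnit Platform or Surefire's `excludedGroups`):

```java
import java.lang.annotation.*;
import java.util.*;
import java.util.stream.*;

// Simplified stand-in for org.junit.jupiter.api.Tag, defined locally
// so this sketch compiles without JUnit on the classpath.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Tag { String value(); }

@Tag("noSanitizer")
class ReductionTestStub {}

class OtherTestStub {}

public class TagFilterSketch {
    // Keep only classes NOT carrying the excluded tag, mirroring how a
    // launcher-level tag filter skips @Tag("noSanitizer") classes.
    static List<Class<?>> excludeTag(List<Class<?>> classes, String tag) {
        return classes.stream()
            .filter(c -> {
                Tag t = c.getAnnotation(Tag.class);
                return t == null || !t.value().equals(tag);
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Class<?>> kept = excludeTag(
            List.of(ReductionTestStub.class, OtherTestStub.class), "noSanitizer");
        System.out.println(kept); // only OtherTestStub remains
    }
}
```

The class names here are hypothetical stubs; in the real build the tag sits on `ReductionTest` itself and the filter runs inside Surefire's JUnit Platform provider.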

jlowe commented Aug 16, 2023

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-512-cuda12/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded

This implies either we're not leaving enough memory reserved for the sanitizer to run (e.g.: using ARENA allocator somehow) or we're somehow sharing a GPU and there isn't enough free memory with the sanitizer present to run.

pxLi commented Aug 17, 2023

Confirmed this is always reproducible with CUDA 12 (CUDA toolkit 12.0.1 + driver 525.58) in our nightly CI. It looks like either a memory leak in the tests, or the CUDA 12 sanitizer itself requires more memory.

Please help check whether there is a quick fix (e.g., reduce the GPU memory cost of specific cases, or adjust the RMM pool size in the tests), or just tag them noSanitizer for now.

@res-life res-life self-assigned this Aug 17, 2023
jlowe commented Aug 17, 2023

I tried bumping up the memory pool but it still fails with an unspecified launch error. I'll post a cudf PR to add the noSanitizer tag to the failing tests.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Aug 18, 2023
… CUDA 12 (#13904)

Relates to NVIDIA/spark-rapids-jni#1349.  The Java ReductionTest unit tests are failing when run under CUDA 12's compute-sanitizer but pass when run with the CUDA 11 version.  To unblock CI, marking the affected tests to be run without the sanitizer in the interim while this is being investigated.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)

URL: #13904
bdice commented Aug 21, 2023

Can you confirm what CCCL version was used in these builds? Is it the CCCL shipped with the CTK, or the CCCL (thrust/cub/libcudacxx) pinned by rapids-cmake?

pxLi commented Aug 22, 2023

Can you confirm what CCCL version was used in these builds? Is it the CCCL shipped with the CTK, or the CCCL (thrust/cub/libcudacxx) pinned by rapids-cmake?

hi @bdice, the CTK in our CI came from the official nvidia/cuda:12.0.1-devel-centos7 image:

The CUDA compiler identification is NVIDIA 12.0.140
Found CUDAToolkit: /usr/local/cuda/include (found version "12.0.140") 

And the libs (thrust/cub/libcudacxx) are pulled by rapids-cmake while building cudf:

[2023-08-17T03:31:52.132Z] [INFO]      [exec] -- CPM: adding package [email protected] (1.17.2)
[2023-08-17T03:32:06.977Z] [INFO]      [exec] CMake Deprecation Warning at build/_deps/thrust-src/CMakeLists.txt:9 (cmake_policy):
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   The OLD behavior for policy CMP0104 will be removed from a future version
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   of CMake.
[2023-08-17T03:32:06.977Z] [INFO]      [exec] 
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   The cmake-policies(7) manual explains that the OLD behaviors of all
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   policies are deprecated and that a policy should be set to OLD only under
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   specific short-term circumstances.  Projects should be ported to the NEW
[2023-08-17T03:32:06.977Z] [INFO]      [exec]   behavior and not rely on setting a policy to OLD.
[2023-08-17T03:32:06.977Z] [INFO]      [exec] 
[2023-08-17T03:32:06.977Z] [INFO]      [exec] 
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- Found Thrust: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-513-cuda12/thirdparty/cudf/cpp/build/_deps/thrust-src/thrust/cmake/thrust-config.cmake (found version "1.17.2.0") 
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff install_rules.diff to fix issue: 'Thrust 1.X installs incorrect files [https://github.com/NVIDIA/thrust/issues/1790]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff thrust_transform_iter_with_reduce_by_key.diff to fix issue: 'Support transform_output_iterator as output of reduce by key [https://github.com/NVIDIA/thrust/pull/1805]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff thrust_disable_64bit_dispatching.diff to fix issue: 'Remove 64bit dispatching as not needed by libcudf and results in compiling twice as many kernels [https://github.com/rapidsai/cudf/pull/11437]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff thrust_faster_sort_compile_times.diff to fix issue: 'Improve Thrust sort compile times by not unrolling loops for inlined comparators [https://github.com/rapidsai/cudf/pull/10577]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff thrust_faster_scan_compile_times.diff to fix issue: 'Improve Thrust scan compile times by reducing the number of kernels generated [https://github.com/rapidsai/cudf/pull/8183]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- rapids-cmake [Thrust]: applied diff cub_segmented_sort_with_bool_key.diff to fix issue: 'Fix an error in CUB DeviceSegmentedSort when the keys are bool type [https://github.com/NVIDIA/cub/issues/594]'
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- Found CUB: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-513-cuda12/thirdparty/cudf/cpp/build/_deps/thrust-src/dependencies/cub/cub/cmake/cub-config.cmake (found suitable version "1.17.2.0", minimum required is "1.17.2.0") 
[2023-08-17T03:32:07.236Z] [INFO]      [exec] -- CPM: adding package [email protected] (branch-23.10)
...
[2023-08-17T03:33:29.627Z] [INFO]      [exec] -- CPM: adding package [email protected] (v0.5)
[2023-08-17T03:33:31.520Z] [INFO]      [exec] -- CPM: adding package libcudacxx@1.9.1 (branch/1.9.1)
[2023-08-17T03:34:39.147Z] [INFO]      [exec] -- Found libcudacxx: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-513-cuda12/thirdparty/cudf/cpp/build/_deps/libcudacxx-src/lib/cmake/libcudacxx/libcudacxx-config.cmake (found version "1.9.1.0") 
[2023-08-17T03:34:39.148Z] [INFO]      [exec] -- rapids-cmake [libcudacxx]: applied diff install_rules.diff to fix issue: 'libcudacxx 1.X installs incorrect files [https://github.com/NVIDIA/libcudacxx/pull/428]'
[2023-08-17T03:34:39.148Z] [INFO]      [exec] -- rapids-cmake [libcudacxx]: applied diff reroot_support.diff to fix issue: 'Support conda-forge usage of CMake rerooting [https://github.com/NVIDIA/libcudacxx/pull/490]'

@res-life
Collaborator

res-life commented Sep 21, 2023

Reproduced the sanitizer error on CUDA 12 with customized code. The ReductionTest.testShort test case causes the problem, and the error it triggers then causes other unexpected failures such as OOM.

If I disable ReductionTest.testShort under the sanitizer, all other cases in ReductionTest pass.

The sanitizer error is:

========= COMPUTE-SANITIZER
========= Invalid __shared__ read of size 16 bytes
=========     at 0x4530 in void cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReducePolicy<short, short, int, cudf::DeviceMin>::Policy600, thrust::transform_iterator<cudf::null_replacing_transformer<short, thrust::identity<short>>, thrust::transform_iterator<cudf::detail::pair_accessor<short, (bool)1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, int, cudf::DeviceMin, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2d18f2]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame: [0x312d6bb]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame: [0x316cd1b]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame: [0x1f2b5f6]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda()#2}::operator()() const [0x1f3af70]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:std::unique_ptr<cudf::scalar, std::default_delete<cudf::scalar> > cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f3b810]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:cudf::reduction::detail::min(cudf::column_view const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f14a86]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:decltype(auto) cudf::detail::aggregation_dispatcher<cudf::reduction::detail::reduce_dispatch_functor, cudf::reduce_aggregation const&>(cudf::aggregation::Kind, cudf::reduction::detail::reduce_dispatch_functor&&, cudf::reduce_aggregation const&) [0x1fe7b7a]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:cudf::reduction::detail::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1fe2ac2]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:cudf::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, rmm::mr::device_memory_resource*) [0x1fe323e]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_reduce [0x11036b5]
=========                in /tmp/cudf8544281732759968662.so
=========     Host Frame: [0x147a0e407]
=========                in 

The error above shows:

Invalid __shared__ read of size 16 bytes
Address 0x8 is misaligned

It seems cudf::reduction::simple::detail::simple_reduction has an alignment-related issue when reading short-type values.
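For readers unfamiliar with the misalignment message: a vectorized 16-byte load is only legal at an address that is a multiple of 16, and 0x8 is only 8-byte aligned. A minimal standalone sketch of that check (illustrative only, not cudf or CUB code):

```java
public class AlignmentDemo {
    // True when addr can legally serve an accessSize-byte vectorized load,
    // i.e. the address is a multiple of the access size.
    static boolean isAligned(long addr, int accessSize) {
        return addr % accessSize == 0;
    }

    public static void main(String[] args) {
        // The sanitizer reported "Invalid __shared__ read of size 16 bytes"
        // at address 0x8: a 16-byte load needs a 16-byte-aligned address.
        long addr = 0x8L;
        int accessSize = 16;
        System.out.println("address 0x" + Long.toHexString(addr) + " is "
            + (isAligned(addr, accessSize) ? "aligned" : "misaligned")
            + " for a " + accessSize + "-byte access");
    }
}
```

Running it prints that address 0x8 is misaligned for a 16-byte access, which matches what the sanitizer flagged in the shared-memory read.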

@bdice Do you have time to take a look?

@firestarman
Collaborator

@res-life It would be good if you could share the repro case and code.

@res-life
Collaborator

My reproduction steps are:

compute-sanitizer --tool memcheck \
    --launch-timeout 600 \
    --error-exitcode -2 \
    --log-file "./sanitizer_for_pid_%p.log" \
    java -cp test-compute-sanitizer-1.0.jar:cudf-23.10.0-20230907.133820-30-cuda12.jar:slf4j-api-1.7.32.jar org.example.Main

The test-compute-sanitizer-1.0.jar is compiled from the following test:

  void testShort(ReductionAggregation op, Short[] values,
      HostColumnVector.DataType type, Object expectedObject, Double delta) {
    try (Scalar expected = buildExpectedScalar(op, type, expectedObject);
         ColumnVector v = ColumnVector.fromBoxedShorts(values);
         Scalar result = v.reduce(op, expected.getType())) {
      assertEqualsDelta(op, expected, result, delta);
    }
  }
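For context, the tolerance comparison the test relies on can be sketched roughly as follows. assertEqualsDelta, buildExpectedScalar, and the cudf types come from the spark-rapids-jni test code; this standalone sketch only illustrates the delta check itself (exact equality when delta is null, otherwise a tolerance window), not the actual cudf calls:

```java
public class DeltaCheck {
    // Minimal sketch of a delta-based equality check, assuming the real
    // assertEqualsDelta compares values within an optional tolerance.
    static boolean equalsDelta(double expected, double actual, Double delta) {
        if (delta == null) {
            return expected == actual;  // exact comparison when no delta given
        }
        return Math.abs(expected - actual) <= delta;
    }

    public static void main(String[] args) {
        System.out.println(equalsDelta(1.0, 1.0, null));
        System.out.println(equalsDelta(1.0, 1.05, 0.1));
        System.out.println(equalsDelta(1.0, 1.5, 0.1));
    }
}
```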

You can get the test-compute-sanitizer-1.0.jar from:
sanitizer-cuda-12-error.zip

@bdice
Contributor

bdice commented Sep 25, 2023

I have asked the cuDF team for help investigating here since I may not have enough time to look at this during 23.10 burndown. If you can create a pure C++ reproducer and file a PR to libcudf with the failing test, that would be great.

@res-life
Copy link
Collaborator

Reproduced with C++ code.
This issue depends on a cuDF issue; refer to rapidsai/cudf#14192

Labels
bug Something isn't working test

5 participants