[BUG] udf-examples-native case failed core dump #11842

Closed
pxLi opened this issue Dec 9, 2024 · 5 comments
Labels: bug (Something isn't working)

@pxLi
Collaborator

pxLi commented Dec 9, 2024

Describe the bug
first seen in examples-udf-examples-native run:179
https://github.com/NVIDIA/spark-rapids-examples/tree/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs

17:42:00  24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator build: Map(url -> https://github.com/NVIDIA/spark-rapids.git, branch -> HEAD, 
revision -> fb2f72df881582855393135d6e574111716ec7bb, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T10:18:05Z, cudf_version -> 24.12.0-SNAPSHOT, user -> root)
17:42:00  24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: Map(url -> https://github.com/NVIDIA/spark-rapids-jni.git, branch -> HEAD, gpu_architectures -> 70;75;80;86;90, 
revision -> 7842da04bd6486f2389c441f0e1aa094c5eef469, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T05:18:03Z, user -> root)
17:42:00  24/12/09 09:42:00 INFO RapidsPluginUtils: cudf build: Map(url -> https://github.com/rapidsai/cudf.git, branch -> HEAD, gpu_architectures -> 70;75;80;86;90, 
revision -> 439321edb43082fb75f195b6be2049c925279089, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T05:17:59Z, user -> root)
17:42:00  24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator Private Map(url -> https://gitlab-master.nvidia.com/nvspark/spark-rapids-private.git, branch -> HEAD, 
revision -> 2f08e20170b66621d1f14ee0fb351ef5630ea811, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T03:33:41Z, user -> root)
...
...
...
17:42:11  ============================= test session starts ==============================
17:42:11  platform linux -- Python 3.10.16, pytest-7.4.4, pluggy-1.5.0 -- /opt/conda/bin/python3
17:42:11  cachedir: .pytest_cache
17:42:11  rootdir: /home/jenkins/agent/workspace/jenkins-examples-udf-examples-native-179/examples/UDF-Examples/RAPIDS-accelerated-UDFs
17:42:11  configfile: pytest.ini
17:42:11  plugins: order-1.3.0, xdist-3.6.1
17:42:11  collecting ... collected 8 items
17:42:11  
17:42:16  src/main/python/rapids_udf_test.py::test_hive_simple_udf 24/12/09 09:42:16 WARN GpuOverrides: 
17:42:16    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:16      @Expression <AttributeReference> i#7 could run on GPU
17:42:16      @Expression <AttributeReference> s#8 could run on GPU
17:42:16  
17:42:17  PASSED          [ 12%]
17:42:17  src/main/python/rapids_udf_test.py::test_hive_generic_udf 24/12/09 09:42:17 WARN GpuOverrides: 
17:42:17    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:17      @Expression <AttributeReference> s#18 could run on GPU
17:42:17  
17:42:18  24/12/09 09:42:17 WARN GpuOverrides: 
17:42:18    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:18      @Expression <AttributeReference> dec#26 could run on GPU
17:42:18  
17:42:18  PASSED         [ 25%]
17:42:18  src/main/python/rapids_udf_test.py::test_hive_simple_udf_native 24/12/09 09:42:18 WARN GpuOverrides: 
17:42:18    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:18      @Expression <AttributeReference> s#34 could run on GPU
17:42:18  
17:42:18  #
17:42:18  # A fatal error has been detected by the Java Runtime Environment:
17:42:18  #
17:42:18  #  SIGSEGV (0xb) at pc=0x00007f02da529598, pid=177, tid=0x00007f02c9bff700
17:42:18  #
17:42:18  # JRE version: OpenJDK Runtime Environment (8.0_432) (build 1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga)
17:42:18  # Java VM: OpenJDK 64-Bit Server VM (25.432-bga mixed mode linux-amd64 compressed oops)
17:42:18  # Problematic frame:
17:42:18  # C  [libcuda.so.1+0x186598]

core dump: (complete file hs_err_pid177.log)

---------------  T H R E A D  ---------------

Current thread (0x00007f0244029000):  JavaThread "Executor task launch worker for task 0.0 in stage 7.0 (TID 7)" daemon [_thread_in_native, id=361, stack(0x00007f02c9aff000,0x00007f02c9c00000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x00007f01a73a3f10

Registers:
RAX=0x00007f01a73a3f40, RBX=0x0000000000000001, RCX=0x00007f02ca600100, RDX=0x0000000000000100
RSP=0x00007f02c9bfa8d8, RBP=0x00007f02c9bfa8e0, RSI=0x00007f02ca600040, RDI=0x00007f01a73a3f00
R8 =0x0000000000000100, R9 =0x0000000000000000, R10=0x0000000000000004, R11=0x0000000000000008
R12=0x0000000000000100, R13=0x00007f02ca600000, R14=0x00007f01a73a3f00, R15=0x0000000000000001
RIP=0x00007f02da529598, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000007
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f02c9bfa8d8)
0x00007f02c9bfa8d8:   0000000000000001 00007f02c9bfa970
0x00007f02c9bfa8e8:   00007f02da61ab5a 00007f02c9bfa940
0x00007f02c9bfa8f8:   00007f02c9bfb3a8 0000000000000100
0x00007f02c9bfa908:   0000000000000100 00007f01a73a3f00
0x00007f02c9bfa918:   00007f02ca600000 00007f02c9bfaee0
0x00007f02c9bfa928:   0000000000000000 0000000000000100
0x00007f02c9bfa938:   0000000000000100 00007f02c9bfa9a0
0x00007f02c9bfa948:   00007f0130d3a280 00007f02c9bfb3a8
0x00007f02c9bfa958:   0000000000000000 00007f02c9bfaee0
0x00007f02c9bfa968:   00007f02c9bfaee0 00007f02c9bfa9a0
0x00007f02c9bfa978:   00007f02da76483b 0000000000000000
0x00007f02c9bfa988:   00007f0130d3a280 00007f02c9bfb3a8
0x00007f02c9bfa998:   0000000000000002 00007f02c9bfb2c0
0x00007f02c9bfa9a8:   00007f02da765101 0000000000000001
0x00007f02c9bfa9b8:   00007f02c9bfaee0 0000000200000000
0x00007f02c9bfa9c8:   00007f02c9bfaa50 0000000000000001
0x00007f02c9bfa9d8:   0000000000000000 00007f0259059fc0
0x00007f02c9bfa9e8:   00007f02c9bfb3a8 00007f0130d3a4d8
0x00007f02c9bfa9f8:   0000000000000000 0000000000000001
0x00007f02c9bfaa08:   00007f02c9bfab20 0000000000000001
0x00007f02c9bfaa18:   0000000000000001 0000000000000001
0x00007f02c9bfaa28:   0000000000000100 0000000000000100
0x00007f02c9bfaa38:   00007f0259057b00 00007f0259057b00
0x00007f02c9bfaa48:   0000000000000000 0000000000000001
0x00007f02c9bfaa58:   0000000000000000 0000000000000000
0x00007f02c9bfaa68:   0000000000000000 00007f0259059160
0x00007f02c9bfaa78:   0000000000000000 0000000000000100
0x00007f02c9bfaa88:   0000000000000001 0000000000000000
0x00007f02c9bfaa98:   0000000000000000 0000000000000000
0x00007f02c9bfaaa8:   0000000000000000 0000000000000000
0x00007f02c9bfaab8:   0000000000000000 0000000000000000
0x00007f02c9bfaac8:   0000000000000000 0000000000000000 

Instructions: (pc=0x00007f02da529598)
0x00007f02da529578:   f8 48 8d 0c 16 0f 1f 00 0f 28 56 10 0f 28 4e 20
0x00007f02da529588:   48 83 c6 40 48 83 c0 40 0f 28 46 f0 0f 28 66 c0
0x00007f02da529598:   0f 2b 50 d0 0f 2b 60 c0 0f 2b 48 e0 0f 2b 40 f0
0x00007f02da5295a8:   48 39 ce 75 d3 48 01 fa 41 83 e0 3f 49 83 f8 0f 

Register to memory mapping:

RAX=0x00007f01a73a3f40: <offset 0x2a57af40> in /tmp/cudf760737925635621984.so at 0x00007f017ce29000
RBX=0x0000000000000001 is an unknown value
RCX=0x00007f02ca600100 is an unknown value
RDX=0x0000000000000100 is an unknown value
RSP=0x00007f02c9bfa8d8 is pointing into the stack for thread: 0x00007f0244029000
RBP=0x00007f02c9bfa8e0 is pointing into the stack for thread: 0x00007f0244029000
RSI=0x00007f02ca600040 is an unknown value
RDI=0x00007f01a73a3f00: <offset 0x2a57af00> in /tmp/cudf760737925635621984.so at 0x00007f017ce29000
R8 =0x0000000000000100 is an unknown value
R9 =0x0000000000000000 is an unknown value
R10=0x0000000000000004 is an unknown value
R11=0x0000000000000008 is an unknown value
R12=0x0000000000000100 is an unknown value
R13=0x00007f02ca600000 is an unknown value
R14=0x00007f01a73a3f00: <offset 0x2a57af00> in /tmp/cudf760737925635621984.so at 0x00007f017ce29000
R15=0x0000000000000001 is an unknown value


Stack: [0x00007f02c9aff000,0x00007f02c9c00000],  sp=0x00007f02c9bfa8d8,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x186598]
C  [libcuda.so.1+0x277b5a]
C  [libcuda.so.1+0x3c183b]
C  [libcuda.so.1+0x3c2101]
C  [libcuda.so.1+0x4faf9b]
C  [libcuda.so.1+0x13b116]
C  [libcuda.so.1+0x13b529]
C  [libcuda.so.1+0x13bdc7]
C  [libcuda.so.1+0x2dbca1]
C  [cudf760737925635621984.so+0x39eafc1]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.nvidia.spark.rapids.udf.hive.StringWordCount.countWords(J)J+0
j  com.nvidia.spark.rapids.udf.hive.StringWordCount.evaluateColumnar(I[Lai/rapids/cudf/ColumnVector;)Lai/rapids/cudf/ColumnVector;+141
j  com.nvidia.spark.rapids.GpuUserDefinedFunction.$anonfun$columnarEval$4(Lcom/nvidia/spark/rapids/GpuUserDefinedFunction;Lorg/apache/spark/sql/vectorized/ColumnarBatch;[Lai/rapids/cudf/ColumnVector;Lai/rapids/cudf/NvtxRange;)Lcom/nvidia/spark/rapids/GpuColumnVector;+11
j  com.nvidia.spark.rapids.GpuUserDefinedFunction$$Lambda$3508.apply(Ljava/lang/Object;)Ljava/lang/Object;+16
j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuUserDefinedFunction.$anonfun$columnarEval$2(Lcom/nvidia/spark/rapids/GpuUserDefinedFunction;Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lscala/collection/Seq;)Lcom/nvidia/spark/rapids/GpuColumnVector;+64
j  com.nvidia.spark.rapids.GpuUserDefinedFunction$$Lambda$3506.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  com.nvidia.spark.rapids.Arm$.withResource(Lscala/collection/Seq;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuUserDefinedFunction.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+34
j  com.nvidia.spark.rapids.GpuUserDefinedFunction.columnarEval$(Lcom/nvidia/spark/rapids/GpuUserDefinedFunction;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+2
j  org.apache.spark.sql.hive.rapids.GpuHiveSimpleUDF.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+2
j  com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
j  com.nvidia.spark.rapids.GpuAlias.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+11
j  com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
j  com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lorg/apache/spark/sql/catalyst/expressions/Expression;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
j  com.nvidia.spark.rapids.GpuProjectExec$$$Lambda$3503.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(Lscala/collection/mutable/Builder;Lscala/Function1;Ljava/lang/Object;)V+6
j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(Lscala/collection/mutable/Builder;Lscala/Function1;Ljava/lang/Object;)Ljava/lang/Object;+3
j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely$$Lambda$3504.apply(Ljava/lang/Object;)Ljava/lang/Object;+9
J 9093 C2 scala.collection.immutable.List.foreach(Lscala/Function1;)V (32 bytes) @ 0x00007f030276a274 [0x00007f030276a1c0+0xb4]
j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(Lscala/collection/SeqLike;Lscala/Function1;Lscala/collection/generic/CanBuildFrom;)Ljava/lang/Object;+16
j  com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(Lscala/Function1;)Lscala/collection/Seq;+12
j  com.nvidia.spark.rapids.GpuProjectExec$.project(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lscala/collection/Seq;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+44
j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$project$2(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lscala/collection/Seq;Lai/rapids/cudf/NvtxRange;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+5
j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$3499.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuTieredProject.recurse$2(Lscala/collection/Seq;Lorg/apache/spark/sql/vectorized/ColumnarBatch;Z)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+79
j  com.nvidia.spark.rapids.GpuTieredProject.project(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+7
j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$5(Lcom/nvidia/spark/rapids/GpuTieredProject;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+2
j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$3498.apply()Ljava/lang/Object;+8
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(Lscala/collection/Seq;Lscala/Function0;)Ljava/lang/Object;+1
j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$4(Lcom/nvidia/spark/rapids/GpuTieredProject;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+14
j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$3497.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$3(Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+15
j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$3493.apply()Ljava/lang/Object;+8
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.next()Ljava/lang/Object;+7
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next()Ljava/lang/Object;+116
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next()Ljava/lang/Object;+18
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(Lscala/collection/Iterator;)Ljava/lang/Object;+18
j  com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(Lscala/Function0;)Ljava/lang/Object;+18
j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$1(Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Lscala/Option;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+24
j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$3488.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  com.nvidia.spark.rapids.Arm$.withResource(Lscala/Option;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuTieredProject.projectWithRetrySingleBatchInternal(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Z)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+37
j  com.nvidia.spark.rapids.GpuTieredProject.projectAndCloseWithRetrySingleBatch(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+3
j  com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$2(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/NvtxWithMetrics;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+16
j  com.nvidia.spark.rapids.GpuProjectExec$$Lambda$3470.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$1(Lcom/nvidia/spark/rapids/GpuMetric;Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/GpuMetric;Lcom/nvidia/spark/rapids/GpuMetric;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+41
j  com.nvidia.spark.rapids.GpuProjectExec$$Lambda$3432.apply(Ljava/lang/Object;)Ljava/lang/Object;+20
J 7502 C2 scala.collection.Iterator$$anon$10.next()Ljava/lang/Object; (19 bytes) @ 0x00007f03024010c0 [0x00007f0302401060+0x60]
j  com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(Lcom/nvidia/spark/rapids/ColumnarToRowIterator;Ljava/lang/Object;Lcom/nvidia/spark/rapids/NvtxWithMetrics;)Lscala/None$;+24
j  com.nvidia.spark.rapids.ColumnarToRowIterator$$Lambda$3438.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
j  com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch()Lscala/Option;+51
j  com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch()V+10
J 10272 C2 com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext()Z (62 bytes) @ 0x00007f0302a22ccc [0x00007f0302a22c60+0x6c]
J 7506 C2 scala.collection.Iterator$$anon$10.hasNext()Z (10 bytes) @ 0x00007f0302388664 [0x00007f0302388620+0x44]
j  org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(ZILscala/collection/Iterator;)Lscala/collection/Iterator;+199
j  org.apache.spark.sql.execution.SparkPlan$$Lambda$2626.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(Lscala/Function1;Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+2
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;+7
j  org.apache.spark.rdd.RDD$$Lambda$2627.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j  org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j  org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+42
j  org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+203
j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;Lscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+226
j  org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+36
j  org.apache.spark.executor.Executor$TaskRunner$$Lambda$2582.apply()Ljava/lang/Object;+8
j  org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j  org.apache.spark.executor.Executor$TaskRunner.run()V+443
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
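
A few values in the dump above can be cross-checked with simple address arithmetic. The Python sketch below is only a reading aid, not part of the original report. The register values, mapping base, and offset are copied from the log. The instruction decoding (`0f 2b 50 d0` at the faulting pc read as a `movntps` store to `[rax-0x30]`) and the interpretation of `ERR` as an x86-64 page-fault error code are assumptions made for this illustration.

```python
# Values copied verbatim from the hs_err log above.
RAX = 0x00007F01A73A3F40
RDI = 0x00007F01A73A3F00
SI_ADDR = 0x00007F01A73A3F10
ERR = 0x7
CUDF_SO_BASE = 0x00007F017CE29000  # "/tmp/cudf...so at 0x00007f017ce29000"
CUDF_SO_OFFSET = 0x2A57AF00        # "<offset 0x2a57af00>" reported for RDI

# Assumption: the bytes at pc (0f 2b 50 d0) decode to `movntps [rax-0x30], xmm2`,
# a 16-byte store whose displacement is the signed byte 0xd0.
DISP8 = 0xD0 - 0x100  # -0x30


def decode_page_fault_error(err: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(err & 0x1),  # the page was present
        "write": bool(err & 0x2),                 # the access was a write
        "user_mode": bool(err & 0x4),             # the access came from user mode
    }


store_target = RAX + DISP8

# The computed store address should match the faulting address in siginfo.
print(store_target == SI_ADDR)                 # True
# The mapping base plus the reported offset should reproduce RDI.
print(CUDF_SO_BASE + CUDF_SO_OFFSET == RDI)    # True
print(decode_page_fault_error(ERR))
```

Under these assumptions, the faulting store targets an address 0x10 bytes past RDI, which the JVM attributes to the mapping of the extracted cudf library, and the error code indicates a user-mode write to a present page, which matches `SEGV_ACCERR`. This shows where the bad destination pointed, not why that pointer was passed to the driver.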

Steps/Code to reproduce bug
build and test case at: https://github.com/NVIDIA/spark-rapids-examples/blob/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs/README.md#building-and-run-the-tests-without-native-code-examples
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/rapids_udf_test.py

pxLi added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Dec 9, 2024
@pxLi
Collaborator Author

pxLi commented Dec 9, 2024

Since this failed with 24.12, I opened the ticket in spark-rapids rather than in the examples repo so that we can determine the impact first.

cc @sameerz

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Dec 10, 2024
@jihoonson
Collaborator

Thanks @pxLi for filing the issue. Looking into it.

@jihoonson
Collaborator

I was able to reproduce an issue locally that looks quite similar to the one reported here. The stack trace and the error message are not exactly the same, but the same test fails in the same Docker container that the Jenkins job uses. Here is the error from my local setup. Note that the pc is null in the output below, whereas it is non-null in the report above.

...
PASSED         [ 25%]
src/main/python/rapids_udf_test.py::test_hive_simple_udf_native 24/12/28 02:31:47 WARN GpuOverrides:
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> s#34 could run on GPU

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=5045, tid=0x0000764457fff700
#
# JRE version: OpenJDK Runtime Environment (8.0_432) (build 1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga)
# Java VM: OpenJDK 64-Bit Server VM (25.432-bga mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  0x0000000000000000
...

I managed to narrow down when this error started. It seems to have been caused by NVIDIA/cccl#2266. With the exact same versions of CUDA, cudf, spark-rapids-jni, and the plugin, the test passes with any cccl older than commit NVIDIA/cccl@f53e72555 but fails with cccl at or after that commit. I am now trying to reproduce the issue in a cudf C++ unit test.
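
The comment above does not say how the commit range was narrowed down. One common approach is a binary search over the commit history, which the following Python sketch illustrates. The function name `first_bad_commit`, the `is_good` callback, and the commit labels are all hypothetical. In practice `is_good` would rebuild against the given cccl commit and run `test_hive_simple_udf_native`.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first commit for which is_good() is False.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history: these labels are placeholders, not real cccl commits.
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
# Stand-in for "build and run the test": commits before index 3 pass.
first_bad = first_bad_commit(history, lambda c: history.index(c) < 3)
print(first_bad)  # c3
```

Each probe costs one full build and test run, so the search needs roughly log2(N) runs for N candidate commits.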

@jihoonson
Collaborator

I tried to reproduce this issue in a cudf C++ unit test, a cudf Java unit test, a spark-rapids-jni C++ unit test, and a spark-rapids-jni Java unit test. I captured the input used by a failed run as a Parquet file, copied the exact source code from the examples repo, and added a unit test that reads the captured Parquet file and calls the native UDF. However, I was not able to reproduce it in any of these unit tests. Based on this, I think the problem is likely in the examples repo rather than in cccl, cudf, or the plugin.

While looking at the logs, I noticed one thing: cudf, spark-rapids-jni, and spark-rapids-examples all use cccl, but each uses a different version. In particular, spark-rapids-examples was using 2.5.0, which is badly outdated. After I updated spark-rapids-examples to use cccl 2.7.0, the error went away in my test environment. This is consistent with the recent history of Jenkins runs, which have been passing since they were updated to use the most recent versions of cudf and the plugin. I think we can close this issue for now and reopen it if the problem comes back. @pxLi @mattahrens let me know what you think.
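
A version skew like this could be caught before it reaches a test run. The Python sketch below shows one possible check, assuming the pinned cccl version of each component has already been collected into a dictionary by some other means. The helper names `parse_version` and `outdated_components` are hypothetical. The only version numbers used are the two mentioned in this thread (2.5.0 and 2.7.0).

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.7.0' into a comparable tuple (2, 7, 0)."""
    return tuple(int(part) for part in version.split("."))


def outdated_components(pins: Dict[str, str], minimum: str) -> List[str]:
    """Return the names of components whose pinned version is below `minimum`."""
    floor = parse_version(minimum)
    return sorted(name for name, pinned in pins.items() if parse_version(pinned) < floor)


# Before the fix described above: the examples repo pinned cccl 2.5.0.
print(outdated_components({"spark-rapids-examples": "2.5.0"}, "2.7.0"))
# After updating the examples repo to cccl 2.7.0, nothing is flagged.
print(outdated_components({"spark-rapids-examples": "2.7.0"}, "2.7.0"))
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating "2.10.0" as older than "2.7.0".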

@pxLi
Collaborator Author

pxLi commented Jan 3, 2025

> cudf, spark-rapids-jni, and spark-rapids-examples use cccl, but all different versions.

Yes, the example is built directly against the cudf code rather than relying on the JNI or the plugin.

Thanks! We are good to close this ticket.

pxLi closed this as completed on Jan 3, 2025