[BUG] segfaults seen in cuDF after prefetch calls intermittently #11265

Closed
pxLi opened this issue Jul 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@pxLi
Collaborator

pxLi commented Jul 27, 2024

Describe the bug
First seen in a pre-merge run:

[2024-07-27T07:17:10.771Z] FAILED ../../src/main/python/avro_test.py::test_avro_input_meta[PERFILE-v1][DATAGEN_SEED=1722064586, TZ=UTC, INJECT_OOM] - ConnectionRefusedError: [Errno 111] Connection refused

The JVM crashed. The complete crash report, hs_err_pid1671911.log, follows:

[2024-07-27T07:17:35.605Z] + cat integration_tests/target/run_dir-20240727071626-DsZg/hs_err_pid1671911.log
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] # A fatal error has been detected by the Java Runtime Environment:
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] #  SIGSEGV (0xb) at pc=0x00007fe94b91050a, pid=1671911, tid=0x00007fe7d81b1700
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] # JRE version: OpenJDK Runtime Environment (8.0_412-b08) (build 1.8.0_412-8u412-ga-1~20.04.1-b08)
[2024-07-27T07:17:35.606Z] # Java VM: OpenJDK 64-Bit Server VM (25.412-b08 mixed mode linux-amd64 compressed oops)
[2024-07-27T07:17:35.606Z] # Problematic frame:
[2024-07-27T07:17:35.606Z] # C  [libstdc++.so.6+0xc250a]  std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)+0x12a
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-9827-ci-2/integration_tests/target/run_dir-20240727071626-DsZg/core or core.1671911
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] # If you would like to submit a bug report, please visit:
[2024-07-27T07:17:35.606Z] #   http://bugreport.java.com/bugreport/crash.jsp
[2024-07-27T07:17:35.606Z] # The crash happened outside the Java Virtual Machine in native code.
[2024-07-27T07:17:35.606Z] # See problematic frame for where to report the bug.
[2024-07-27T07:17:35.606Z] #
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] ---------------  T H R E A D  ---------------
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] Current thread (0x00007fe890021800):  JavaThread "Executor task launch worker for task 2.0 in stage 3.0 (TID 14)" daemon [_thread_in_native, id=1672943, stack(0x00007fe7d80b1000,0x00007fe7d81b2000)]
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000010
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] Registers:
[2024-07-27T07:17:35.606Z] RAX=0x00007fe7902c75f0, RBX=0x00007fe7882f3b90, RCX=0x00007fe79058d2e0, RDX=0x0000000000000000
[2024-07-27T07:17:35.606Z] RSP=0x00007fe7d81adce8, RBP=0x00007fe807ffa088, RSI=0x00007fe7882f3b90, RDI=0x0000000000000000
[2024-07-27T07:17:35.606Z] R8 =0x00007fe807ffa090, R9 =0x00007fe7902c75f0, R10=0x00007fe79058d310, R11=0x0000000000000000
[2024-07-27T07:17:35.606Z] R12=0x00007fe79058d2e0, R13=0x00007fe7882f3b90, R14=0x0000000000000006, R15=0x00007fe807ffa090
[2024-07-27T07:17:35.606Z] RIP=0x00007fe94b91050a, EFLAGS=0x0000000000010206, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
[2024-07-27T07:17:35.606Z]   TRAPNO=0x000000000000000e
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] Top of Stack: (sp=0x00007fe7d81adce8)
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adce8:   00007fe7e2748c16 00007fe79058d300
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adcf8:   00007fe79058d310 00007fe79058d310
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add08:   00007fe7902c75f0 0000000000000015
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add18:   000000000000001d 0000000000000000
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add28:   0000000000000000 0000000000000000
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add38:   000000000000001d 00007fe7d81addb0
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add48:   00007fe7882f3b90 ffffffff7fffffff
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add58:   0000000000000006 000000000000001d
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add68:   00007fe7e27494cc 00007fe7d81addb0
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add78:   00007fe807ffa090 00007fe807ffa080
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add88:   00007fe7d81adda0 00007fe8a65cf560
[2024-07-27T07:17:35.606Z] 0x00007fe7d81add98:   00007fe7d81adda0 00007fe7d81addb0
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adda8:   0000000000000000 0000726568746100
[2024-07-27T07:17:35.606Z] 0x00007fe7d81addb8:   0000000000000000 0000000000000003
[2024-07-27T07:17:35.606Z] 0x00007fe7d81addc8:   0000000000000002 00007fe7e31ebeee
[2024-07-27T07:17:35.606Z] 0x00007fe7d81addd8:   0000000000000006 0000000328011600
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adde8:   000000000000af60 0000000000000000
[2024-07-27T07:17:35.606Z] 0x00007fe7d81addf8:   00007fe7e2749683 0000000328006400
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade08:   0000000000000000 0000000328011600
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade18:   0000000000000002 00007fe7d81ae020
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade28:   00007fe7d81ae530 000000000000af60
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade38:   0000000328001000 00007fe7d81ae530
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade48:   00007fe7e27497ca 00007fe7d81ae020
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade58:   00007fe7d81ae530 000000000000af60
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade68:   00007fe7e019c95d 0000000000000006
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade78:   00007fe807de9f60 000000000000af60
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade88:   000000000000af60 0000000000000000
[2024-07-27T07:17:35.606Z] 0x00007fe7d81ade98:   00000000000000f4 00007fe7d81adf80
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adea8:   0000000328011600 00007fe8a49c5c50
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adeb8:   00007fe81402a520 0000000000000002
[2024-07-27T07:17:35.606Z] 0x00007fe7d81adec8:   8000000000000006 0000000000000000
[2024-07-27T07:17:35.606Z] 0x00007fe7d81aded8:   0000000000000000 0000000000000000 
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] Instructions: (pc=0x00007fe94b91050a)
[2024-07-27T07:17:35.606Z] 0x00007fe94b9104ea:   84 00 00 00 00 00 48 39 72 10 0f 84 d6 00 00 00
[2024-07-27T07:17:35.606Z] 0x00007fe94b9104fa:   c7 02 01 00 00 00 48 8b 50 18 c7 00 00 00 00 00
[2024-07-27T07:17:35.606Z] 0x00007fe94b91050a:   48 8b 4a 10 48 89 48 18 48 85 c9 74 04 48 89 41
[2024-07-27T07:17:35.606Z] 0x00007fe94b91051a:   08 48 8b 48 08 48 89 4a 08 49 3b 40 08 0f 84 83 
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] Register to memory mapping:
[2024-07-27T07:17:35.606Z] 
[2024-07-27T07:17:35.606Z] RAX=0x00007fe7902c75f0 is an unknown value
[2024-07-27T07:17:35.606Z] RBX=0x00007fe7882f3b90 is an unknown value
[2024-07-27T07:17:35.606Z] RCX=0x00007fe79058d2e0 is an unknown value
[2024-07-27T07:17:35.606Z] RDX=0x0000000000000000 is an unknown value
[2024-07-27T07:17:35.606Z] RSP=0x00007fe7d81adce8 is pointing into the stack for thread: 0x00007fe890021800
[2024-07-27T07:17:35.606Z] RBP=0x00007fe807ffa088: <offset 0x29947088> in /tmp/cudf7436611480387827253.so at 0x00007fe7de6b3000
[2024-07-27T07:17:35.606Z] RSI=0x00007fe7882f3b90 is an unknown value
[2024-07-27T07:17:35.606Z] RDI=0x0000000000000000 is an unknown value
[2024-07-27T07:17:35.606Z] R8 =0x00007fe807ffa090: <offset 0x29947090> in /tmp/cudf7436611480387827253.so at 0x00007fe7de6b3000
[2024-07-27T07:17:35.607Z] R9 =0x00007fe7902c75f0 is an unknown value
[2024-07-27T07:17:35.607Z] R10=0x00007fe79058d310 is an unknown value
[2024-07-27T07:17:35.607Z] R11=0x0000000000000000 is an unknown value
[2024-07-27T07:17:35.607Z] R12=0x00007fe79058d2e0 is an unknown value
[2024-07-27T07:17:35.607Z] R13=0x00007fe7882f3b90 is an unknown value
[2024-07-27T07:17:35.607Z] R14=0x0000000000000006 is an unknown value
[2024-07-27T07:17:35.607Z] R15=0x00007fe807ffa090: <offset 0x29947090> in /tmp/cudf7436611480387827253.so at 0x00007fe7de6b3000
[2024-07-27T07:17:35.607Z] 
[2024-07-27T07:17:35.607Z] 
[2024-07-27T07:17:35.607Z] Stack: [0x00007fe7d80b1000,0x00007fe7d81b2000],  sp=0x00007fe7d81adce8,  free space=1011k
[2024-07-27T07:17:35.607Z] Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
[2024-07-27T07:17:35.607Z] C  [libstdc++.so.6+0xc250a]  std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)+0x12a
[2024-07-27T07:17:35.607Z] 
[2024-07-27T07:17:35.607Z] Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
[2024-07-27T07:17:35.607Z] j  ai.rapids.cudf.ColumnVector.fromScalar(JI)J+0
[2024-07-27T07:17:35.607Z] j  ai.rapids.cudf.ColumnVector.fromScalar(Lai/rapids/cudf/Scalar;I)Lai/rapids/cudf/ColumnVector;+5
[2024-07-27T07:17:35.607Z] j  org.apache.spark.sql.rapids.GpuInputFileName.$anonfun$columnarEval$1(Lorg/apache/spark/sql/rapids/GpuInputFileName;Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lai/rapids/cudf/Scalar;)Lcom/nvidia/spark/rapids/GpuColumnVector;+5
[2024-07-27T07:17:35.607Z] j  org.apache.spark.sql.rapids.GpuInputFileName$$Lambda$4250.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.607Z] j  org.apache.spark.sql.rapids.GpuInputFileName.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+22
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuAlias.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+11
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lorg/apache/spark/sql/catalyst/expressions/Expression;)Lcom/nvidia/spark/rapids/GpuColumnVector;+8
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$$$Lambda$4191.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(Lscala/collection/mutable/Builder;Lscala/Function1;Ljava/lang/Object;)V+6
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(Lscala/collection/mutable/Builder;Lscala/Function1;Ljava/lang/Object;)Ljava/lang/Object;+3
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely$$Lambda$4192.apply(Ljava/lang/Object;)Ljava/lang/Object;+9
[2024-07-27T07:17:35.607Z] J 8805 C2 scala.collection.immutable.List.foreach(Lscala/Function1;)V (32 bytes) @ 0x00007fe935ce7134 [0x00007fe935ce7080+0xb4]
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(Lscala/collection/SeqLike;Lscala/Function1;Lscala/collection/generic/CanBuildFrom;)Ljava/lang/Object;+16
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(Lscala/Function1;)Lscala/collection/Seq;+12
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.project(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lscala/collection/Seq;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+29
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$2(Lscala/collection/Seq;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lscala/Some;+12
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$$$Lambda$4248.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.projectWithRetrySingleBatch(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Lscala/collection/Seq;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+125
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectAndCloseWithRetrySingleBatch$1(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Lscala/collection/Seq;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+5
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$$$Lambda$4245.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$.projectAndCloseWithRetrySingleBatch(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Lscala/collection/Seq;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+11
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$6(ZLcom/nvidia/spark/rapids/SpillableColumnarBatch;Lscala/collection/Seq;Lai/rapids/cudf/NvtxRange;)Lcom/nvidia/spark/rapids/SpillableColumnarBatch;+9
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuTieredProject$$Lambda$4243.apply(Ljava/lang/Object;)Ljava/lang/Object;+16
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuTieredProject.recurse$1(Lscala/collection/Seq;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Z)Lcom/nvidia/spark/rapids/SpillableColumnarBatch;+80
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuTieredProject.projectWithRetrySingleBatchInternal(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Z)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+56
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuTieredProject.projectAndCloseWithRetrySingleBatch(Lcom/nvidia/spark/rapids/SpillableColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+3
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$2(Lorg/apache/spark/sql/vectorized/ColumnarBatch;Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/NvtxWithMetrics;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+16
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.GpuProjectExec$$Lambda$4229.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
[2024-07-27T07:17:35.607Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$1(Lcom/nvidia/spark/rapids/GpuMetric;Lcom/nvidia/spark/rapids/GpuTieredProject;Lcom/nvidia/spark/rapids/GpuMetric;Lcom/nvidia/spark/rapids/GpuMetric;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+41
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.GpuProjectExec$$Lambda$4040.apply(Ljava/lang/Object;)Ljava/lang/Object;+20
[2024-07-27T07:17:35.608Z] J 7909 C2 scala.collection.Iterator$$anon$10.next()Ljava/lang/Object; (19 bytes) @ 0x00007fe935567900 [0x00007fe9355678a0+0x60]
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(Lcom/nvidia/spark/rapids/ColumnarToRowIterator;Ljava/lang/Object;Lcom/nvidia/spark/rapids/NvtxWithMetrics;)Lscala/None$;+24
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.ColumnarToRowIterator$$Lambda$4048.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch()Lscala/Option;+51
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch()V+10
[2024-07-27T07:17:35.608Z] j  com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext()Z+30
[2024-07-27T07:17:35.608Z] J 7886 C2 scala.collection.Iterator$$anon$10.hasNext()Z (10 bytes) @ 0x00007fe93540d664 [0x00007fe93540d620+0x44]
[2024-07-27T07:17:35.608Z] j  org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(ZILscala/collection/Iterator;)Lscala/collection/Iterator;+199
[2024-07-27T07:17:35.608Z] j  org.apache.spark.sql.execution.SparkPlan$$Lambda$3397.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(Lscala/Function1;Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+2
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;+7
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.RDD$$Lambda$3400.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
[2024-07-27T07:17:35.608Z] j  org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+42
[2024-07-27T07:17:35.608Z] j  org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+203
[2024-07-27T07:17:35.608Z] j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;Lscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+226
[2024-07-27T07:17:35.608Z] j  org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+36
[2024-07-27T07:17:35.608Z] j  org.apache.spark.executor.Executor$TaskRunner$$Lambda$2443.apply()Ljava/lang/Object;+8
[2024-07-27T07:17:35.608Z] j  org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
[2024-07-27T07:17:35.608Z] j  org.apache.spark.executor.Executor$TaskRunner.run()V+443
[2024-07-27T07:17:35.608Z] j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
[2024-07-27T07:17:35.608Z] j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
[2024-07-27T07:17:35.608Z] j  java.lang.Thread.run()V+11
[2024-07-27T07:17:35.608Z] v  ~StubRoutines::call_stub
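
A note on reading the report above: the faulting bytes at the pc, `48 8b 4a 10`, decode to `mov rcx, [rdx+0x10]`, and with RDX=0 that load targets address 0x10, which matches si_addr. In other words, the rebalance step followed a null node pointer. The sketch below is a hypothetical mirror of a red-black tree node base (not the libstdc++ header itself), assuming the usual libstdc++ member order of color, parent, left, right on a 64-bit build. Under that assumption, offset 0x10 is the left-child pointer.

```cpp
#include <cstddef>

// Hypothetical mirror struct for illustration only; the name and layout
// are assumptions modeled on a typical 64-bit red-black tree node base.
struct NodeBaseMirror {
  int color;               // 4 bytes, then 4 bytes of padding on LP64
  NodeBaseMirror* parent;  // offset 0x08
  NodeBaseMirror* left;    // offset 0x10
  NodeBaseMirror* right;   // offset 0x18
};

// Address that a load of `left` would touch if the node pointer were null.
constexpr std::size_t left_offset() { return offsetof(NodeBaseMirror, left); }
```

A null child in the middle of a rebalance is what a tree looks like after unsynchronized writers have interleaved, although the report alone does not prove that cause.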

Steps/Code to reproduce bug
Does not always reproduce (intermittent, and unrelated to DATAGEN_SEED).

INCLUDE_SPARK_AVRO_JAR=true TEST='avro_test.py' ./integration_tests/run_pyspark_from_build.sh

Expected behavior
The test passes.


@pxLi added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jul 27, 2024
@pxLi changed the title from "[BUG] test_avro_input_meta[PERFILE-v1][DATAGEN_SEED=1722064586, TZ=UTC, INJECT_OOM] core dump" to "[BUG] test_avro_input_meta[PERFILE-v1][DATAGEN_SEED=1722064586, TZ=UTC, INJECT_OOM] core dump intermittently" on Jul 27, 2024
@abellina
Collaborator

I can sometimes reproduce this locally. My hunch is that it is related to this line: rapidsai/cudf@e6537de#diff-51063a0e9329a153db632035df960b0626c9a464a852250778f27bedd53b0972R39. I know this line is executed from different threads and that we are mutating the same key without a lock. I added some prints in this area and saw the following:

../../src/main/python/avro_test.py::test_avro_input_meta[PERFILE-v1][DATAGEN_SEED=1722064586, TZ=UTC, INJECT_OOM] 
setting default to not prefetch for column_view::get_data
setting default to not prefetch for column_view::get_data
setting default to not prefetch for mutable_column_view::get_data
setting default to not prefetch for gather
setting default to not prefetch for gather

The repeated lines mean that multiple threads reached this code. STL containers are not thread safe for concurrent modification, so one possibility is that in some unlucky cases two threads race in the wrong place inside the STL and corrupt memory.
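
As a minimal sketch of the pattern described above, assuming a process-wide `std::map` that inserts a default the first time a key is looked up: `DefaultsTable`, `get_or_default`, and `run_concurrent_lookups` are hypothetical names for illustration, not the cuDF source. Because the first lookup of a key mutates the map, the whole call has to hold the lock; without it, two threads can insert into the red-black tree at the same time.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared table of per-key defaults.
class DefaultsTable {
 public:
  // Returns the flag for `key`, inserting `false` on first use.
  // The lock covers both the lookup and the possible insert.
  bool get_or_default(const std::string& key) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = values_.find(key);
    if (it == values_.end()) {
      it = values_.emplace(key, false).first;
    }
    return it->second;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return values_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, bool> values_;
};

// Looks up the keys from the debug output above from several threads
// and returns how many distinct keys ended up in the table.
std::size_t run_concurrent_lookups(DefaultsTable& table, int num_threads) {
  const std::vector<std::string> keys = {
      "column_view::get_data", "mutable_column_view::get_data", "gather"};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&table, &keys] {
      for (int i = 0; i < 1000; ++i) {
        for (const auto& key : keys) {
          table.get_or_default(key);
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return table.size();
}
```

Removing the `std::lock_guard` from `get_or_default` turns this into a data race, which is undefined behavior and may or may not crash on any given run. That would be consistent with the intermittent failures reported here.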

@abellina
Collaborator

Confirming that a patch that adds a lock fixes the segfault: it no longer reproduces after several iterations. I am going to open a PR against cuDF 24.08.
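
One way such a lock could look, as a sketch under assumptions rather than the actual patch: a reader/writer lock (`std::shared_mutex`, C++17) lets lookups of existing keys proceed in parallel and takes the exclusive lock only for the first insert of a key. `SharedDefaultsTable` and `stress` are hypothetical names. The `stress` helper repeats the multi-threaded first-touch scenario, similar in spirit to rerunning the test several times.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical table guarded by a reader/writer lock.
class SharedDefaultsTable {
 public:
  bool get_or_default(const std::string& key) {
    {
      // Fast path: shared lock, read-only lookup.
      std::shared_lock<std::shared_mutex> read_lock(mutex_);
      auto it = values_.find(key);
      if (it != values_.end()) return it->second;
    }
    // Slow path: exclusive lock. try_emplace leaves the map unchanged if
    // another thread inserted the key between the two locks.
    std::unique_lock<std::shared_mutex> write_lock(mutex_);
    return values_.try_emplace(key, false).first->second;
  }

  std::size_t size() const {
    std::shared_lock<std::shared_mutex> read_lock(mutex_);
    return values_.size();
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, bool> values_;
};

// Runs the first-touch scenario `iterations` times with a fresh table and
// returns how many runs ended with exactly one entry per key.
int stress(int iterations, int num_threads) {
  const std::vector<std::string> keys = {
      "column_view::get_data", "mutable_column_view::get_data", "gather"};
  int ok = 0;
  for (int i = 0; i < iterations; ++i) {
    SharedDefaultsTable table;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&table, &keys] {
        for (const auto& key : keys) table.get_or_default(key);
      });
    }
    for (auto& w : workers) w.join();
    if (table.size() == keys.size()) ++ok;
  }
  return ok;
}
```

A clean stress run does not prove the absence of races; it only fails to reproduce one. A tool such as ThreadSanitizer gives stronger evidence.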

@pxLi
Collaborator Author

pxLi commented Jul 30, 2024

JNI with the fix has been deployed.

@pxLi closed this as completed on Jul 30, 2024
@pxLi
Collaborator Author

pxLi commented Jul 30, 2024

Reopening for 24.10; the new JNI is still building.

@pxLi reopened this on Jul 30, 2024
@pxLi changed the title from "[BUG] test_avro_input_meta[PERFILE-v1][DATAGEN_SEED=1722064586, TZ=UTC, INJECT_OOM] core dump intermittently" to "[BUG] segfaults seen in cuDF after prefetch calls intermittently" on Jul 30, 2024
@pxLi
Collaborator Author

pxLi commented Jul 30, 2024

Closing, as the new JNI 24.10.0-SNAPSHOT with rapidsai/cudf#16425 is available in Sonatype.

@pxLi closed this as completed on Jul 30, 2024
@sameerz removed the "? - Needs Triage" (Need team to review and classify) label on Jul 30, 2024