
[BUG] UCX join_test FAILED on spark standalone #3334

Closed
NvTimLiu opened this issue Aug 30, 2021 · 2 comments · Fixed by #3412
Assignees: jlowe
Labels: bug (Something isn't working) · cudf_dependency (An issue or PR with this label depends on a new feature in cudf) · P0 (Must have for release)

Comments

NvTimLiu (Collaborator) commented Aug 30, 2021

Describe the bug
The test fails due to a cuDF failure: 'cpp/include/cudf/detail/utilities/cuda.cuh:65: num_blocks must be > 0'
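
For context, the assertion at cuda.cuh:65 fires when cuDF sizes a kernel launch grid from a zero-element input, so zero blocks get requested. Below is a minimal sketch of that sizing arithmetic, paraphrased in Python; the real check lives in cuDF's C++ grid helper, so the function name and default block size here are illustrative assumptions:

```python
def grid_1d_num_blocks(num_elements: int, block_size: int = 256) -> int:
    """Illustrative stand-in for cuDF's 1-D launch-grid sizing.

    The grid is sized with ceiling division, then validated before launch.
    """
    num_blocks = (num_elements + block_size - 1) // block_size  # ceil division
    if num_blocks <= 0:
        # The condition behind the reported error: a zero-row input
        # yields an empty grid, so no kernel can be launched.
        raise ValueError("num_blocks must be > 0")
    return num_blocks

print(grid_1d_num_blocks(50))  # -> 1
# grid_1d_num_blocks(0) would raise ValueError: num_blocks must be > 0,
# which matches how a zero-row batch reaching the join kernel fails.
```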

____ test_left_broadcast_nested_loop_join_with_ast_condition[1g-Timestamp] _____
 
 data_gen = Timestamp, batch_size = '1g'
 
     @ignore_order(local=True)
     @pytest.mark.parametrize('data_gen', ast_gen, ids=idfn)
     @pytest.mark.parametrize('batch_size', ['100', '1g'], ids=idfn) # set the batch size so we can test multiple stream batches
     def test_left_broadcast_nested_loop_join_with_ast_condition(data_gen, batch_size):
         def do_join(spark):
             left, right = create_df(spark, data_gen, 50, 25)
             # This test is impacted by https://github.com/NVIDIA/spark-rapids/issues/294
             # if the sizes are large enough to have both 0.0 and -0.0 show up 500 and 250
             # but these take a long time to verify so we run with smaller numbers by default
             # that do not expose the error
             return broadcast(left).join(right, (left.b >= right.r_b), 'Right')
         conf = {'spark.rapids.sql.batchSizeBytes': batch_size}
         conf.update(allow_negative_scale_of_decimal_conf)
 >       assert_gpu_and_cpu_are_equal_collect(do_join, conf=conf)
 
../../src/main/python/join_test.py:383: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/asserts.py:440: in assert_gpu_and_cpu_are_equal_collect
     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:421: in _assert_gpu_and_cpu_are_equal
     run_on_gpu()
../../src/main/python/asserts.py:415: in run_on_gpu
     from_gpu = with_gpu_session(bring_back, conf=conf)
../../src/main/python/spark_session.py:105: in with_gpu_session
     return with_spark_session(func, conf=copy)
../../src/main/python/spark_session.py:70: in with_spark_session
     ret = func(_spark)
../../src/main/python/asserts.py:196: in <lambda>
     bring_back = lambda spark: limit_func(spark).collect()
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
     sock_info = self._jdf.collectToPython()
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
     return_value = get_return_value(
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
     return f(*a, **kw)
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 answer = 'xro1475476'
 gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f2da870e460>
 target_id = 'o1475475', name = 'collectToPython'
 
     def get_return_value(answer, gateway_client, target_id=None, name=None):
         """Converts an answer received from the Java gateway into a Python object.
     
         For example, string representation of integers are converted to Python
         integer, string representation of objects are converted to JavaObject
         instances, etc.
     
         :param answer: the string returned by the Java gateway
         :param gateway_client: the gateway client used to communicate with the Java
             Gateway. Only necessary if the answer is a reference (e.g., object,
             list, map)
         :param target_id: the name of the object from which the answer comes from
             (e.g., *object1* in `object1.hello()`). Optional.
         :param name: the name of the member from which the answer comes from
             (e.g., *hello* in `object1.hello()`). Optional.
         """
         if is_error(answer)[0]:
             if len(answer) > 1:
                 type = answer[1]
                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                 if answer[1] == REFERENCE_TYPE:
 >                   raise Py4JJavaError(
                         "An error occurred while calling {0}{1}{2}.\n".
                         format(target_id, ".", name), value)
                    py4j.protocol.Py4JJavaError: An error occurred while calling o1475475.collectToPython.
                    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36117.0 failed 1 times, most recent failure: Lost task 0.0 in stage 36117.0 (TID 1782978) (10.136.6.4 executor 2): ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-435-cuda11/cpp/include/cudf/detail/utilities/cuda.cuh:65: num_blocks must be > 0
                    	at ai.rapids.cudf.Table.conditionalLeftJoinGatherMaps(Native Method)
                    	at ai.rapids.cudf.Table.conditionalLeftJoinGatherMaps(Table.java:2176)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$computeGatherMaps$6(GpuBroadcastNestedLoopJoinExec.scala:270)
                    	at scala.Option.getOrElse(Option.scala:189)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.computeGatherMaps(GpuBroadcastNestedLoopJoinExec.scala:270)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$3(GpuBroadcastNestedLoopJoinExec.scala:242)
                    	at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:87)
                    	at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:85)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.closeOnExcept(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$2(GpuBroadcastNestedLoopJoinExec.scala:236)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$1(GpuBroadcastNestedLoopJoinExec.scala:235)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.createGatherer(GpuBroadcastNestedLoopJoinExec.scala:234)
                    	at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$4(AbstractGpuJoinIterator.scala:221)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:220)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:80)
                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:223)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:178)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:222)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:199)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:239)
                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
                    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
                    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
                    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
                    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
                    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
                    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
                    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
                    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
                    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
                    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
                    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    	at java.lang.Thread.run(Thread.java:748)
                    
                    Driver stacktrace:
                    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
                    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
                    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
                    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
                    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
                    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
                    	at scala.Option.foreach(Option.scala:407)
                    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
                    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
                    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
                    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
                    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
                    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
                    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
                    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
                    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
                    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
                    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
                    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
                    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
                    	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
                    	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
                    	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
                    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
                    	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
                    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
                    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
                    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
                    	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
                    	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
                    	at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
                    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    	at java.lang.reflect.Method.invoke(Method.java:498)
                    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
                    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
                    	at py4j.Gateway.invoke(Gateway.java:282)
                    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
                    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
                    	at py4j.GatewayConnection.run(GatewayConnection.java:238)
                    	at java.lang.Thread.run(Thread.java:748)
                    Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-435-cuda11/cpp/include/cudf/detail/utilities/cuda.cuh:65: num_blocks must be > 0
                    	at ai.rapids.cudf.Table.conditionalLeftJoinGatherMaps(Native Method)
                    	at ai.rapids.cudf.Table.conditionalLeftJoinGatherMaps(Table.java:2176)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$computeGatherMaps$6(GpuBroadcastNestedLoopJoinExec.scala:270)
                    	at scala.Option.getOrElse(Option.scala:189)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.computeGatherMaps(GpuBroadcastNestedLoopJoinExec.scala:270)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$3(GpuBroadcastNestedLoopJoinExec.scala:242)
                    	at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:87)
                    	at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:85)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.closeOnExcept(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$2(GpuBroadcastNestedLoopJoinExec.scala:236)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.$anonfun$createGatherer$1(GpuBroadcastNestedLoopJoinExec.scala:235)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at org.apache.spark.sql.rapids.execution.ConditionalNestedLoopJoinIterator.createGatherer(GpuBroadcastNestedLoopJoinExec.scala:234)
                    	at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$4(AbstractGpuJoinIterator.scala:221)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:36)
                    	at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:220)
                    	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:80)
                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:223)
                    	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
                    	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:178)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:222)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:199)
                    	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:239)
                    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
                    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
                    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
                    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
                    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
                    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
                    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
                    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
                    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
                    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
                    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
                    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
                    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    	... 1 more
 
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
 ----------------------------- Captured stdout call -----------------------------
 ### CPU RUN ###
 ### GPU RUN ###

Environment details

  • Environment location: EGX06/Standalone/UCX
NvTimLiu added the bug and ? - Needs Triage labels on Aug 30, 2021
NvTimLiu (Collaborator, Author) commented Aug 30, 2021

There is a related cuDF issue: rapidsai/cudf#9044
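
For reference, here is a hedged sketch of a standalone PySpark repro equivalent to the failing test. It assumes a session already configured with the RAPIDS Accelerator plugin; the column names, row counts, and the cast used to produce timestamp data are assumptions adapted from the captured test source:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Assumes the RAPIDS Accelerator jars and spark.plugins=com.nvidia.spark.SQLPlugin
# are already configured for this session.
spark = (SparkSession.builder
         .appName("cudf-9044-repro-sketch")
         # Matches the failing '1g' parametrization; the test also runs with
         # '100' to force multiple stream batches.
         .config("spark.rapids.sql.batchSizeBytes", "1g")
         .getOrCreate())

# Stand-ins for create_df(spark, Timestamp, 50, 25) from the test harness.
left = spark.range(50).selectExpr("CAST(id AS TIMESTAMP) AS b")
right = spark.range(25).selectExpr("CAST(id AS TIMESTAMP) AS r_b")

# Broadcast nested-loop right join with an AST condition, as in the test.
# On unfixed cuDF builds, a degenerate (zero-row) batch reaching the
# conditional join kernel trips the num_blocks assertion.
result = broadcast(left).join(right, left.b >= right.r_b, "right").collect()
print(len(result))
```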

jlowe added the cudf_dependency label on Aug 30, 2021
jlowe self-assigned this on Aug 30, 2021
jlowe added the P0 (Must have for release) label and removed the ? - Needs Triage label on Aug 30, 2021
jlowe (Member) commented Aug 30, 2021

This should be resolved completely by the cuDF-side bugfix.
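
Once a cuDF build containing that fix is picked up, the regression can be checked by re-running just the affected parametrizations; a sketch using pytest's Python API (the relative path comes from the traceback above, and the -k expression is an assumption meant to select the failing tests):

```python
import pytest

# Re-run only the broadcast nested-loop AST-condition join tests.
exit_code = pytest.main([
    "../../src/main/python/join_test.py",
    "-k", "nested_loop_join_with_ast_condition",
    "-v",
])
print("pytest exit code:", exit_code)
```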
