[BUG] java gateway crashed due to hash_aggregate_test case intermittently #7092

pxLi · 2022-11-17T02:39:59Z

Describe the bug

= 5013 failed, 7103 passed, 692 skipped, 927 xfailed, 14 xpassed, 762 warnings, 3127 errors in 5172.58s (1:26:12) =

failed to connect to java gateway

The Java gateway can crash intermittently (60%+), which fails all subsequent cases that belong to this gateway.

One of the failed runs printed the following debug message:

[2022-11-16T10:16:49.602Z] 22/11/16 10:16:49 ERROR Executor: Exception in task 3.0 in stage 5387.0 (TID 18897)
[2022-11-16T10:16:49.602Z] ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2022-11-16T10:16:49.602Z]      at ai.rapids.cudf.ColumnView.listSortRows(Native Method)
[2022-11-16T10:16:49.602Z]      at ai.rapids.cudf.ColumnView.listSortRows(ColumnView.java:3582)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.doColumnar(collectionOperations.scala:410)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$3(GpuExpressions.scala:261)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$2(GpuExpressions.scala:254)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:253)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:252)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.columnarEval(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:109)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuExpressionsUtils$.columnarEvalToColumn(GpuExpressions.scala:94)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.projectSingle(basicPhysicalOperators.scala:108)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:115)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:216)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:213)
[2022-11-16T10:16:49.602Z]      at scala.collection.immutable.List.foreach(List.scala:392)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:213)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:248)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:115)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$finalProjectBatch$1(aggregate.scala:520)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:182)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.finalProjectBatch(aggregate.scala:512)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:264)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:241)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:187)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:238)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:215)
[2022-11-16T10:16:49.603Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:255)
[2022-11-16T10:16:49.603Z]      at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:178)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.run(Task.scala:91)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:819)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:822)
[2022-11-16T10:16:49.603Z]      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:678)
[2022-11-16T10:16:49.603Z]      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-11-16T10:16:49.603Z]      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-11-16T10:16:49.603Z]      at java.lang.Thread.run(Thread.java:748)
[2022-11-16T10:16:49.603Z] 22/11/16 10:16:49 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered


ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-72-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:121: cudaErrorIllegalAddress an illegal memory access was encountered

This seems to be caused mostly by some case in the gateway failing to get more GPU memory, which then shuts down the gateway (the failures mostly start at different hash_aggregate_test cases).
This could also be related to a recent cudf change.

Steps/Code to reproduce bug
The failure was found in the Databricks IT run in parallel (parallelism of 4) on a Tesla V100 (16GB).

I also saw this, less frequently, in normal Spark IT parallel runs (parallelism of 4) on a TITAN V GPU (12GB).

integration_tests/run_pyspark_from_build.sh -k hash_aggregate_test
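
Because the crash is intermittent, it can help to repeat the command above and count how often it fails. The following is only a sketch under a few assumptions: the helper name `estimate_failure_rate` is hypothetical, `TEST_PARALLEL` is the variable used later in this thread to set parallelism, and a nonzero exit code (or a timeout, since the run can also hang) is treated as a failed run.

```python
import os
import subprocess

# Command taken from this issue; run it from the repository root.
REPRO_CMD = ["integration_tests/run_pyspark_from_build.sh", "-k", "hash_aggregate_test"]


def estimate_failure_rate(runs, parallelism, timeout_s=None, run=subprocess.run):
    """Run the repro command `runs` times and return the fraction that failed.

    Assumption: a nonzero exit code or a timeout counts as a failure.
    `run` is injectable so the loop can be exercised without a GPU.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    env = dict(os.environ, TEST_PARALLEL=str(parallelism))
    failures = 0
    for _ in range(runs):
        try:
            result = run(REPRO_CMD, env=env, timeout=timeout_s)
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # A hung xdist run is counted as a failure.
            failed = True
        failures += int(failed)
    return failures / runs
```

For example, `estimate_failure_rate(10, 2, timeout_s=3 * 3600)` would run the suite ten times with parallelism 2; the three-hour timeout is an arbitrary illustrative value.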


pxLi · 2022-11-17T03:27:52Z

FYI, the recent failed runs are based on a JNI build with rapidsai/cudf@c574ddf.


pxLi · 2022-11-18T01:26:30Z

Confirmed that this is reproducible locally with smaller parallelism (Spark 312 + TITAN V 12GB on my dev machine):
TEST_PARALLEL=<3 or 2> integration_tests/run_pyspark_from_build.sh -k hash_aggregate_test
Even with parallelism=2, it throws:

ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
...
ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
...

The issue also causes hash_aggregate_test to hang forever in xdist runs.

Attached the log here
issue7092_executor4.log
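
To triage an executor log like the attached one, a small script can pull out each fatal CUDA exception together with the first non-native stack frame beneath it. This is a rough sketch, not part of the plugin or its test tooling: the function name `summarize_fatal_cuda_errors` is hypothetical, and it assumes the log layout shown in this issue (an exception line followed by `at ...` frames, optionally ending in a literal `^M`).

```python
import re

# Matches the exception line and captures the message after the class name.
FATAL_RE = re.compile(r"ai\.rapids\.cudf\.CudaFatalException: (.*)$")
# Matches a stack frame such as "at pkg.Class.method(File.java:123)".
FRAME_RE = re.compile(r"\bat (\S+?)\((.*)\)\s*$")


def _clean(line):
    """Drop line endings and a trailing literal '^M' marker, if present."""
    line = line.rstrip("\r\n")
    if line.endswith("^M"):
        line = line[:-2]
    return line


def summarize_fatal_cuda_errors(lines):
    """Return (message, first non-native frame) pairs, in log order."""
    results = []
    pending = None  # message of an exception still waiting for its frame
    for raw in lines:
        line = _clean(raw)
        fatal = FATAL_RE.search(line)
        if fatal:
            pending = fatal.group(1)
            continue
        frame = FRAME_RE.search(line)
        if frame and pending is not None:
            method, location = frame.groups()
            if location == "Native Method":
                continue  # skip the JNI entry, keep looking for a Java frame
            results.append((pending, f"{method}({location})"))
            pending = None
    return results
```

Exception lines that are not followed by any frame, such as the "Stopping the Executor" summary line, are deliberately left out of the result.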


… for #7092 [databricks] (#7102) * Temporarily lower parallelism for #7092 Signed-off-by: Peixin Li <[email protected]> * Revert "Temporarily lower parallelism for #7092" This reverts commit c67927b. * XFAIL test_hash_groupby_collect_partial_replace_with_distinct_fallback Signed-off-by: Peixin Li <[email protected]> * skip the case instead of xfail * xfail new case failure Signed-off-by: Peixin Li <[email protected]> Signed-off-by: Peixin Li <[email protected]>

pxLi · 2022-11-18T08:08:39Z

test_hash_groupby_collect_with_single_distinct also failed intermittently in https://github.com/NVIDIA/spark-rapids/actions/runs/3495199975/jobs/5851695883

Note: this failure is from the 23.02 branch, so the cudf commit may not be the same as the 22.12 one.

pxLi · 2022-11-18T08:21:58Z

We will also keep monitoring this after new cudf commits get in

jlowe · 2022-11-18T16:10:43Z

I was able to hit an error in test_hash_groupby_collect_partial_replace_with_distinct_fallback under compute_sanitizer:

========= Invalid __global__ read of size 1 bytes
=========     at 0x590 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(T2, T3)
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x7f225ee5526b is out of bounds
=========     and is 52080021 bytes before the nearest allocation at 0x7f2262000000 of size 1610612736 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel_ptsz [0x2d53e6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e16c1b]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x3e54368]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:void thrust::cuda_cub::parallel_for<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base> >&, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long) [0x17cde82]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:void cudf::detail::gather_helper<bool const*, bool*, cudf::detail::input_indexalator>(bool const*, int, bool*, cudf::detail::input_indexalator, cudf::detail::input_indexalator, bool, rmm::cuda_stream_view) [0x17ce262]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<cudf::detail::input_indexalator>(cudf::table_view const&, cudf::detail::input_indexalator, cudf::detail::input_indexalator, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17dc79e]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::gather(cudf::table_view const&, cudf::column_view const&, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17a4171]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::(anonymous namespace)::segmented_sort_by_key_common(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, cudf::detail::(anonymous namespace)::sort_method, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a13c84]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a14778]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f97f6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x769abc97]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e51538]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::(anonymous namespace)::build_output_offsets(cudf::lists_column_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9778c]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9796a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4e374]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::(anonymous namespace)::build_output_offsets(cudf::lists_column_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f97791]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9796a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e5654e]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x114f057]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4bbe7]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x114f062]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
22/11/18 15:45:35 ERROR Executor: Exception in task 3.0 in stage 635.0 (TID 1895)
ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
        at ai.rapids.cudf.ColumnView.listSortRows(Native Method)
        at ai.rapids.cudf.ColumnView.listSortRows(ColumnView.java:3582)
        at org.apache.spark.sql.rapids.GpuSortArray.doColumnar(collectionOperations.scala:410)
        at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$3(GpuExpressions.scala:261)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
        at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$2(GpuExpressions.scala:254)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
        at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:253)
        at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:252)
        at org.apache.spark.sql.rapids.GpuSortArray.columnarEval(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
        at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:109)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
        at com.nvidia.spark.rapids.GpuExpressionsUtils$.columnarEvalToColumn(GpuExpressions.scala:94)
        at com.nvidia.spark.rapids.GpuProjectExec$.projectSingle(basicPhysicalOperators.scala:108)
        at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:115)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:216)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:213)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:213)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:248)
        at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:115)
        at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:73)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$finalProjectBatch$1(aggregate.scala:520)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:182)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.finalProjectBatch(aggregate.scala:512)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:264)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:241)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:187)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:238)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:215)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:255)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
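
The numbers in the compute_sanitizer report above can be cross-checked with simple address arithmetic. The sketch below only re-derives values already printed in the report; the helper name `classify_access` is hypothetical.

```python
# Values copied from the compute_sanitizer report.
BAD_ADDRESS = 0x7F225EE5526B   # "Address 0x7f225ee5526b is out of bounds"
ALLOC_START = 0x7F2262000000   # "nearest allocation at 0x7f2262000000"
ALLOC_SIZE = 1610612736        # "of size 1610612736 bytes"


def classify_access(address, start, size):
    """Return ('before' | 'inside' | 'after', distance in bytes).

    The distance is measured to the start of the allocation for 'before'
    and 'inside', and past the end of the allocation for 'after'.
    """
    if address < start:
        return "before", start - address
    if address < start + size:
        return "inside", address - start
    return "after", address - (start + size)


where, distance = classify_access(BAD_ADDRESS, ALLOC_START, ALLOC_SIZE)
print(where, distance)                 # before 52080021
print(ALLOC_SIZE / 2**30, "GiB")       # 1.5 GiB
```

The read lands about 50 MB below a 1.5 GiB allocation, so it is far outside the allocation rather than just off its edge.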


#7142) * Revert "Skip test_hash_groupby_collect_with_single_distinct (#7107)" This reverts commit d5ba2e2. Signed-off-by: Jason Lowe <[email protected]> * Revert "Skip test_hash_groupby_collect_partial_replace_with_distinct_fallback for #7092 [databricks] (#7102)" This reverts commit c70bcfb. Signed-off-by: Jason Lowe <[email protected]> Signed-off-by: Jason Lowe <[email protected]>

pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 17, 2022

pxLi mentioned this issue Nov 17, 2022

Update JNI and cudf-py version to 23.02 [databricks] #7074

Merged

pxLi added test Only impacts tests and removed ? - Needs Triage Need team to review and classify labels Nov 17, 2022

pxLi changed the title ~~[BUG] java gateway crashed during databricks parallel IT run intermittently~~ [BUG] java gateway crashed parallel IT run intermittently Nov 17, 2022

pxLi added a commit to pxLi/spark-rapids that referenced this issue Nov 18, 2022

Temporarily lower parallelism for NVIDIA#7092

c67927b

Signed-off-by: Peixin Li <[email protected]>

This was referenced Nov 18, 2022

Skip test_hash_groupby_collect_partial_replace_with_distinct_fallback for #7092 [databricks] #7102

Merged

[WIP] - Databricks 11.3 Runtime - compile only #7100

Closed

pxLi added a commit to pxLi/spark-rapids that referenced this issue Nov 18, 2022

Revert "Temporarily lower parallelism for NVIDIA#7092"

7d8efa2

This reverts commit c67927b.

pxLi changed the title ~~[BUG] java gateway crashed parallel IT run intermittently~~ [BUG] java gateway crashed due to hash_aggregate_test case intermittently Nov 18, 2022

pxLi mentioned this issue Nov 18, 2022

Skip test_hash_groupby_collect_with_single_distinct [skip ci] #7107

Merged

jlowe self-assigned this Nov 18, 2022

jlowe mentioned this issue Nov 18, 2022

[BUG] Illegal memory access while sorting lists rapidsai/cudf#12201

Closed

sameerz mentioned this issue Nov 21, 2022

[BUG] CPU mismatch GPU result in test_hash_groupby_collect_with_single_distinct intermittently #7104

Closed

jlowe added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed test Only impacts tests labels Nov 21, 2022

jlowe mentioned this issue Nov 22, 2022

Restore hash aggregate tests after cub segmented sort fix [databricks] #7142

Merged

pxLi closed this as completed in #7142 Nov 23, 2022
