[BUG] java gateway crashed due to hash_aggregate_test case intermittently #7092

Closed
pxLi opened this issue Nov 17, 2022 · 5 comments · Fixed by #7142
Assignees: jlowe
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

pxLi commented Nov 17, 2022

Describe the bug

= 5013 failed, 7103 passed, 692 skipped, 927 xfailed, 14 xpassed, 762 warnings, 3127 errors in 5172.58s (1:26:12) =

failed to connect to java gateway

The Java gateway can crash intermittently (60%+ of runs) and fail all subsequent test cases belonging to that gateway.

One of the failed runs printed out the following debug message:

[2022-11-16T10:16:49.602Z] 22/11/16 10:16:49 ERROR Executor: Exception in task 3.0 in stage 5387.0 (TID 18897)
[2022-11-16T10:16:49.602Z] ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2022-11-16T10:16:49.602Z]      at ai.rapids.cudf.ColumnView.listSortRows(Native Method)
[2022-11-16T10:16:49.602Z]      at ai.rapids.cudf.ColumnView.listSortRows(ColumnView.java:3582)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.doColumnar(collectionOperations.scala:410)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$3(GpuExpressions.scala:261)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$2(GpuExpressions.scala:254)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:253)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:252)
[2022-11-16T10:16:49.602Z]      at org.apache.spark.sql.rapids.GpuSortArray.columnarEval(collectionOperations.scala:371)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:109)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuExpressionsUtils$.columnarEvalToColumn(GpuExpressions.scala:94)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.projectSingle(basicPhysicalOperators.scala:108)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:115)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:216)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:213)
[2022-11-16T10:16:49.602Z]      at scala.collection.immutable.List.foreach(List.scala:392)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:213)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:248)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:115)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:73)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$finalProjectBatch$1(aggregate.scala:520)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:182)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.finalProjectBatch(aggregate.scala:512)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:264)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:241)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:187)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:238)
[2022-11-16T10:16:49.602Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:215)
[2022-11-16T10:16:49.603Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:255)
[2022-11-16T10:16:49.603Z]      at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:178)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.scheduler.Task.run(Task.scala:91)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:819)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:822)
[2022-11-16T10:16:49.603Z]      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2022-11-16T10:16:49.603Z]      at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2022-11-16T10:16:49.603Z]      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:678)
[2022-11-16T10:16:49.603Z]      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-11-16T10:16:49.603Z]      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-11-16T10:16:49.603Z]      at java.lang.Thread.run(Thread.java:748)
[2022-11-16T10:16:49.603Z] 22/11/16 10:16:49 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered


ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error:
ai.rapids.cudf.CudaFatalException: std::bad_alloc:
CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-72-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:121: cudaErrorIllegalAddress an illegal memory access was encountered

It seems this is mostly due to some case (the failures mostly start at different hash_aggregate_test cases) in the gateway failing to get more GPU memory, which shuts down the gateway.
This could also be related to a recent cudf change.

Steps/Code to reproduce bug
The failure was found in the Databricks integration tests running in parallel (parallelism of 4) with a Tesla V100 (16GB).

I also saw this, less frequently, in normal Spark integration test parallel runs (parallelism of 4) with a TITAN V GPU (12GB).

integration_tests/run_pyspark_from_build.sh -k hash_aggregate_test
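
For context on what this exercises: the stack trace above ends in GpuSortArray calling cudf's ColumnView.listSortRows, i.e. sorting LIST columns produced by a collect aggregation. A minimal PySpark sketch of the same shape (illustrative only, not one of the actual failing tests; the column names and data are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
# groupBy + collect_list yields a LIST column; sort_array over that column is
# what maps to cudf's listSortRows on the GPU, where the fatal CUDA error surfaces
df = spark.range(1000).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").agg(f.sort_array(f.collect_list("v")).alias("sorted_vs")).collect()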
pxLi added the bug and "? - Needs Triage" labels on Nov 17, 2022
pxLi added the test label and removed the "? - Needs Triage" label on Nov 17, 2022
pxLi commented Nov 17, 2022

FYI, recent failed runs are based on JNI built with rapidsai/cudf@c574ddf

pxLi changed the title from "[BUG] java gateway crashed during databricks parallel IT run intermittently" to "[BUG] java gateway crashed parallel IT run intermittently" on Nov 17, 2022
pxLi added a commit to pxLi/spark-rapids that referenced this issue Nov 18, 2022
pxLi commented Nov 18, 2022

Confirmed this is reproducible locally with smaller parallelism (Spark 3.1.2 + TITAN V 12GB on my dev machine):
TEST_PARALLEL=<3 or 2> integration_tests/run_pyspark_from_build.sh -k hash_aggregate_test
Even with parallelism=2, it throws:

ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
...
ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
...

The issue also causes hash_aggregate_test to hang forever in the xdist run.

The log is attached here:
issue7092_executor4.log
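
Since the crash is intermittent, a small helper like the one below can rerun the suite until it reproduces. This is only a sketch (the attempt count and TEST_PARALLEL value are arbitrary, and it assumes it is run from the repo root):

import os
import subprocess

env = dict(os.environ, TEST_PARALLEL="2")
for attempt in range(1, 11):
    # rerun the same repro command until the gateway crash shows up
    rc = subprocess.call(
        ["integration_tests/run_pyspark_from_build.sh", "-k", "hash_aggregate_test"],
        env=env)
    if rc != 0:
        print(f"reproduced on attempt {attempt} with exit code {rc}")
        break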

pxLi added a commit to pxLi/spark-rapids that referenced this issue Nov 18, 2022
pxLi changed the title from "[BUG] java gateway crashed parallel IT run intermittently" to "[BUG] java gateway crashed due to hash_aggregate_test case intermittently" on Nov 18, 2022
pxLi added a commit that referenced this issue Nov 18, 2022
… for #7092 [databricks] (#7102)

* Temporarily lower parallelism for #7092

Signed-off-by: Peixin Li <[email protected]>

* Revert "Temporarily lower parallelism for #7092"

This reverts commit c67927b.

* XFAIL test_hash_groupby_collect_partial_replace_with_distinct_fallback

Signed-off-by: Peixin Li <[email protected]>

* skip the case instead of xfail

* xfail new case failure

Signed-off-by: Peixin Li <[email protected]>

Signed-off-by: Peixin Li <[email protected]>
pxLi commented Nov 18, 2022

test_hash_groupby_collect_with_single_distinct also failed intermittently in https://github.com/NVIDIA/spark-rapids/actions/runs/3495199975/jobs/5851695883

Note: this failure is from the 23.02 branch; the cudf commit may not be the same as the 22.12 one.

pxLi commented Nov 18, 2022

We will also keep monitoring this after new cudf commits get in

jlowe self-assigned this on Nov 18, 2022
jlowe commented Nov 18, 2022

I was able to hit an error in test_hash_groupby_collect_partial_replace_with_distinct_fallback under compute-sanitizer:

========= Invalid __global__ read of size 1 bytes
=========     at 0x590 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(T2, T3)
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x7f225ee5526b is out of bounds
=========     and is 52080021 bytes before the nearest allocation at 0x7f2262000000 of size 1610612736 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel_ptsz [0x2d53e6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e16c1b]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x3e54368]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:void thrust::cuda_cub::parallel_for<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base> >&, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long) [0x17cde82]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:void cudf::detail::gather_helper<bool const*, bool*, cudf::detail::input_indexalator>(bool const*, int, bool*, cudf::detail::input_indexalator, cudf::detail::input_indexalator, bool, rmm::cuda_stream_view) [0x17ce262]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<cudf::detail::input_indexalator>(cudf::table_view const&, cudf::detail::input_indexalator, cudf::detail::input_indexalator, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17dc79e]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::gather(cudf::table_view const&, cudf::column_view const&, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17a4171]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::(anonymous namespace)::segmented_sort_by_key_common(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, cudf::detail::(anonymous namespace)::sort_method, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a13c84]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a14778]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f97f6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x769abc97]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e51538]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::(anonymous namespace)::build_output_offsets(cudf::lists_column_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9778c]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9796a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4e374]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::(anonymous namespace)::build_output_offsets(cudf::lists_column_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f97791]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1f9796a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1f98c6a]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x14a4478]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e5654e]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x114f057]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4083e3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4bbe7]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0x114f062]
=========                in /tmp/cudf1795351332239874013.so
=========     Host Frame: [0xffffffffe96f1476]
=========                in 
========= 
22/11/18 15:45:35 ERROR Executor: Exception in task 3.0 in stage 635.0 (TID 1895)
ai.rapids.cudf.CudaFatalException: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
        at ai.rapids.cudf.ColumnView.listSortRows(Native Method)
        at ai.rapids.cudf.ColumnView.listSortRows(ColumnView.java:3582)
        at org.apache.spark.sql.rapids.GpuSortArray.doColumnar(collectionOperations.scala:410)
        at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$3(GpuExpressions.scala:261)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
        at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.GpuBinaryExpression.$anonfun$columnarEval$2(GpuExpressions.scala:254)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed(Arm.scala:73)
        at com.nvidia.spark.rapids.Arm.withResourceIfAllowed$(Arm.scala:71)
        at org.apache.spark.sql.rapids.GpuSortArray.withResourceIfAllowed(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:253)
        at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:252)
        at org.apache.spark.sql.rapids.GpuSortArray.columnarEval(collectionOperations.scala:371)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
        at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:109)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
        at com.nvidia.spark.rapids.GpuExpressionsUtils$.columnarEvalToColumn(GpuExpressions.scala:94)
        at com.nvidia.spark.rapids.GpuProjectExec$.projectSingle(basicPhysicalOperators.scala:108)
        at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:115)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:216)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:213)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:213)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:248)
        at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:115)
        at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:73)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$finalProjectBatch$1(aggregate.scala:520)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:182)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.finalProjectBatch(aggregate.scala:512)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:264)
        at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:241)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:187)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:238)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:215)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:255)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
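
For reference, one way to get a memcheck report like the above is to wrap the repro under compute-sanitizer. The exact invocation used here is not recorded in this issue; the command below is a sketch that assumes TEST_PARALLEL=0 disables xdist so a single local-mode JVM is launched, and uses --target-processes all so the sanitizer follows the processes spawned by the script:

TEST_PARALLEL=0 compute-sanitizer --tool memcheck --target-processes all integration_tests/run_pyspark_from_build.sh -k test_hash_groupby_collect_partial_replace_with_distinct_fallback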

jlowe added the cudf_dependency label and removed the test label on Nov 21, 2022
jlowe added a commit to jlowe/spark-rapids that referenced this issue Nov 22, 2022
…fallback for NVIDIA#7092 [databricks] (NVIDIA#7102)"

This reverts commit c70bcfb.

Signed-off-by: Jason Lowe <[email protected]>
pxLi pushed a commit that referenced this issue Nov 23, 2022
#7142)

* Revert "Skip test_hash_groupby_collect_with_single_distinct (#7107)"

This reverts commit d5ba2e2.

Signed-off-by: Jason Lowe <[email protected]>

* Revert "Skip test_hash_groupby_collect_partial_replace_with_distinct_fallback for #7092 [databricks] (#7102)"

This reverts commit c70bcfb.

Signed-off-by: Jason Lowe <[email protected]>

Signed-off-by: Jason Lowe <[email protected]>