[BUG] Databricks test fails test_hash_groupby_collect_partial_replace_fallback #3339

tgravescs · 2021-08-30T15:24:02Z

10:05:47 FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_collect_partial_replace_fallback[false-false-{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.hasNans': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.hashAgg.replaceMode': 'partial'}-[('a', RepeatSeq(Long)), ('b', RepeatSeq(Boolean)), ('c', LongRange(not_null))]][IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,SortExec,SortArray,Alias,Literal,Count,CollectList,CollectSet,GpuToCpuCollectBufferTransition,CpuToGpuCollectBufferTransition,AggregateExpression)]
and a lot of other cases of that same test.

E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertContains.
10:05:47  E                   : java.lang.AssertionError: assertion failed: Could not find GpuCollectList in the Spark plan
10:05:47  E                   ObjectHashAggregate(keys=[a#1739656L], functions=[collect_list(b#1739657, 0, 0), collect_set(b#1739657, 0, 0)], output=[a#1739656L, sort_array(collect_list(b), true)#1739667, sort_array(collect_set(b), true)#1739668])
10:05:47  E                   +- GpuColumnarToRow false
10:05:47  E                      +- GpuShuffleCoalesce 2147483647
10:05:47  E                         +- GpuCustomShuffleReader coalesced
10:05:47  E                            +- ShuffleQueryStage 0, Statistics(sizeInBytes=4.3 KiB, rowCount=100, isRuntime=true)
10:05:47  E                               +- GpuColumnarExchange gpuhashpartitioning(a#1739656L, 12), true, [id=#218025]
10:05:47  E                                  +- GpuProject [a#1739656L, b#1739657]
10:05:47  E                                     +- GpuProject [a#1739656L, b#1739657]
10:05:47  E                                        +- GpuRowToColumnar targetsize(2147483647)
10:05:47  E                                           +- *(1) Scan ExistingRDD[a#1739656L,b#1739657,c#1739658L]
10:05:47  E                   
10:05:47  E                   	at scala.Predef$.assert(Predef.scala:223)
10:05:47  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback$.assertContains(Plugin.scala:336)
10:05:47  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback$.assertContains(Plugin.scala:341)
10:05:47  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertContains(Plugin.scala)
10:05:47  E                   	at sun.reflect.GeneratedMethodAccessor472.invoke(Unknown Source)
10:05:47  E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
10:05:47  E                   	at java.lang.reflect.Method.invoke(Method.java:498)
10:05:47  E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
10:05:47  E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
10:05:47  E                   	at py4j.Gateway.invoke(Gateway.java:295)
10:05:47  E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
10:05:47  E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
10:05:47  E                   	at py4j.GatewayConnection.run(GatewayConnection.java:251)
10:05:47  E                   	at java.lang.Thread.run(Thread.java:748)
10:05:47


tgravescs · 2021-08-30T15:24:40Z

Perhaps related to #3299
@sperlingxx

sperlingxx · 2021-09-01T10:34:08Z

After some investigation, I found that the test fails specifically on the non-distinct aggregation:

    assert_cpu_and_gpu_are_equal_collect_with_capture(
        lambda spark: gen_df(spark, data_gen, length=100)
            .groupby('a')
            .agg(f.sort_array(f.collect_list('b')), f.sort_array(f.collect_set('b'))),
        exist_classes='CollectList,CollectSet,GpuCollectList,GpuCollectSet',
        conf=local_conf)

It failed because the Databricks (DB) runtime eliminated the map-side combine of the map-reduce aggregation, so the physical plan contains only the reduce-side AggregateExec. The test, however, assumes that two AggregateExecs exist: one on the CPU and one on the GPU.
I am trying to fix the tests, but I found that we need to rework tagForReplaceMode and the corresponding Python tests to adapt to the DB runtime, especially for the distinct-aggregation cases. I am working on it.
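
To make the mismatch concrete, here is a minimal sketch of the kind of check that `exist_classes` implies. It is not the plugin's actual `assertContains` implementation: `missing_classes` is a hypothetical helper, and both plans are hand-written lists of node and expression names (the single-stage one loosely follows the failing plan above, the two-stage one is an assumed shape for the regular runtime).

```python
def missing_classes(plan_classes, exist_classes):
    """Return the expected class names that do not appear in the plan.

    Hypothetical helper for illustration; the real check runs on the JVM
    side in ExecutionPlanCaptureCallback.assertContains.
    """
    present = set(plan_classes)
    return [name for name in exist_classes.split(',') if name not in present]


EXPECTED = 'CollectList,CollectSet,GpuCollectList,GpuCollectSet'

# Assumed shape with a map-side (partial) aggregate on the GPU and a
# reduce-side (final) aggregate on the CPU. Names are illustrative.
two_stage_plan = [
    'ObjectHashAggregate', 'CollectList', 'CollectSet',   # final, CPU
    'GpuColumnarToRow', 'GpuColumnarExchange',
    'GpuHashAggregate', 'GpuCollectList', 'GpuCollectSet',  # partial, GPU
]

# Shape seen in the failing run: the map-side aggregate is gone, so only
# the CPU reduce-side aggregate carries the collect expressions.
single_stage_plan = [
    'ObjectHashAggregate', 'CollectList', 'CollectSet',
    'GpuColumnarToRow', 'GpuColumnarExchange', 'GpuProject',
]

print(missing_classes(two_stage_plan, EXPECTED))
print(missing_classes(single_stage_plan, EXPECTED))
```

With only the reduce-side aggregate left on the CPU, no node in the plan carries the GPU collect expressions, which is consistent with the "Could not find GpuCollectList in the Spark plan" assertion in the log.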

tgravescs added the labels bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) Aug 30, 2021

tgravescs changed the title ~~[BUG] Databricks build fails test_hash_groupby_collect_partial_replace_fallback~~ [BUG] Databricks test fails test_hash_groupby_collect_partial_replace_fallback Aug 30, 2021

tgravescs assigned tgravescs and sperlingxx and unassigned tgravescs Aug 31, 2021

sameerz removed the ? - Needs Triage (Need team to review and classify) label Aug 31, 2021

sperlingxx mentioned this issue Sep 2, 2021

Extend TagForReplaceMode to adapt Databricks runtime #3368

Merged

tgravescs closed this as completed in #3368 Sep 10, 2021
