
Add support for arrays in hashaggregate [databricks] #6066

Merged: 17 commits into NVIDIA:branch-22.10 on Sep 1, 2022

Conversation

@razajafri (Collaborator) commented Jul 22, 2022

This PR enables HashAggregate for Arrays.

  • Changed GpuOverrides to remove the check for Arrays
  • Added tests

fixes #4656

Signed-off-by: Raza Jafri <[email protected]>
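For illustration only (not part of the PR description): the kind of query this change allows to stay on the GPU is a group-by whose key is an array column. The DataFrame and column names below are hypothetical.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(['a', 'b'], 1), (['a', 'b'], 2), (['c'], 3)],
    ['arr_key', 'value'])
# Grouping on an array key previously forced a fallback to the CPU; with
# this change the HashAggregateExec for it can be replaced on the GPU.
df.groupBy('arr_key').agg(f.sum('value')).show()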

@sameerz added the 'feature request' (New feature or request) label on Jul 25, 2022
@sameerz requested review from jlowe, abellina and mythrocks on Jul 25, 2022
@sameerz (Collaborator) commented Aug 2, 2022

@razajafri, when you get a chance, can this be retargeted to 22.10 please?

@razajafri changed the base branch from branch-22.08 to branch-22.10 on Aug 2, 2022
@razajafri marked this pull request as ready for review on Aug 22, 2022
@razajafri changed the title from "Add support for arrays in hashaggregate" to "Add support for arrays in hashaggregate [databricks]" on Aug 22, 2022
@razajafri requested a review from jlowe on Aug 22, 2022
@@ -335,7 +339,8 @@ def test_hash_reduction_decimal_overflow_sum(precision):
     # some optimizations are conspiring against us.
     conf = {'spark.rapids.sql.batchSizeBytes': '128m'})

 @pytest.mark.parametrize('data_gen', [_longs_with_nulls], ids=idfn)
+@allow_non_gpu("ShuffleExchangeExec")
Contributor:

I don't understand why this is here. Do we not support shuffling a particular data type, yet support aggregations on it?

@razajafri (author):

Thanks for reviewing. I have added Arrays to ShuffleExchangeExec.

@razajafri requested a review from jlowe on Aug 23, 2022
@razajafri (author): build

@razajafri (author): build

Comment on lines 132 to 134:

def _grpkey_list_with_non_nested_children():
    return [[('a', RepeatSeqGen(ArrayGen(data_gen), length=3)),
             ('b', IntegerGen())] for data_gen in all_basic_gens + decimal_gens]
Contributor:

Why is this a function? It takes no parameters and isn't passed around as a function.
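A sketch of the refactor the reviewer appears to be suggesting: since the function takes no parameters, the same data can live in a module-level list (names reused from the snippet above; illustrative, not the committed fix):

# Module-level constant instead of a zero-argument function; it is built once
# at import time and can be used directly in @pytest.mark.parametrize.
_grpkey_list_with_non_nested_children = [
    [('a', RepeatSeqGen(ArrayGen(data_gen), length=3)),
     ('b', IntegerGen())]
    for data_gen in all_basic_gens + decimal_gens]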

Signed-off-by: Raza Jafri <[email protected]>
@razajafri (author): build

Signed-off-by: Raza Jafri <[email protected]>
@razajafri (author): build

@@ -3730,7 +3728,11 @@ object GpuOverrides extends Logging {
     // This needs to match what murmur3 supports.
     PartChecks(RepeatingParamCheck("hash_key",
       (TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL_128 +
-          TypeSig.STRUCT).nested(), TypeSig.all)),
+          TypeSig.STRUCT).nested() +
+          TypeSig.ARRAY.nested(
Collaborator:

I am not an expert on TypeSig; it just seems a bit weird that we are calling nested() and then nested(child type) for ARRAY. Why do we need to call both?

@razajafri (author):

I am not an expert on TypeSig either, and this was my misunderstanding. I thought the nested children were kept in separate buckets under their parents, but that's not true: we hold the initialTypes and childTypes as flat lists. I will fix this to avoid the confusion.

def test_hash_grpby_sum_count_action(data_gen):
    assert_gpu_and_cpu_row_counts_equal(
        lambda spark: gen_df(spark, data_gen, length=100).groupby('a').agg(f.sum('b'))
    )

@allow_non_gpu("ShuffleExchangeExec", "HashAggregateExec")
@pytest.mark.parametrize('data_gen', [_grpkey_nested_structs_with_array_child], ids=idfn)
def test_hash_grpby_sum_count_action_fallback(data_gen):
Contributor:

The commit comment says this is for testing shuffle exec fallback, but that's not really what this does. This is more about testing hash aggregate fallback (and it is arguably a duplicate of the existing test_hash_agg_with_struct_of_array_fallback). The shuffle is falling back because both sides are also falling back, and it is inefficient to shuffle on the GPU to/from CPU exec nodes.

To have a test more focused on shuffle, it should be in repart_test.py and use something like repartition to force a shuffle. See other tests in repart_test.py for examples. Make sure the test passes with supported types and falls back for unsupported types.
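A minimal sketch of the repartition-style test being suggested (a suggestion the reviewer refines below), assuming the integration-test helpers seen elsewhere in this thread (assert_gpu_and_cpu_are_equal_collect, unary_op_df, ArrayGen, StructGen, IntegerGen, idfn); the generator and partition count are illustrative:

# Force a shuffle via repartition and check GPU results match the CPU.
@pytest.mark.parametrize('data_gen',
                         [ArrayGen(StructGen([('child0', IntegerGen())]))],
                         ids=idfn)
def test_repartition_array_of_struct(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, data_gen).repartition(13))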

@razajafri (author):

So Spark is deciding not to shuffle this. I realized something was off right after pushing it: I was not seeing it hit the wrapPart method, which was confusing me. Thank you for clarifying why it is not replacing the ShuffleExec when it should.

Contributor:

> Thank you for clarifying why it is not replacing the ShuffleExec when it should.

But that's just it: it should not replace the shuffle even if it wanted to. Shuffling with array-of-struct as the partitioning key should not be supported by ShuffleExec, just as we cannot support grouping with array-of-struct. I was wrong above; repartition is not what we want here, because that doesn't involve partitioning on any particular column. We need to test partitioning on array-of-struct keys, which is not supported. I'm not sure we can easily test at the integration-test level that we are doing this properly, since I think we would need some way for Spark to plan a hash partition on array-of-struct into GPU operations that support it while the shuffle does not. But if we can do an operation that needs hash partitioning of array-of-struct (like groupby), then we should be able to hash partition it as well.
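To make the distinction concrete (hypothetical snippet; df and arr_col are placeholders): repartition(n) with no columns uses round-robin partitioning, so the key type is never hashed, while repartitioning on columns, or grouping by them, requires hash partitioning on those columns:

df.repartition(8)                # RoundRobinPartitioning: no partition key to hash
df.repartition(8, 'arr_col')     # HashPartitioning: must hash the array-of-struct key
df.groupBy('arr_col').count()    # also plans a hash partition on 'arr_col'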

@razajafri (author):

I have added an integration test. I modified a repart_test by adding array-of-structs as a type and then refactored that into a separate test. I know you mentioned that repartition won't work, but I wanted to run this by you to see if this does.

@jlowe previously approved these changes on Aug 29, 2022
@jlowe (Contributor) commented Aug 29, 2022:

build

@razajafri (author): build

@razajafri (author):

CI is failing on Databricks because ShuffleExchangeExec isn't found in the plan, possibly because Databricks optimizes the plan differently from vanilla Spark. I will see if I can write another test for Databricks and skip this test there.
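One way to express that skip, sketched on the assumption that the integration tests' spark_session module exposes an is_databricks_runtime() helper; the test name and reason string are illustrative:

import pytest
from spark_session import is_databricks_runtime

@pytest.mark.skipif(is_databricks_runtime(),
                    reason='Databricks plans this query without a ShuffleExchangeExec')
def test_shuffle_fallback_vanilla_spark_only():
    ...  # body elided; the point is the conditional skip on Databricks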

@razajafri (author): build

@razajafri (author): build

@razajafri (author):

@jlowe can you do the honors one more time?

@razajafri requested a review from jlowe on Sep 1, 2022
@razajafri merged commit 122e107 into NVIDIA:branch-22.10 on Sep 1, 2022
@razajafri deleted the SP-4656 branch on Sep 1, 2022
razajafri added a commit to razajafri/spark-rapids that referenced this pull request Oct 3, 2022
razajafri added a commit that referenced this pull request Oct 4, 2022
…#6679)

This reverts commit 122e107.

Signed-off-by: Raza Jafri <[email protected]>

Signed-off-by: Raza Jafri <[email protected]>
Co-authored-by: Raza Jafri <[email protected]>
abellina pushed a commit to abellina/spark-rapids that referenced this pull request Oct 5, 2022
…6066)" (NVIDIA#6679)

This reverts commit 122e107.

Signed-off-by: Raza Jafri <[email protected]>

Signed-off-by: Raza Jafri <[email protected]>
Co-authored-by: Raza Jafri <[email protected]>
razajafri added a commit to razajafri/spark-rapids that referenced this pull request Jan 6, 2023
razajafri added a commit that referenced this pull request Jul 13, 2023
* Revert "Revert "Add support for arrays in hashaggregate [databricks] (#6066)" (#6679)"

This reverts commit c05ac2d and adds tests 

* Add test for aggregation on array

* updated docs

---------

Signed-off-by: Raza Jafri <[email protected]>
Labels: feature request (New feature or request)
Projects: none yet
Development: successfully merging this pull request may close these issues:
  [FEA] Support Group-By on Array[String]
4 participants