[WIP] Support GpuCollectList/GpuCollectSet in groupBy aggregation #2804
Conversation
Signed-off-by: sperlingxx <[email protected]>
@@ -371,6 +374,107 @@ def test_hash_reduction_pivot_without_nans(data_gen, conf):
        .agg(f.sum('c')),
        conf=conf)
_repeat_agg_column_for_collect_op = [
Why "hash_aggregate"? In cudf, collect list/set are only sort_aggregages.
@ttnghia This test is not named after how cudf implements something. What is more, cudf hides that detail from users, so to any test outside of cudf itself it is a hidden implementation detail. The test is named after how Spark implements the operator, and we are trying to get coverage on that operator.

That said, CollectList and CollectSet apparently show up as ObjectHashAggregateExec, and there are also some SortAggregateExec tests in here too. At some point we should either split this up into separate files or rename it.
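For reference, a minimal sketch (the CollectPlanDemo object and sample data are invented for illustration) of why the coverage lands on ObjectHashAggregateExec: Spark plans CollectList/CollectSet, being TypedImperativeAggregate functions, through that exec by default.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

object CollectPlanDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("collect-demo").getOrCreate()
  import spark.implicits._

  // collect_list/collect_set are TypedImperativeAggregate functions, so with
  // spark.sql.execution.useObjectHashAggregateExec=true (the default) Spark
  // plans them through ObjectHashAggregateExec rather than HashAggregateExec.
  val df = Seq((1, "a"), (1, "b"), (1, "a"), (2, "c")).toDF("k", "v")
  df.groupBy($"k")
    .agg(collect_list($"v"), collect_set($"v"))
    .explain() // ObjectHashAggregateExec should appear in the printed plan

  spark.stop()
}
```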
  throw new UnsupportedOperationException("CollectSet is not yet supported in reduction")
override lazy val mergeReductionAggregate: cudf.ColumnVector => cudf.Scalar =
  throw new UnsupportedOperationException("CollectSet is not yet supported in reduction")
override lazy val updateAggregate: Aggregation = Aggregation.collectSet()
Hmm, is there any way to call Aggregation.collectList if the data is partitioned into more than one batch? We only need lists (which may contain duplicates) for the intermediate results. Calling collectSet to generate the intermediate results is expensive, as it involves unnecessarily executing drop_list_duplicates on the temporary lists.
This is a lot of change, and I think we need to look into the proper way to deal with ObjectHashAggregate. It feels kind of hacked together, and I want to get a real design put in place before we push this in.
(TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT).nested(),
(TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.BINARY +
    TypeSig.STRUCT).nested()
  .withPsNote(TypeEnum.BINARY, "Marking BINARY as plugin-supported is only to " +
What exactly does this mean and how does this help an end user?
  .withPsNote(TypeEnum.STRUCT, "Round-robin partitioning is not supported for nested " +
    s"structs if ${SQLConf.SORT_BEFORE_REPARTITION.key} is true")
  .withPsNote(TypeEnum.ARRAY, "Round-robin partitioning is not supported if " +
    s"${SQLConf.SORT_BEFORE_REPARTITION.key} is true")
  .withPsNote(TypeEnum.MAP, "Round-robin partitioning is not supported if " +
    s"${SQLConf.SORT_BEFORE_REPARTITION.key} is true"),
    s"${SQLConf.SORT_BEFORE_REPARTITION.key} is true")
  .withPsNote(TypeEnum.BINARY, "Marking BINARY as plugin-supported is only to " +
Why do we need a restriction on this?
@@ -3004,6 +3034,19 @@ object GpuOverrides {
      .withPsNote(TypeEnum.STRUCT, "not allowed for grouping expressions"),
    TypeSig.all),
  (agg, conf, p, r) => new GpuHashAggregateMeta(agg, conf, p, r)),
exec[ObjectHashAggregateExec](
  "The backend for hash based aggregations",
I think we need a clearer description to distinguish it from the regular HashAggregateExec.
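As a purely hypothetical example of more distinguishing wording (not text from this PR), the description string could call out the typed-imperative aspect:

```scala
// Sketch only: a candidate description distinguishing this exec from the
// regular HashAggregateExec registration above.
val objectHashAggregateDesc: String =
  "The backend for hash based aggregations that use TypedImperativeAggregate " +
    "functions (e.g. collect_list and collect_set), which Spark plans as " +
    "ObjectHashAggregateExec instead of HashAggregateExec"
```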
def filterNonAggBufBinaryExpressions(expressions: Seq[Expression]): Seq[Expression] = {
  expressions.filter {
    case AttributeReference("buf", BinaryType, _, _) => false
What makes "buf" special? This is not good enough.
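One possible direction (a sketch, assuming the caller can supply the TypedImperativeAggregate functions in play; filterTypedImperativeBuffers is a hypothetical helper) is to match the declared buffer attributes by exprId rather than by the hard-coded name "buf":

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression}
import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate

// Sketch: drop the serialized (BinaryType) buffers declared by the given
// TypedImperativeAggregate functions, matching by exprId rather than by
// the attribute's name.
def filterTypedImperativeBuffers(
    expressions: Seq[Expression],
    aggregates: Seq[TypedImperativeAggregate[_]]): Seq[Expression] = {
  // The buffer attributes actually declared by the aggregate functions.
  val bufferAttrs = AttributeSet(aggregates.flatMap(_.aggBufferAttributes))
  expressions.filterNot {
    case a: Attribute => bufferAttrs.contains(a)
    case _ => false
  }
}
```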
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0)
  extends GpuCollectBase[CollectSetAggregation] {

  override lazy val updateExpressions: Seq[GpuExpression] = new CudfCollectSet(inputBuf) :: Nil

  override lazy val mergeExpressions: Seq[GpuExpression] = new CudfMergeSets(outputBuf) :: Nil
I'm thinking of totally removing CudfCollectSet and CudfMergeSets. We just call CudfCollectList in updateExpressions, and CudfMergeLists in mergeExpressions. During evaluateExpression, we call CudfDropListDuplicates on the merged lists.
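A minimal sketch of that idea (the trait and method shapes are hypothetical; it assumes the cudf Java bindings expose Aggregation.collectList(), Aggregation.mergeLists(), and a dropListDuplicates column method, as referenced in this thread):

```scala
import ai.rapids.cudf

// Sketch: implement collect_set via list aggregations, deduplicating only once.
trait CollectSetViaListsSketch {
  // Update: gather raw per-group lists; duplicates are fine in the buffer.
  lazy val updateAggregate: cudf.Aggregation = cudf.Aggregation.collectList()

  // Merge: concatenate partial lists across batches, still without dedup.
  lazy val mergeAggregate: cudf.Aggregation = cudf.Aggregation.mergeLists()

  // Evaluate: deduplicate exactly once, on the fully merged lists.
  def evaluateExpression(merged: cudf.ColumnVector): cudf.ColumnVector =
    merged.dropListDuplicates()
}
```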
Is this superseded by #2971?
Yes, I closed it.
Current PR is to support GpuCollectList and GpuCollectSet in groupBy aggregation. CollectList and CollectSet are the first two TypedImperativeAggregate functions for which we attempt to provide GPU support. Unlike DeclarativeAggregate and other ImperativeAggregate implementations (such as PivotFirst), TypedImperativeAggregate creates aggBufferAttributes of BinaryType (for serialization). Meanwhile, the GPU counterparts store aggBufferAttributes simply as the data type of the update expressions (Array[elementType] for the collect ops). Therefore, this PR involves three additional tasks:

1. Add GpuObjectHashAggregateMeta.convertToGpu. Although ObjectHashAggregateExec doesn't cover all cases of TypedImperativeAggregate, it covers the most common usages, and we can support other situations later (for instance, an aggregation containing multiple distincts, which is rewritten by the optimizer).
2. Mark TypeSig.BINARY as plugin-supported for ObjectHashAggregateExec, HashPartitioning and ShuffleExchangeExec, and check the validity of BinaryType columns in the tagForGpu methods.
3. Modify GpuHashAggregateExec.setupReferences to make it work with TypedImperativeAggregate in PartialMerge mode.

I labeled this PR as WIP, since I am not quite confident whether the above approaches are appropriate or not.
to make it work with TypedImperativeAggregate in PartialMerge mode.I labeled this PR as WIP, since I am not quite confident whether the above approaches are appropriate or not.