[SPARK-2663] [SQL] Support the Grouping Set #1567

chenghao-intel · 2014-07-24T05:20:37Z

Add support for GROUPING SETS, ROLLUP, CUBE, and the virtual column GROUPING__ID.

More details on how to use `GROUPING SETS` can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf

The general idea of the implementation is:
1. Replace ROLLUP and CUBE with GROUPING SETS.
2. Explode each input row, then feed the exploded rows to Aggregate.
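
As a rough illustration of step 1, the sketch below expands ROLLUP and CUBE into explicit grouping sets. It is not the code from this patch: `GroupingSetsSketch`, `rollup`, and `cube` are hypothetical names, and expressions are modelled as plain values rather than Catalyst `Expression`s.

```scala
object GroupingSetsSketch {
  // ROLLUP(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a), ()):
  // every prefix of the expression list, from longest to shortest.
  def rollup[A](exprs: Seq[A]): Seq[Seq[A]] =
    (exprs.length to 0 by -1).map(n => exprs.take(n))

  // CUBE(a, b) is equivalent to GROUPING SETS ((), (a), (b), (a, b)):
  // every subset of the expression list (assumes the expressions are distinct).
  def cube[A](exprs: Seq[A]): Seq[Seq[A]] =
    (0 to exprs.length).flatMap(n => exprs.combinations(n)).toSeq
}
```

For n distinct expressions, `rollup` yields n + 1 grouping sets and `cube` yields 2^n.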

Each grouping set is represented as a bit mask over the GroupBy expression list: for each bit, 1 means the expression is selected and 0 means it is not (in the GroupBy expression list, the left side is the lower bit and the right side is the higher bit).
Several projections are constructed according to the grouping sets; within each projection (Seq[Expression]), an expression is replaced with Literal(null) if it is not selected in the grouping set (based on the bit mask).
The output schema of Explode is child.output :+ grouping__id.
The GroupBy expressions of Aggregate are the GroupBy expression list :+ grouping__id.
The aggregation expressions of Aggregate are kept unchanged.
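
The bit mask and projection scheme described above can be sketched roughly as follows. This is a simplified model, not the operator added by this patch: rows are plain `Seq[Any]`, group-by expressions are assumed to be simple column positions, the first group-by expression is assumed to map to bit 0, and the appended grouping__id is assumed to be the bit mask value itself. `ExpandSketch`, `project`, and `expand` are hypothetical names.

```scala
object ExpandSketch {
  type Row = Seq[Any]

  // Assumption: bit i of the mask corresponds to the i-th group-by expression.
  def isSelected(bitmask: Int, i: Int): Boolean = ((bitmask >> i) & 1) == 1

  // Build one projected row: group-by columns that are not selected by the
  // mask become null, and the mask is appended as the grouping__id column.
  def project(row: Row, groupByIdx: Seq[Int], bitmask: Int): Row = {
    val nulled = groupByIdx.zipWithIndex.collect {
      case (col, i) if !isSelected(bitmask, i) => col
    }.toSet
    row.zipWithIndex.map { case (v, col) => if (nulled(col)) null else v } :+ bitmask
  }

  // Explode every input row into one output row per grouping set.
  def expand(rows: Seq[Row], groupByIdx: Seq[Int], bitmasks: Seq[Int]): Seq[Row] =
    rows.flatMap(row => bitmasks.map(mask => project(row, groupByIdx, mask)))
}
```

Grouping the exploded rows by the group-by columns plus the appended id then keeps rows from different grouping sets apart, even when their nulled-out columns would otherwise collide.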

The expression substitutions happen during logical plan analysis, so we benefit from logical plan optimization (e.g. expression constant folding and map-side aggregation). Only an Explosive operator is added to the physical plan, which explodes the rows according to the pre-set projections.

A known issue, to be addressed in a follow-up PR:

The ColumnPruning optimization is not yet supported for the Explosive node.

SparkQA · 2014-07-24T05:23:26Z

QA tests have started for PR 1567. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17097/consoleFull

SparkQA · 2014-07-24T05:24:18Z

QA results for PR 1567:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class GroupingSet(bitmasks: Seq[Int],
case class Cube(groupByExprs: Seq[Expression],
case class Rollup(groupByExprs: Seq[Expression],
case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
case class GroupingSetExpansion(
case class GroupingSetExpansion(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17097/consoleFull

SparkQA · 2014-07-24T08:18:45Z

QA tests have started for PR 1567. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17114/consoleFull

SparkQA · 2014-07-24T09:58:20Z

QA results for PR 1567:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class GroupingSet(bitmasks: Seq[Int],
case class Cube(groupByExprs: Seq[Expression],
case class Rollup(groupByExprs: Seq[Expression],
case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
case class GroupingSetExpansion(
case class GroupingSetExpansion(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17114/consoleFull

SparkQA · 2014-09-03T13:09:25Z

QA tests have started for PR 1567 at commit 0325be5.

This patch merges cleanly.

SparkQA · 2014-09-03T14:44:44Z

QA tests have finished for PR 1567 at commit 0325be5.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupingSet(bitmasks: Seq[Int],
- case class Cube(groupByExprs: Seq[Expression],
- case class Rollup(groupByExprs: Seq[Expression],
- protected case class AttributeEquals(val a: Attribute)
- case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
- case class GroupingSetExpansion(
- case class GroupingSetExpansion(

chenghao-intel · 2014-09-04T00:37:47Z

retest this please.

chenghao-intel · 2014-09-04T04:41:57Z

retest this please.

chenghao-intel · 2014-09-04T08:55:34Z

test this please

SparkQA · 2014-09-04T08:59:35Z

QA tests have started for PR 1567 at commit 0325be5.

This patch merges cleanly.

SparkQA · 2014-09-04T10:36:52Z

QA tests have finished for PR 1567 at commit 0325be5.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupingSet(bitmasks: Seq[Int],
- case class Cube(groupByExprs: Seq[Expression],
- case class Rollup(groupByExprs: Seq[Expression],
- protected case class AttributeEquals(val a: Attribute)
- case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
- case class GroupingSetExpansion(
- case class GroupingSetExpansion(

chenghao-intel · 2014-10-20T08:49:41Z

test this please.

SparkQA · 2014-10-20T08:54:44Z

QA tests have started for PR 1567 at commit 88b939e.

This patch merges cleanly.

AmplabJenkins · 2014-10-20T09:02:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21910/
Test FAILed.

SparkQA · 2014-10-20T09:29:51Z

QA tests have finished for PR 1567 at commit 88b939e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupingSet(bitmasks: Seq[Int],
- case class Cube(groupByExprs: Seq[Expression],
- case class Rollup(groupByExprs: Seq[Expression],
- case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
- case class GroupingSetExpansion(
- case class GroupingSetExpansion(

AmplabJenkins · 2014-10-20T09:29:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21911/
Test FAILed.

SparkQA · 2014-10-21T00:40:05Z

QA tests have started for PR 1567 at commit 49b4955.

This patch merges cleanly.

chenghao-intel · 2014-10-21T00:42:13Z

Rebased, failed in CliSuite.

@marmbrus @rxin, not sure if you have time to review this. Sorry, it's a big PR. I can provide a rough design doc if you think that would be more helpful.

SparkQA · 2014-10-21T01:30:24Z

QA tests have finished for PR 1567 at commit 49b4955.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ReconnectWorker(masterUrl: String) extends DeployMessage
- case class GroupingSet(bitmasks: Seq[Int],
- case class Cube(groupByExprs: Seq[Expression],
- case class Rollup(groupByExprs: Seq[Expression],
- case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
- case class GroupingSetExpansion(
- case class GroupingSetExpansion(

AmplabJenkins · 2014-10-21T01:30:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21945/
Test PASSed.

rxin · 2014-10-21T17:41:19Z

A short design doc would be nice. Just talk about the high level design and how it is implemented. Thanks.

marmbrus · 2014-10-21T17:55:18Z

Yeah, please do post a design doc. Also, sorry for not reviewing this earlier. This will be a good feature to have.

I did a quick pass and I have two high level concerns (though I did not look in much detail):

The creation of bit vectors seems like a very implementation focused physical concern. I'm curious if this could be restricted to the actual physical operator.
Adding a new type of attribute reference for virtual columns might be a lot of overhead. Is this really necessary?

SparkQA · 2014-10-23T15:29:44Z

QA tests have started for PR 1567 at commit 76f474e.

This patch merges cleanly.

SparkQA · 2014-10-23T16:22:39Z

QA tests have finished for PR 1567 at commit 76f474e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupingSet(bitmasks: Seq[Int],
- case class Cube(groupByExprs: Seq[Expression],
- case class Rollup(groupByExprs: Seq[Expression],
- case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)
- case class GroupingSetExpansion(
- case class GroupingSetExpansion(

AmplabJenkins · 2014-10-23T16:22:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22073/
Test PASSed.

chenghao-intel · 2014-10-24T04:31:45Z

@rxin @marmbrus, I've uploaded a draft design doc to JIRA: https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf. Sorry it doesn't cover every detail; let me know if anything is unclear.

@marmbrus :

The creation of bit vectors seems like a very implementation focused physical concern. I'm curious if this could be restricted to the actual physical operator.

Yeah, that's very reasonable; I was thinking about this too.
However, the bit vectors don't rely on the physical execution engine, and this is slightly different from Aggregate, which has the map-side aggregation optimization for Spark execution.

Besides, the attribute references passed down to the parent logical operator need to be set correctly during logical plan analysis.

Anyway, I will consider your suggestion. After all, the logical plan should describe "what to do", not "how to do it".

Adding a new type of attribute reference for virtual columns might be a lot of overhead. Is this really necessary?

A concrete VirtualColumn instance is very helpful for attribute referencing and pattern matching, probably better than a naming convention. Sorry, maybe I didn't understand what you meant; we can discuss that in the code review.

marmbrus · 2014-12-17T02:05:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+  object ResolveGroupingSet extends Rule[LogicalPlan] {
+    /**
+     * Extract attribute set according to the grouping id
+     * @param bitmask bitmask to represent the validity of the attribute sequence


I'm not sure what valid means here and elsewhere. Do you mean the bitmask indicates which attributes are selected perhaps?

Yes, exactly.

I think it would be clearer to change the wording then. invalid sounds like something is broken.

marmbrus · 2014-12-17T02:20:10Z

Okay, this is looking really good / clean. Most of my comments are about documentation since this is a very complicated feature.

SparkQA · 2014-12-17T07:02:28Z

Test build #24535 has started for PR 1567 at commit 3c1df19.

This patch merges cleanly.

SparkQA · 2014-12-17T08:10:54Z

Test build #24535 has finished for PR 1567 at commit 3c1df19.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupExpression(children: Seq[Expression]) extends Expression
- case class Expand(
- trait GroupingAnalytics extends UnaryNode
- case class GroupingSets(
- case class Cube(
- case class Rollup(
- case class Expand(

AmplabJenkins · 2014-12-17T08:10:57Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24535/
Test PASSed.

chenghao-intel · 2014-12-17T13:46:53Z

@marmbrus I still have a few things that need to be updated; I will let you know when it's ready.

marmbrus · 2014-12-17T20:59:33Z

Cool, can you throw WIP in the title while it's being worked on?

SparkQA · 2014-12-18T01:12:26Z

Test build #24563 has started for PR 1567 at commit fe65fcc.

This patch merges cleanly.

chenghao-intel · 2014-12-18T01:13:57Z

Thank you @marmbrus, I've finished the update and will add "WIP" next time. :)
Can you review the code again?

SparkQA · 2014-12-18T02:23:01Z

Test build #24563 has finished for PR 1567 at commit fe65fcc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class GroupExpression(children: Seq[Expression]) extends Expression
- case class Expand(
- trait GroupingAnalytics extends UnaryNode
- case class GroupingSets(
- case class Cube(
- case class Rollup(
- case class Expand(

AmplabJenkins · 2014-12-18T02:23:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24563/
Test PASSed.

chenghao-intel · 2014-12-19T00:25:51Z

@marmbrus, any more comments on this?

marmbrus · 2014-12-19T02:58:42Z

Thanks! Merged to master.

harakiro · 2015-02-25T01:04:03Z

What version of Spark will this be released under? Is it in 1.2? Is there a Jira to track this functionality that I could reference? Thanks so much for the work on this feature!

rxin · 2015-02-25T01:05:52Z

@harakiro the Jira ticket is in the title of the pull request: https://issues.apache.org/jira/browse/SPARK-2663

rxin · 2015-02-25T01:51:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Expand.scala

+        private[this] var idx = -1  // -1 means the initial state
+        private[this] var input: Row = _
+
+        override final def hasNext = (-1 < idx && idx < groups.length) || iter.hasNext


FYI, you probably want to move groups.length into a variable to avoid evaluating it every time.

i.e.

private[this] val groupLength = groups.length

and then just reference groupLength

@rxin, groups is an Array, not a Seq, so it probably does not impact performance much. Anyway, thank you for pointing this out; I can update that along with some other PR.
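
For readers following this thread, here is a minimal, self-contained sketch of the iterator pattern under discussion, with the group count cached in a val as suggested. It is an illustration only, not the code in Expand.scala: `ExpandIteratorSketch` and `expandIterator` are hypothetical names, projections are modelled as plain functions, and the initial state is encoded as `idx = groupLength` rather than -1.

```scala
object ExpandIteratorSketch {
  // Emits groups.length output rows for every input row, in input order.
  def expandIterator[A, B](iter: Iterator[A], groups: Array[A => B]): Iterator[B] =
    new Iterator[B] {
      private val groupLength = groups.length  // looked up once, then reused
      private var idx = groupLength            // "exhausted": next() must pull a new input
      private var input: Option[A] = None

      override def hasNext: Boolean =
        idx < groupLength || (groupLength > 0 && iter.hasNext)

      override def next(): B = {
        if (idx >= groupLength) {
          input = Some(iter.next())  // throws NoSuchElementException when the input is drained
          idx = 0
        }
        val out = groups(idx)(input.get)
        idx += 1
        out
      }
    }
}
```

Whether caching the length is measurably faster for an Array is not settled by this sketch; it only shows where the cached value would be used.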

chenghao-intel force-pushed the grouping_sets branch from ff919dc to 0325be5 Compare September 3, 2014 13:02

chenghao-intel force-pushed the grouping_sets branch from 0325be5 to bedb8da Compare October 20, 2014 08:43

chenghao-intel force-pushed the grouping_sets branch from 88b939e to 49b4955 Compare October 21, 2014 00:35

chenghao-intel force-pushed the grouping_sets branch from 76f474e to dbded19 Compare November 4, 2014 06:31

marmbrus reviewed Dec 17, 2014

chenghao-intel force-pushed the grouping_sets branch from 89e37d8 to 3c1df19 Compare December 17, 2014 06:58

chenghao-intel added 5 commits December 17, 2014 16:58

Support Rollup/Cube/GroupingSets

ec276c6

revert the unnecessary changes

414b165

Add GroupingExpression to replace the Seq[Expression]

d23c672

update code as feedbacks

a7c869d

Add more doc and Simplify the Expand

3547056

chenghao-intel force-pushed the grouping_sets branch from 3c1df19 to 3547056 Compare December 18, 2014 01:08

Remove the extra space

fe65fcc

asfgit closed this in f728e0f Dec 19, 2014

rxin reviewed Feb 25, 2015

chenghao-intel deleted the grouping_sets branch July 2, 2015 08:41

JkSelf mentioned this pull request Nov 16, 2022

feat: add expand rel substrait-io/substrait#368

Merged
