-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2663] [SQL] Support the Grouping Set #1567
Conversation
QA tests have started for PR 1567. This patch merges cleanly. |
QA results for PR 1567: |
QA tests have started for PR 1567. This patch merges cleanly. |
QA results for PR 1567: |
ff919dc
to
0325be5
Compare
QA tests have started for PR 1567 at commit
|
QA tests have finished for PR 1567 at commit
|
retest this please. |
1 similar comment
retest this please. |
test this please |
QA tests have started for PR 1567 at commit
|
QA tests have finished for PR 1567 at commit
|
0325be5
to
bedb8da
Compare
test this please. |
QA tests have started for PR 1567 at commit
|
Test FAILed. |
QA tests have finished for PR 1567 at commit
|
Test FAILed. |
88b939e
to
49b4955
Compare
QA tests have started for PR 1567 at commit
|
QA tests have finished for PR 1567 at commit
|
Test PASSed. |
A short design doc would be nice. Just talk about the high level design and how it is implemented. Thanks. |
Yeah, please do post a design doc. Also, sorry for not reviewing this earlier. This will be a good feature to have. I did a quick pass and I have two high level concerns (though I did not look in much detail):
|
QA tests have started for PR 1567 at commit
|
QA tests have finished for PR 1567 at commit
|
Test PASSed. |
@rxin @marmbrus , I've uploaded an draft design doc in jira. https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf, sorry it doesn't cover every detail, let me know if you have any confusion.
Yeah, It's very reasonable, I was thinking of this either. Besides, the attribute reference pass down to the parent logical operator need to be correctly set in logical plan analyzing. Anyway, I will consider your suggestion, after all, we should keep the Logical Plan for "describing what to do", not "how to do".
A concrete |
76f474e
to
dbded19
Compare
object ResolveGroupingSet extends Rule[LogicalPlan] { | ||
/** | ||
* Extract attribute set according to the grouping id | ||
* @param bitmask bitmask to represent the validity of the attribute sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what valid
means here and elsewhere. Do you mean the bitmask indicates which attributes are selected
perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to change the wording then. invalid
sounds like something is broken.
Okay, this is looking really good / clean. Most of my comments are about documentation since this is a very complicated feature. |
89e37d8
to
3c1df19
Compare
Test build #24535 has started for PR 1567 at commit
|
Test build #24535 has finished for PR 1567 at commit
|
Test PASSed. |
@marmbrus I still have something need to be updated, I will let you know when it's ready. |
Cool, can you through |
3c1df19
to
3547056
Compare
Test build #24563 has started for PR 1567 at commit
|
Thank you @marmbrus , I've finished the updating, will add "WIP" next time. :) |
Test build #24563 has finished for PR 1567 at commit
|
Test PASSed. |
@marmbrus , any more comment on this? |
Thanks! Merged to master. |
What version of Spark will this be released under? Is it in 1.2? Is there a Jira to track this functionality that I could reference. Thanks so much for the work on this feature! |
@harakiro the jira ticket is in the title of the pull request: https://issues.apache.org/jira/browse/SPARK-2663 |
private[this] var idx = -1 // -1 means the initial state | ||
private[this] var input: Row = _ | ||
|
||
override final def hasNext = (-1 < idx && idx < groups.length) || iter.hasNext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi you probably want to move groups.length into a variable to avoid running this everytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e.
private[this] val groupLength = groups.length
and then just reference groupLength
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin, the groups
is in the type of Array
, not Seq
, probably it does not impact the performance a lot. Anyway, thank you for pointing out this, I can update that along with some other PR.
Add support for
GROUPING SETS
,ROLLUP
,CUBE
and the the virtual columnGROUPING__ID
.More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf
The generic idea of the implementations are :
1 Replace the
ROLLUP
,CUBE
withGROUPING SETS
2 Explode each of the input row, and then feed them to
Aggregate
GroupBy Expression List
, for each bit,1
means the expression is selected, otherwise0
(left is the lower bit, and right is the higher bit in theGroupBy Expression List
)Literal(null)
if it's not selected in the grouping set (based on the bit mask)Explode
ischild.output :+ grouping__id
Aggregate
isGroupBy Expression List :+ grouping__id
Aggregation expressions
the same for theAggregate
The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an
Explosive
operator added for Physical Plan, which will explode the rows according the pre-set projections.A known issue will be done in the follow up PR:
ColumnPruning
is not supported yet forExplosive
node.