
feat: add expand rel #368

Merged · 20 commits · merged Aug 10, 2023
Conversation

@JkSelf (Contributor Author) commented Nov 9, 2022

Adds an expand relation. This relation can be used to create near-duplicate
copies of each input row, based on templates describing how to create the
copies. It is used within Spark to implement operations such as aggregate
rollup and pivot longer.
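For a concrete picture of the template idea, here is a minimal illustrative sketch (not this PR's actual proto or API; the `expand` helper and the "keep" marker are hypothetical): each template either keeps a column's value or substitutes a replacement, and every input row yields one near-duplicate copy per template.

```python
# Illustrative sketch only (not part of this PR): each template either
# keeps a column's value or substitutes a replacement (e.g. None), and
# every input row yields one near-duplicate copy per template.
def expand(rows, templates):
    for row in rows:
        for template in templates:
            yield {
                col: (val if template.get(col) == "keep" else template[col])
                for col, val in row.items()
            }

rows = [{"country": "US", "region": "North America", "sales": 100}]
templates = [
    {"country": "keep", "region": "keep", "sales": "keep"},
    {"country": "keep", "region": None,   "sales": "keep"},
    {"country": None,   "region": None,   "sales": "keep"},
]
print(list(expand(rows, templates)))  # one input row -> three output rows
```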

@JkSelf (Contributor Author) commented Nov 9, 2022

@jacques-n @rui-mo Please help review. Thanks.

@CLAassistant commented Nov 9, 2022

CLA assistant check
All committers have signed the CLA.

@jacques-n (Contributor) commented:

Can you add site content that describes what expand would do? I'm not familiar with the operator.

@JkSelf (Contributor Author) commented Nov 16, 2022

@jacques-n Updated the description document.
Currently the Aggregate Rel output mainly consists of the grouping sets expressions and measures, while we need the grouping sets expressions, aggregate expressions, and group ids. So the Aggregate Rel may not meet the needs of Expand very well.

@JkSelf (Contributor Author) commented Nov 16, 2022

cc @FelixYBW @baibaichen

@JkSelf changed the title from "feat: Add expand Rel" to "feat: add expand rel in proto" on Dec 1, 2022
@github-actions commented
ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@JkSelf changed the title from "feat: add expand rel in proto" to "feat: add expand rel" on Mar 15, 2023
@JkSelf (Contributor Author) commented Mar 15, 2023

Just for reference: Spark added the Expand operator to support the GROUPING SETS, ROLLUP, and CUBE expressions in PR#1567. It applies all the GroupExpressions to every input row, producing multiple output rows per input row. To add the Expand operator in Gluten, we need to add ExpandRel support in the proto.
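For anyone who wants to see the operator, a minimal PySpark repro sketch (illustrative, not from this PR; assumes a local Spark session is available) that surfaces the Expand node in the physical plan:

```python
# Sketch (assumes PySpark is installed): ROLLUP makes Spark insert an
# Expand node into the physical plan, visible via explain().
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [("US", "North America", 100), ("France", "Europe", 400)],
    ["country", "region", "sales"],
)
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT country, region, SUM(sales) AS total "
    "FROM sales GROUP BY ROLLUP (country, region)"
).explain()  # the physical plan contains an Expand operator above the scan
```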

@baibaichen commented
@jacques-n @westonpace

The Expand operator was introduced to support cube, rollup, and grouping sets (see the PR in @JkSelf's comment). A naive way to support these is to scan the source input once per group and then union the results. Expand avoids such multiple scans by outputting multiple rows for each input row.

Furthermore, grouping sets are also used to optimize multiple distinct aggregations; see SPARK-9241 and CALCITE-732.

I did some research on translating Aggregate over Expand to a Substrait-style aggregate. It works for normal grouping sets, cube, and rollup thanks to their fixed patterns. However, Expand just outputs multiple rows and can be configured with different patterns (e.g. SPARK-9241); I failed to translate in such cases.

@westonpace (Member) commented:

I think I understand that EXPAND is a helper that is used by Spark to calculate an aggregate that contains grouping sets. I'm not quite sure I understand how it works. Does each copy get sent to a different output? Can you share a Spark physical plan that makes use of EXPAND?

Does it look something like this?

SELECT Country, Region, SUM(Sales) AS TotalSales
FROM Sales
GROUP BY ROLLUP (Country, Region);

Translates to...

SCAN -> EXPAND --- AGG(keys={country,region}) --- UNION
               \-- AGG(keys={country}) ---------/
               \-- AGG(keys={}) ---------------/

@baibaichen commented Mar 17, 2023

> SCAN -> EXPAND --- AGG(keys={country,region}) --- UNION
>                \-- AGG(keys={country}) ---------/
>                \-- AGG(keys={}) ---------------/

The physical plan looks like Scan -> Expand(output, projections) -> Aggregate(groupby, measure)

Expand:
  output = sales, Country, Region, grouping_id(Literal)  /* Output is the input schema for the next aggregate operator */
  projections = [                                        /* Projections are the actual inputs for the next aggregate operator */
     [sales, Country, Region, 0],
     [sales, Country, Null,   1],
     [sales, Null, Null,      6]
  ]

Aggregate:
  groupby = Country, Region, grouping_id
  measure = sum(sales)

By inserting nulls and the grouping_id, Spark eliminates the union.
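To make the mechanics concrete, a minimal Python sketch (the `apply_expand` helper is hypothetical, not Spark's code) of applying the projections above to each input row; strings read an input column, anything else is a literal standing in for the nulls and grouping_id:

```python
# Hypothetical sketch (not Spark's code): apply every projection to each
# input row; a string means "read that input column", anything else is a
# literal (None for NULL, ints for the grouping_id).
def apply_expand(rows, projections):
    out = []
    for row in rows:
        for projection in projections:
            out.append([
                row[expr] if isinstance(expr, str) else expr
                for expr in projection
            ])
    return out

# The projections from the plan above.
projections = [
    ["sales", "country", "region", 0],
    ["sales", "country", None,     1],
    ["sales", None,      None,     6],
]
rows = [{"sales": 100, "country": "US", "region": "North America"}]
print(apply_expand(rows, projections))
# [[100, 'US', 'North America', 0], [100, 'US', None, 1], [100, None, None, 6]]
```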

@westonpace (Member) commented:

Ok. I think I understand now. So, in the above example, if the input data was:

| Country | Region        | Sales |
|---------|---------------|-------|
| US      | North America | 100   |
| US      | North America | 200   |
| Canada  | North America | 300   |
| France  | Europe        | 400   |

Then the output data would be:

| Country | Region        | Sales |
|---------|---------------|-------|
| US      | North America | 100   |
| US      | NULL          | 100   |
| NULL    | NULL          | 100   |
| US      | North America | 200   |
| US      | NULL          | 200   |
| NULL    | NULL          | 200   |
| Canada  | North America | 300   |
| Canada  | NULL          | 300   |
| NULL    | NULL          | 300   |
| France  | Europe        | 400   |
| France  | NULL          | 400   |
| NULL    | NULL          | 400   |

@westonpace (Member) commented:

Is this correct?

@JkSelf (Contributor Author) commented Mar 17, 2023

@westonpace Yes. The output also contains the grouping_id column:

| Sales | Country | Region        | grouping_id |
|-------|---------|---------------|-------------|
| 100   | US      | North America | 0           |
| 100   | US      | NULL          | 1           |
| 100   | NULL    | NULL          | 3           |
| 200   | US      | North America | 0           |
| 200   | US      | NULL          | 1           |
| 200   | NULL    | NULL          | 3           |
| 300   | Canada  | North America | 0           |
| 300   | Canada  | NULL          | 1           |
| 300   | NULL    | NULL          | 3           |
| 400   | France  | Europe        | 0           |
| 400   | France  | NULL          | 1           |
| 400   | NULL    | NULL          | 3           |

@baibaichen commented
@JkSelf As we discussed offline, it's better to follow Spark's semantics to define ExpandRel.

@JkSelf (Contributor Author) commented Mar 20, 2023

@baibaichen @westonpace I have updated ExpandRel based on Spark's semantics. Please help review again.

// A list of expression groupings that the aggregation measures should be calculated for.
repeated Grouping groupings = 4;

message Grouping {


@baibaichen: Where is the ExpandRel output?

@JkSelf (Contributor Author):

Do we need to define the output of the physical operator? It seems the output is defined by the consumer.


@baibaichen: How does the Aggregate operator reference the output of Expand?

@JkSelf (Contributor Author):

@baibaichen Added the output in ExpandRel.

@westonpace (Member):

I disagree with this. Substrait relations do not typically include names for the output. See the ProjectRel for an example.

@JkSelf (Contributor Author):

One question: Spark generates different projections in Expand for grouping sets, rollup, and cube expressions. How do we handle these different cases without the output?

@westonpace (Member):

> My question is: adding a field can simplify the work of the consumer side, so why do we need to omit this field and make the consumer side more complicated? In addition, these assumptions may also be broken in the future.

I thought this would be very similar to how ProjectRel is handled. For example:

ProjectRel {
  expressions = [some_function(selection(3))]
}

The output's nullability will be determined by the function some_function. For example, if some_function is abs (absolute value) then the type and nullability will be the same as column 3. If some_function is is_null then the type will be bool and the output will be non-null.

In Expand I think this is a little harder. For example, if we have:

[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), literal(null, string())]
]

We have to make sure:

  • The output type must be the same for each position (e.g. field("country") and literal(null, string()) must have the same output type; field("region") and literal(null, string()) must have the same output type).
  • If any output type in a position is nullable then the output type must be nullable.

Putting this together the code might look like:

def calculate_output(groupings):
  output_columns = []
  # Use the first grouping to determine output types
  for expression in groupings[0]:
    output_columns.append(expression.output_type)
  # Make sure the rest of the groupings have the same output types
  # and check to see if they are nullable
  for grouping in groupings[1:]:
    for col_idx in range(len(grouping)):
      # Types must be the same, but nullability can differ
      if output_columns[col_idx].type != grouping[col_idx].output_type.type:
        raise Exception("All output types for a column must be the same")
      if grouping[col_idx].output_type.nullable:
        output_columns[col_idx].nullable = True
  return output_columns

Note: I think the above algorithm works for both "unreferenced columns" and "measure columns". You don't need to rely on assumptions, because the expression literal(null, string()) will always have a nullable output type, so the output for that column will always be nullable.
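For reference, a self-contained, runnable version of the algorithm above; the OutputType/Expr model is a stand-in assumption, not Substrait's actual type classes:

```python
# Runnable version of the sketch above. OutputType/Expr are hypothetical
# stand-ins for Substrait's type and expression classes.
from dataclasses import dataclass

@dataclass
class OutputType:
    kind: str        # the type class, e.g. "string"
    nullable: bool

@dataclass
class Expr:
    output_type: OutputType

def calculate_output(groupings):
    # Copy the first grouping's types so we can widen nullability later.
    output_columns = [
        OutputType(e.output_type.kind, e.output_type.nullable)
        for e in groupings[0]
    ]
    for grouping in groupings[1:]:
        for col_idx, expr in enumerate(grouping):
            # Type classes must match; nullability is OR-ed per position.
            if output_columns[col_idx].kind != expr.output_type.kind:
                raise ValueError("All output types for a column must be the same")
            if expr.output_type.nullable:
                output_columns[col_idx].nullable = True
    return output_columns

# ROLLUP(country, region): the null literals force both columns nullable.
field = lambda: Expr(OutputType("string", False))
null_lit = lambda: Expr(OutputType("string", True))
groupings = [
    [field(), field()],
    [field(), null_lit()],
    [null_lit(), null_lit()],
]
print(calculate_output(groupings))
# [OutputType(kind='string', nullable=True), OutputType(kind='string', nullable=True)]
```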

> why do we need to omit this field and make the consumer side more complicated?

The downsides I see are:

  1. It is harder to produce the plan. The producer now has to calculate the output.
  2. It is possible for a plan to be invalid if the output doesn't match the expression (e.g. the output is int but the expression return type is bool).
  3. The plan is larger than it needs to be.
  4. This is different than ProjectRel and other relations, where we expect the consumer to be able to calculate the output schema.

> One question: Spark generates different projections in Expand for grouping sets, rollup, and cube expressions. How do we handle these different cases without the output?

I think that is ok. If you have ROLLUP then you have:

[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), literal(null, string())]
]

If you have CUBE then you have:

[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), field("region")],
  [literal(null, string()), literal(null, string())]
]
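The producer-side difference between the two is mechanical. A hedged sketch (hypothetical helpers, not Spark's implementation) of generating these projection patterns:

```python
# Hypothetical sketch (not Spark's implementation): ROLLUP nulls out
# suffixes of the key list, CUBE nulls out every subset; everything else
# about the Expand projections is identical.
from itertools import combinations

def rollup_sets(keys):
    # Prefixes, longest first: (country, region), (country,), ()
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]

def cube_sets(keys):
    # All subsets, largest first.
    return [s for n in range(len(keys), -1, -1) for s in combinations(keys, n)]

def to_projection(keys, kept):
    # Keep the field where the key survives, a typed null literal otherwise.
    return [f'field("{k}")' if k in kept else "literal(null, string())"
            for k in keys]

keys = ["country", "region"]
print([to_projection(keys, s) for s in rollup_sets(keys)])  # 3 projections
print([to_projection(keys, s) for s in cube_sets(keys)])    # 4 projections
```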

@baibaichen:

@westonpace thanks.

> def calculate_output(groupings):
>   ...

1. We assume the type should be the same at the same position; that is true now, but I don't think it will always be true.
2. The con is that all consumers would need similar code.

> 4. This is different than ProjectRel and other relations where we expect the consumer to be able to calculate the output schema.

I agree with the 4th point. So let's go on with calculate_output until we find it can't meet the requirements.

@JkSelf (Contributor Author):

@westonpace Thanks for your detailed explanation. I removed the output type in ExpandRel. Please help review again.

@baibaichen commented Mar 30, 2023

> @westonpace Thanks for your detailed explanation. I removed the output type in ExpandRel. Please help review again.

@JkSelf Thank you

@JkSelf force-pushed the add-expandrel branch 2 times, most recently from ae42cc3 to a404d8c on March 21, 2023 09:23
@westonpace (Member) commented:

This will need a description in the markdown as well. I have created a PR into your branch with a suggestion.

@baibaichen commented
@westonpace ok to merge?

@westonpace previously approved these changes Mar 30, 2023

@westonpace (Member) left a comment:

I'm happy with where this is. I'll give @jacques-n a chance to comment further before merging.

@westonpace (Member) commented:

Also, this PR will need a rebase.

@JkSelf requested a review from cpcloud as a code owner on August 3, 2023 02:53
@JkSelf (Contributor Author) commented Aug 3, 2023

@westonpace Thanks for your review. I have updated the PR; can you help review again?

@westonpace previously approved these changes Aug 3, 2023

@westonpace (Member) left a comment:

One minor suggestion. @jacques-n can you take another look at this one now? I think the markdown / spec is consistent with the proto files now.

site/docs/relations/physical_relations.md (review suggestion, outdated and resolved)
JkSelf and others added 20 commits August 3, 2023 10:08
@jacques-n (Contributor) left a comment:

Looks good to me.

// each `switching_field` must have the same number of expressions
// all expressions within a switching field must be the same type class but can differ in nullability.
// this column will be nullable if any of the expressions are nullable.
repeated Expression duplicates = 1;
Member:

Is there any value in having a repeated inside a repeated? Seems like this should just be a single item (to be consistent with consistent_field above).

@westonpace (Member) commented Aug 22, 2023

The outer repeated (fields) is columns and the inner repeated (duplicates) is rows.

So if our goal is to take input:

| X | Y | Z |
|---|---|---|
| 1 | 2 | 3 |

and generate:

| X    | Y    | Z |
|------|------|---|
| 1    | 2    | 3 |
| 1    | NULL | 3 |
| NULL | NULL | 3 |

Then we need something like...

fields: [
  { duplicates: [ field(x), field(x), NULL ] },
  { duplicates: [ field(y), NULL, NULL ] },
  { consistent: field(z) }
]
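A minimal consumer-side sketch (hypothetical names, assuming the column-major encoding described above) that reproduces this example:

```python
# Hypothetical consumer sketch, assuming the column-major encoding above:
# `fields` is per output column; each switching field carries one
# expression per emitted copy, a consistent field carries exactly one.
def expand_row(row, fields):
    # Number of copies = length of any duplicates list.
    n_copies = next(len(exprs) for kind, exprs in fields
                    if kind == "duplicates")

    def evaluate(expr):
        # A string names an input column; None stands for a NULL literal.
        return row[expr] if expr is not None else None

    return [
        [evaluate(exprs[i]) if kind == "duplicates" else evaluate(exprs)
         for kind, exprs in fields]
        for i in range(n_copies)
    ]

fields = [
    ("duplicates", ["x", "x", None]),   # column X: 1, 1, NULL
    ("duplicates", ["y", None, None]),  # column Y: 2, NULL, NULL
    ("consistent", "z"),                # column Z: always 3
]
print(expand_row({"x": 1, "y": 2, "z": 3}, fields))
# [[1, 2, 3], [1, None, 3], [None, None, 3]]
```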

Member:

Thanks for the clarification. I've sent #546 to expand the documentation.
