feat: add expand rel #368
Conversation
@jacques-n @rui-mo Please help review. Thanks.
Can you add side content that describes what expand would do? I'm not familiar with the operator.
@jacques-n Updated the description document.
ACTION NEEDED: Substrait follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Just for reference.
Furthermore, grouping sets are also used to optimize multiple distinct aggregations; see SPARK-9241 and CALCITE-732. I did some research on translating them.
I think I understand. Does it look something like this: a `GROUP BY GROUPING SETS` query translates to an `Expand` that duplicates each input row once per grouping set, feeding an `Aggregate`?
The physical plan looks like Spark's: an `Expand` node is inserted below the aggregation. By inserting `Expand`, each input row is duplicated once per grouping set.
Ok. I think I understand now. So, in the above example, if the input data was:
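Say, as a hedged stand-in (reusing the X/Y/Z example that appears later in this thread):

| X | Y | Z |
|---|---|---|
| 1 | 2 | 3 |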
Then the output data would be:
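Continuing the hedged example, with rollup-style groupings over X and Y:

| X | Y | Z |
|---|---|---|
| 1 | 2 | 3 |
| 1 | NULL | 3 |
| NULL | NULL | 3 |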
Is this correct?
@westonpace Yes. The output also contains the `group_id` columns.
@JkSelf As we discussed offline, it's better to follow Spark's semantics to define `ExpandRel`.
@baibaichen @westonpace I have already updated it.
proto/substrait/algebra.proto (outdated diff)

```proto
// A list of expression groupings that the aggregation measures should be calculated for.
repeated Grouping groupings = 4;

message Grouping {
```
Where is `ExpandRel`'s output?
Do we need to define the output of the physical operator? It seems the output is defined by the consumer.
How does the `Aggregate` operator reference the output of `Expand`?
@baibaichen Added the `output` in `ExpandRel`.
I disagree with this. Substrait relations do not typically include names for the output. See the `ProjectRel` for an example.
One question: Spark generates different projections in `Expand` for `grouping sets`, `rollup`, and `cube` expressions. How do we handle these different cases without the output?
My question is: adding this field can simplify the work of the consumer side, so why do we need to omit this field and make the consumer side more complicated? In addition, these assumptions may also be broken in the future.
I thought this would be very similar to how `ProjectRel` is handled. For example:

```
ProjectRel {
  expressions = [some_function(selection(3))]
}
```
The output's nullability will be determined by the function `some_function`. For example, if `some_function` is `abs` (absolute value) then the type and nullability will be the same as column 3. If `some_function` is `is_null` then the type will be `bool` and the output will be non-null.
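In other words (a minimal sketch, assuming each expression object exposes an `output_type` that already carries the derived type and nullability, as in the pseudocode further down):

```python
def project_output(expressions):
    # A ProjectRel's output schema is just each expression's derived type;
    # nullability comes along with the type (e.g. is_null -> non-null bool).
    return [e.output_type for e in expressions]
```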
In `Expand` I think this is a little harder. For example, if we have:

```
[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), literal(null, string())]
]
```
We have to make sure:

- The output type must be the same for each position (e.g. `field("country")` and `literal(null, string())` must have the same output type; `field("region")` and `literal(null, string())` must have the same output type).
- If any output type in a position is nullable then the output type for that column must be nullable.
Putting this together, the code might look like:

```python
def calculate_output(groupings):
    output_columns = []
    # Use the first grouping to determine the output types
    for expression in groupings[0]:
        output_columns.append(expression.output_type)
    # Make sure the rest of the groupings have the same output types
    # and check to see if they are nullable
    for grouping in groupings[1:]:
        for col_idx in range(len(grouping)):
            # Types must be the same but nullability can differ
            if output_columns[col_idx].type != grouping[col_idx].output_type.type:
                raise Exception("All output types for a column must be the same")
            if grouping[col_idx].output_type.nullable:
                output_columns[col_idx].nullable = True
    return output_columns
```
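For the three groupings above, this would return two `string` columns, both nullable, since each position contains at least one `literal(null, string())`.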
Note: I think the above algorithm works for both "unreferenced columns" and "measure columns". You don't need to rely on assumptions because the expression `literal(null, string())` will always have a nullable output type, so the output for that column will always be nullable.
> why do we need to omit this field and make the consumer side more complicated?
The downsides I see are:

1. It is harder to produce the plan. The producer now has to calculate the output.
2. It is possible for a plan to be invalid if the output doesn't match the expression (e.g. the output is `int` but the expression return type is `bool`).
3. The plan is larger than it needs to be.
4. This is different than `ProjectRel` and other relations, where we expect the consumer to be able to calculate the output schema.
> One question: Spark generates different projections in `Expand` for `grouping sets`, `rollup`, and `cube` expressions. How do we handle these different cases without the output?
I think that is ok. If you have `ROLLUP` then you have:

```
[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), literal(null, string())]
]
```
If you have `CUBE` then you have:

```
[
  [field("country"), field("region")],
  [field("country"), literal(null, string())],
  [literal(null, string()), field("region")],
  [literal(null, string()), literal(null, string())]
]
```
@westonpace thanks.
> `def calculate_output(groupings): ...`
- We assume the type should be the same at the same position; that is true now, but I don't think it will always be true.
- The con is that all consumers will need similar code.
> This is different than `ProjectRel` and other relations, where we expect the consumer to be able to calculate the output schema.
I agree with the 4th point, so let's go with `calculate_output` until we find it can't meet the requirements.
@westonpace Thanks for your detailed explanation. I removed the `output` type in `ExpandRel`. Please help review again.
@JkSelf Thank you
Force-pushed from ae42cc3 to a404d8c.
This will need a description in the markdown as well. I have created a PR into your branch with a suggestion.
@westonpace ok to merge?
I'm happy with where this is. I'll give @jacques-n a chance to comment further before merging.
Also, this PR will need a rebase.
@westonpace Thanks for your review. I have updated; can you help review again?
One minor suggestion. @jacques-n can you take another look at this one now? I think the markdown / spec is consistent with the proto files now.
Co-authored-by: Weston Pace <[email protected]>
looks good to me.
```proto
// each `switching_field` must have the same number of expressions
// all expressions within a switching field must be the same type class but can differ in nullability.
// this column will be nullable if any of the expressions are nullable.
repeated Expression duplicates = 1;
```
Is there any value in having a `repeated` inside a `repeated`? Seems like this should just be a single item (to be consistent with `consistent_field` above).
The outer repeated (`fields`) is columns and the inner repeated (`duplicates`) is rows.
So if our goal is to take input:
| X | Y | Z |
|---|---|---|
| 1 | 2 | 3 |
and generate:
| X | Y | Z |
|---|---|---|
| 1 | 2 | 3 |
| 1 | NULL | 3 |
| NULL | NULL | 3 |
Then we need something like...
```
fields: [
  { duplicates: [ field(x), field(x), NULL ] },
  { duplicates: [ field(y), NULL, NULL ] },
  { consistent: field(z) }
]
```
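For what it's worth, a hedged sketch (not the actual spec or any consumer's API) of how a consumer could evaluate this shape, assuming each expression is a callable over the input row:

```python
def expand(rows, fields):
    # Hedged sketch: the field/expression shapes here are assumptions.
    # Every `duplicates` list has the same length: one entry per output copy.
    n_copies = next(len(f["duplicates"]) for f in fields if "duplicates" in f)
    for row in rows:
        for i in range(n_copies):
            yield tuple(
                f["consistent"](row) if "consistent" in f
                else f["duplicates"][i](row)
                for f in fields
            )
```

With the `fields` above, the input row `(1, 2, 3)` yields `(1, 2, 3)`, `(1, None, 3)`, and `(None, None, 3)`.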
Thanks for the clarification. I've sent #546 to expand the documentation.
Adds an expand relation. This relation can be used to create near-duplicate copies of each input row based on templates describing how to create the copies. This is used within Spark to implement certain operations like aggregate rollup and pivot-longer.