Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve: support combining multiple grouping expressions #5559

Merged
merged 2 commits into from
Mar 21, 2023

Conversation

yukkit
Copy link
Contributor

@yukkit yukkit commented Mar 12, 2023

Which issue does this PR close?

Closes #5361 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Mar 12, 2023
@yukkit yukkit marked this pull request as draft March 12, 2023 06:17
@yukkit yukkit force-pushed the multiple-group-expr branch from 8d4b7d3 to 29b7dbf Compare March 12, 2023 12:23
@yukkit yukkit marked this pull request as ready for review March 12, 2023 13:11
@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

I plan to review this PR tomorrow. Thank you @yukkit

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me -- thank you @yukkit

the only thing I think is needed is an error condition check (as in try to pass in 100 columns in a grouping set). If this is unreasonable (or will take too long) I think it should generate a runtime error to protect DataFusion

v
}

/// Convert multiple multiple grouping expressions into one [`GroupingSet::GroupingSets`],\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Convert multiple multiple grouping expressions into one [`GroupingSet::GroupingSets`],\
/// Convert multiple grouping expressions into one [`GroupingSet::GroupingSets`],\

@@ -69,6 +69,94 @@ pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> {
}
}

fn powerset<T>(slice: &[T]) -> Vec<Vec<&T>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a docstring explaining what the expected output of this function is intended to be?

For example, is it https://en.wikipedia.org/wiki/Power_set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a docstring explaining what the expected output of this function is intended to be?

For example, is it https://en.wikipedia.org/wiki/Power_set?

I will add comments to it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a power set implementation in itertools, did you consider using it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yukkit can you please answer @ozankabak 's question?

@@ -69,6 +69,94 @@ pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> {
}
}

fn powerset<T>(slice: &[T]) -> Vec<Vec<&T>> {
let mut v = Vec::new();
for mask in 0..(1 << slice.len()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should verify that that slice.len() is less than 64 and return an error to avoid overflow / panic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are so right, thanks for the reminder,i will check it


#[test]
fn test_enumerate_grouping_sets() {
let milti_cols = vec![col("col1"), col("col2"), col("col3")];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo here milti --> multi

@yukkit
Copy link
Contributor Author

yukkit commented Mar 16, 2023

This looks good to me -- thank you @yukkit

the only thing I think is needed is an error condition check (as in try to pass in 100 columns in a grouping set). If this is unreasonable (or will take too long) I think it should generate a runtime error to protect DataFusion

@alamb FWIW. DataFusion is a fundamental query engine designed for inheritance by other products. I believe that the responsibility for implementing any necessary restrictions should lie with those who integrate it, as they have a better understanding of the customers and application scenarios. Therefore, we can afford to set a relatively high security threshold, as long as the process remains stable. I noticed that in Doris, the maximum number of distinct sets in the grouping sets clause is 64, while in PostgreSQL, it is 4096, and in DuckDB, it is 65535. Additionally, the maximum number of group expressions in the grouping_set clause is 65535 in DuckDB. I would appreciate your feedback on this.

@yukkit yukkit force-pushed the multiple-group-expr branch from 29b7dbf to cc23be6 Compare March 16, 2023 10:59
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @yukkit -- thank you

@alamb FWIW. DataFusion is a fundamental query engine designed for inheritance by other products. I believe that the responsibility for implementing any necessary restrictions should lie with those who integrate it, as they have a better understanding of the customers and application scenarios. Therefore, we can afford to set a relatively high security threshold, as long as the process remains stable. I noticed that in Doris, the maximum number of distinct sets in the grouping sets clause is 64, while in PostgreSQL, it is 4096, and in DuckDB, it is 65535. Additionally, the maximum number of group expressions in the grouping_set clause is 65535 in DuckDB. I would appreciate your feedback on this.

I agree with the assessment that DataFusion should not try and artificially limit the size of the grouping sets (unless the code can't yet handle them). Thank you for your point

@alamb
Copy link
Contributor

alamb commented Mar 16, 2023

cc @ozankabak

@ozankabak
Copy link
Contributor

LGTM in general, left one inline comment.

@alamb
Copy link
Contributor

alamb commented Mar 20, 2023

This PR also seems to have accumulated a merge conflict -- @yukkit could you please resolve it?

@yukkit yukkit force-pushed the multiple-group-expr branch from cc23be6 to 35b294c Compare March 21, 2023 02:25
@yukkit
Copy link
Contributor Author

yukkit commented Mar 21, 2023

This PR also seems to have accumulated a merge conflict -- @yukkit could you please resolve it?

@alamb Okay, I've resolved the conflict

@mingmwang
Copy link
Contributor

LGTM. I think this PR is ready to merge.

return Ok(group_expr);
}
// only process mix grouping sets
let partial_sets = group_expr
Copy link
Contributor

@mingmwang mingmwang Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we should also process the non-mixed case, so that all the grouping expressions are expanded to GroupingSets, no Cube and Rollup in the logical plan, During the physical planing, then only need to deal with GroupingSets.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2023

This PR had somehow accumulated another conflict, so I took the liberty of merging / resolving it again and pushing to your branch

While there are additional improvements, such as described by @mingmwang on #5559 (comment), I think this PR is better than what is on main and thus I plan to merge it when the CI has passed

@alamb alamb merged commit 68e3040 into apache:main Mar 21, 2023
@mingmwang
Copy link
Contributor

This PR had somehow accumulated another conflict, so I took the liberty of merging / resolving it again and pushing to your branch

While there are additional improvements, such as described by @mingmwang on #5559 (comment), I think this PR is better than what is on main and thus I plan to merge it when the CI has passed

I am working on the following improvement now.

@mingmwang
Copy link
Contributor

I plan to move the enumeration logic to the new Analyzer and keep the constructor logic in the Aggregate simple
and make the constructors(try_new(), try_new_with_schema()) be consistent.

There are some other grouping expressions related open tickets and I will also address them in the following PR.

#5647
#5672

@yukkit
Copy link
Contributor Author

yukkit commented Mar 22, 2023

I plan to move the enumeration logic to the new Analyzer and keep the constructor logic in the Aggregate simple and make the constructors(try_new(), try_new_with_schema()) be consistent.

There are some other grouping expressions related open tickets and I will also address them in the following PR.

#5647 #5672

i agree with your proposal

@andygrove andygrove added the enhancement New feature or request label Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate enhancement New feature or request logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support combinations of grouping set and normal group by exprs
5 participants