-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve: support combining multiple grouping expressions #5559
Conversation
8d4b7d3
to
29b7dbf
Compare
I plan to review this PR tomorrow. Thank you @yukkit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me -- thank you @yukkit
the only thing I think is needed is an error condition check (as in try to pass in 100 columns in a grouping set). If this is unreasonable (or will take too long) I think it should generate a runtime error to protect DataFusion
datafusion/expr/src/utils.rs
Outdated
v | ||
} | ||
|
||
/// Convert multiple multiple grouping expressions into one [`GroupingSet::GroupingSets`],\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Convert multiple multiple grouping expressions into one [`GroupingSet::GroupingSets`],\ | |
/// Convert multiple grouping expressions into one [`GroupingSet::GroupingSets`],\ |
datafusion/expr/src/utils.rs
Outdated
@@ -69,6 +69,94 @@ pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> { | |||
} | |||
} | |||
|
|||
fn powerset<T>(slice: &[T]) -> Vec<Vec<&T>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a docstring explaining what the expected output of this function is intended to be?
For example, is it https://en.wikipedia.org/wiki/Power_set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a docstring explaining what the expected output of this function is intended to be?
For example, is it https://en.wikipedia.org/wiki/Power_set?
I will add comments to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a power set implementation in itertools, did you consider using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yukkit can you please answer @ozankabak 's question?
@@ -69,6 +69,94 @@ pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> { | |||
} | |||
} | |||
|
|||
fn powerset<T>(slice: &[T]) -> Vec<Vec<&T>> { | |||
let mut v = Vec::new(); | |||
for mask in 0..(1 << slice.len()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should verify that that slice.len() is less than 64 and return an error to avoid overflow / panic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are so right, thanks for the reminder,i will check it
datafusion/expr/src/utils.rs
Outdated
|
||
#[test] | ||
fn test_enumerate_grouping_sets() { | ||
let milti_cols = vec![col("col1"), col("col2"), col("col3")]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo here milti
--> multi
@alamb FWIW. DataFusion is a fundamental query engine designed for inheritance by other products. I believe that the responsibility for implementing any necessary restrictions should lie with those who integrate it, as they have a better understanding of the customers and application scenarios. Therefore, we can afford to set a relatively high security threshold, as long as the process remains stable. I noticed that in Doris, the maximum number of distinct sets in the grouping sets clause is 64, while in PostgreSQL, it is 4096, and in DuckDB, it is 65535. Additionally, the maximum number of group expressions in the grouping_set clause is 65535 in DuckDB. I would appreciate your feedback on this. |
29b7dbf
to
cc23be6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @yukkit -- thank you
@alamb FWIW. DataFusion is a fundamental query engine designed for inheritance by other products. I believe that the responsibility for implementing any necessary restrictions should lie with those who integrate it, as they have a better understanding of the customers and application scenarios. Therefore, we can afford to set a relatively high security threshold, as long as the process remains stable. I noticed that in Doris, the maximum number of distinct sets in the grouping sets clause is 64, while in PostgreSQL, it is 4096, and in DuckDB, it is 65535. Additionally, the maximum number of group expressions in the grouping_set clause is 65535 in DuckDB. I would appreciate your feedback on this.
I agree with the assessment that DataFusion should not try and artificially limit the size of the grouping sets (unless the code can't yet handle them). Thank you for your point
cc @ozankabak |
LGTM in general, left one inline comment. |
This PR also seems to have accumulated a merge conflict -- @yukkit could you please resolve it? |
cc23be6
to
35b294c
Compare
LGTM. I think this PR is ready to merge. |
return Ok(group_expr); | ||
} | ||
// only process mix grouping sets | ||
let partial_sets = group_expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we should also process the non-mixed case, so that all the grouping expressions are expanded to GroupingSets
, no Cube
and Rollup
in the logical plan, During the physical planing, then only need to deal with GroupingSets.
This PR had somehow accumulated another conflict, so I took the liberty of merging / resolving it again and pushing to your branch While there are additional improvements, such as described by @mingmwang on #5559 (comment), I think this PR is better than what is on |
I am working on the following improvement now. |
I plan to move the enumeration logic to the new There are some other grouping expressions related open tickets and I will also address them in the following PR. |
i agree with your proposal |
Which issue does this PR close?
Closes #5361 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?