-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for utilization complex aggregate exprs inside group by #8107
Add support for utilization complex aggregate exprs inside group by #8107
Conversation
@alamb, this should solve the test that fails on your end. Can you please confirm? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @mustafasrepo -- I am a little confused about the algorithm (I left some questions)
@alamb, this should solve the test that fails on your end. Can you please confirm?
I tried @ozankabak and unfortunately it does not solve our issue. I will work tomorrow on getting a datafusion only reproducer for our issue and add it to the ticket.
let projection_mapping = self.implicit_projection_mapping(&exprs); | ||
let projected_eqs = | ||
self.project(&projection_mapping, projection_mapping.output_schema()); | ||
if let Some(projected_reqs) = projection_mapping.project_lex_reqs(reqs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure I follow this logic -- does it say that any expression that can be projected maintains the ordering?
What about non monotonic expressions like abs(x)
? If the input is orderered by x
-2
-1
0
1
2
The output will not be ordered by abs(xx)
2
1
0
1
2
(the same applies to date functions like extract(day from ts)
Perhaps we need to check FuncMonotonicity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure I follow this logic -- does it say that any expression that can be projected maintains the ordering?
No, it just projects everything as if they are evaluated (in terms of ordering properties). For instance if ts
is not ordered. We know that date_bin(ts)
is not ordered. Similarly abs(x)
is not ordered no matter x
is ordered or not. update_ordering
considers the properties of functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was trying to say that in general, for ProjectionExec
and any node that computes expressions, I would expect that we would have to use FuncMonotonicity
to determine if an expression of x
is ordered by expr(x)
Maybe that would be a good thing to do as a follow on PR
Thanks, it will help us pinpoint exactly what is wrong |
…synnada-ai/arrow-datafusion into bug_fix/complex_aggregate_exprs
By the way, I want to stress that this PR is beneficial even if it doesn't solve the issue in #8043. This PR enables us to run some of the aggregate queries in ordered or partially ordered mode. As can be seen from the test result
Also after this PR I would expect that following query result mentioned in the #8043.
would turn into (pointed
Then enforce sorting wouldn't remove the |
I marked this PR as draft. The reason is that, I have realized that we may not need to have implicit projection pass to have this support. If so, having implicit projection is a bit, cumbersome. I will try to implement alternative algorithm. |
# Conflicts: # datafusion/core/src/physical_optimizer/enforce_distribution.rs # datafusion/core/src/physical_optimizer/replace_with_order_preserving_variants.rs # datafusion/physical-expr/src/equivalence.rs # datafusion/sqllogictest/test_files/groupby.slt
This is complete in PR8281 |
Which issue does this PR close?
Closes #8043. Part of #8064
Rationale for this change
Group by cannot understand ordering of the complex expression currently.
Assume that we know that column
ts
is ordered. When some writes a query in the formGROUP BY (ts)
we can understand thatts
is ordered, then runAggregateExec
inSorted
mode. However, when some writesdate_bin(ts)
, we cannot understanddate_bin(ts)
is indeed ordered givents
is ordered (Please note that datafusion has this mechanism. However, in group by we do not use this mechanism).This PR solves this problem.
The design is as follows, we apply an implicit projection before checking
Then existing code runs as before. This implicit projection enables us to treat complex expression as columns, where their result are calculated(Please note that this calculation is done only in terms of ordering properties. There is no computation).
Assume that existing ordering is
ts ASC
. Also assume that some writesGROUP BY (date_bin(ts))
in the query.Implicit projection do following projection:
ts -> col(ts), date_bin(ts)-> col(date_bin(ts))
.This projection transforms existing ordering from
ts ASC
to[ts ASC], [date_bin(ts) ASC]
. Then existing algorithms works on this projected form.Please note that: Changes in the
replace_order_preserving_variants.rs
file only comes from schema changes. Previously some of the plans were invalid (such column should be at index 1, however it is at index 2). However, they were silent. Because these plans never executed. With this new algorithm, algorithm considers existing schema during projection also. Hence these silent bugs aroused. I also fixed them as part of this PR.What changes are included in this PR?
Are these changes tested?
Yes new tests are added
Are there any user-facing changes?