Short term way to make AggregateStatistics still work when min/max is converted to udaf #11261
@alamb @edmondop datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs Lines 250 to 257 in 569be9e
It is mainly due to For solving it, I define a |
Thank you @Rachelint ❤️
This makes sense -- I agree the One way I could think to get around this would be to figure out the column statistics in the physical planner and pass them directly (rather than making the UDF figure them out). Something like
pub trait AggregateUDFImpl {
    ...
    /// If the output of this aggregate function can be determined
    /// only from input statistics (e.g. COUNT(..) with 0 rows, is always 0)
    /// return that value
    ///
    /// # Arguments
    /// stats: Overall statistics for the input (including row count)
    /// arg_stats: Statistics, if known, for each input argument
    fn output_from_stats(
        &self,
        stats: &Statistics,
        arg_stats: &[Option<&ColumnStatistics>]
    ) -> Option<ScalarValue> {
        None
    }
    ...
}
Another option might be to use the narrower API described on #11153
impl AggregateExpr {
    /// Return the value of the aggregate function, if known, given the number of input rows.
    ///
    /// Return None if the value can not be determined solely from the input.
    ///
    /// # Examples
    /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
    /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0`
    /// * The `MIN` aggregate would return `None` given `num_rows = 11`
    fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> { None }
    ...
}
Though I think your idea is better (pass in statistics in general) |
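To make the statistics-based proposal above concrete, here is a minimal, self-contained sketch. It does not use the real DataFusion types: `ScalarValue`, `ColumnStatistics`, `Statistics` and the trait below are simplified, hypothetical stand-ins that only borrow the names from the discussion. The `Count` impl models a `COUNT(*)`-like aggregate and ignores nulls, which is an assumption made for brevity.

```rust
/// Hypothetical, simplified stand-in for a scalar value (Int64 only).
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
}

/// Hypothetical stand-in for per-column statistics.
/// `None` means the exact minimum is not known.
#[derive(Debug, Clone, Default)]
struct ColumnStatistics {
    min_value: Option<ScalarValue>,
}

/// Hypothetical stand-in for input-level statistics.
#[derive(Debug, Clone, Default)]
struct Statistics {
    /// Exact row count, if known.
    num_rows: Option<usize>,
}

/// Sketch of the hook discussed above: each aggregate decides for itself
/// whether its result can be derived from statistics alone.
trait AggregateUDFImpl {
    fn output_from_stats(
        &self,
        _stats: &Statistics,
        _arg_stats: &[Option<&ColumnStatistics>],
    ) -> Option<ScalarValue> {
        None // default: cannot be answered from statistics
    }
}

struct Min;

impl AggregateUDFImpl for Min {
    fn output_from_stats(
        &self,
        _stats: &Statistics,
        arg_stats: &[Option<&ColumnStatistics>],
    ) -> Option<ScalarValue> {
        // Use the exact minimum of the first argument, if it is known.
        arg_stats.first().copied().flatten()?.min_value.clone()
    }
}

struct Count;

impl AggregateUDFImpl for Count {
    fn output_from_stats(
        &self,
        stats: &Statistics,
        _arg_stats: &[Option<&ColumnStatistics>],
    ) -> Option<ScalarValue> {
        // COUNT(*)-like behaviour: the answer is the exact row count, if known.
        let rows = i64::try_from(stats.num_rows?).ok()?;
        Some(ScalarValue::Int64(Some(rows)))
    }
}
```

With this shape, an optimizer would not need to know which aggregate it is looking at: it would call `output_from_stats` and replace the aggregate with a literal whenever `Some(..)` comes back.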
The draft implementation seems reasonable and all checks are passing 🎉 |
@alamb One thing I worry about with the narrow API is that it seems it can't be used to support the original optimization of min/max? datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs Lines 195 to 221 in 7df000a
Maybe I misunderstand about it? |
No, sorry, you are correct, I was mistaken |
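The limitation raised above can be illustrated with a small sketch of the narrow, row-count-only API. The types here are hypothetical stand-ins rather than the real DataFusion ones, and the behaviour simply follows the examples in the doc comment of the proposed `output_from_rows`.

```rust
/// Hypothetical, simplified stand-in for a scalar value.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Null,
    Int64(i64),
}

/// Sketch of the narrow API: the only input is the number of rows.
trait AggregateExpr {
    fn output_from_rows(&self, _num_rows: usize) -> Option<ScalarValue> {
        None
    }
}

struct Count;

impl AggregateExpr for Count {
    fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> {
        // The row count alone is enough to answer COUNT.
        i64::try_from(num_rows).ok().map(ScalarValue::Int64)
    }
}

struct Min;

impl AggregateExpr for Min {
    fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> {
        // Only the empty input can be answered; for any other row count
        // the minimum depends on column values that are not available here.
        if num_rows == 0 {
            Some(ScalarValue::Null)
        } else {
            None
        }
    }
}
```

Because `Min` has no access to column statistics in this shape, it can never return an exact minimum for a non-empty input, which is what the existing min/max optimization relies on.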
OK, I've been a bit busy the past couple of days. I will continue reading the related code today and think about a relatively good way to solve this...
|
no worries -- there is a lot going on these days. I think your short term workaround sounds very clever to me 👍 I think it is fine to leave figuring out the right general API to add to AggregateUdfImpl as a future project (my rationale being that special casing "max" / "min" is no worse than hard coding the |
😄 Thanks, I just implemented the short-term workaround. |
I agree -- it will not be correct. I think that is why we eventually need to move the logic out of the optimizer and into the AggregateUDFImpl itself. |
🤔 Seems indeed necessary to do that. |
Thank you @Rachelint -- I think this is an improvement as it will unblock pulling out the Min/Max aggregate functions, even though there is more to do. I also happen to think this PR makes the intent easier to understand (by moving the is_min and is_max checks into functions).
Nice work
my only suggestion is to add comments with a link to the ticket that explains the rationale
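As an illustration of what moving the `is_min` and `is_max` checks into functions could look like, here is a hedged, self-contained sketch. It is not the code from this PR: `AggregateExpr`, `Min`, `Max` and `AggregateFunctionExpr` are simplified, hypothetical stand-ins, and matching a UDAF by its name is an assumption about how the short-term workaround identifies converted functions.

```rust
use std::any::Any;

/// Hypothetical stand-in for the physical aggregate expression trait.
trait AggregateExpr {
    fn as_any(&self) -> &dyn Any;
}

/// Stand-ins for the built-in (non-UDAF) expressions.
struct Min;
struct Max;

impl AggregateExpr for Min {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl AggregateExpr for Max {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Stand-in for the wrapper around a user-defined aggregate function.
struct AggregateFunctionExpr {
    name: String,
}

impl AggregateExpr for AggregateFunctionExpr {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// True if the expression is a UDAF wrapper with the given name.
fn is_udaf_named(agg_expr: &dyn AggregateExpr, name: &str) -> bool {
    agg_expr
        .as_any()
        .downcast_ref::<AggregateFunctionExpr>()
        .map_or(false, |udaf| udaf.name.eq_ignore_ascii_case(name))
}

// Short-term check: accept either the built-in type or a UDAF called "min".
fn is_min(agg_expr: &dyn AggregateExpr) -> bool {
    agg_expr.as_any().is::<Min>() || is_udaf_named(agg_expr, "min")
}

// Short-term check: accept either the built-in type or a UDAF called "max".
fn is_max(agg_expr: &dyn AggregateExpr) -> bool {
    agg_expr.as_any().is::<Max>() || is_udaf_named(agg_expr, "max")
}
```

A name-based check like this would also match any user-defined aggregate that happens to be called `min` or `max`, which is one reason the thread treats it as a stopgap until the logic can live in the UDAF itself.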
Thanks @alamb for review, will continue trying to move them into |
Thanks again @Rachelint -- 🚀 we are making progress
Thank you @Rachelint . I think I can resume #11013 unless @alamb you think there is more work we want to externalize/handle via separate PRs |
I am not sure -- I suggest going through all uses of The goal of #11013 I think should also remove the built in Min/Max aggregate |
Right ! It is looking good. The refactoring has broken an optimiser, will need to investigate |
Makes sense. I am almost done with the other PR; I need to understand why one optimisation was lost |
…is converted to udaf (apache#11261) * impl the short term solution. * add todos.
Which issue does this PR close?
Part of #11153
Rationale for this change
Currently the AggregateStatistics optimizer identifies the min/max aggregate functions only through hard-coded checks like
if agg_expr.as_any().is::<expressions::Min>()
These checks stop working once min/max is converted to udaf.
This PR addresses that problem. The best solution would be to add a function like output_from_stats to AggregateExpr and move the logic from the optimizer into the udaf, but unfortunately it is not easy to find a good way to do that. To unblock the udaf conversion of min/max, this PR just implements a short-term solution.
What changes are included in this PR?
Support identifying min/max after they are converted to udaf later.
Are these changes tested?
By existing tests.
Are there any user-facing changes?
No.