Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add coercion rules for AggregateFunctions #1387

Merged
merged 6 commits into from
Dec 7, 2021

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Dec 1, 2021

Which issue does this PR close?

part of #1356

We can extract two common function for all coercion rules, which are coerce_types and coerce_exprs.

The coerce_types is used to get the coerced result data type if the expr need to be added a tyr_cast.
The coerce_exprs is used to get the coerced exprs.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Dec 1, 2021
@liukun4515 liukun4515 changed the title Add new framework expr coercion refactor the expr coercion[aggregate expr function] Dec 1, 2021
@liukun4515
Copy link
Contributor Author

@alamb PTAL
In the #1356, we reach the same idea to refactor the coercion framework and split the coercion rules to a single module.
This is the part for type coercion, and just for aggregate function.
If this approach is fine for you and the community, we can continue other refactoring work, such as scalar function expr, udf, UDAF.

@alamb alamb changed the title refactor the expr coercion[aggregate expr function] Add coercion rules for AggregateFunctions Dec 1, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liukun4515 I really like where this is heading and it makes sense to me to have aggregate coercion rules that are separate from coercion rules for other functions.

I am curious how you see the overall code structure evolving. Is your idea that there will be several coercion modules, for each different Expr type, for example?

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs
datafusion/src/physical_plan/coercion_rule/function.rs
datafusion/src/physical_plan/coercion_rule/operators.rs

I had a few suggestions / comments, but overall I think this is looking almost ready to merge

Thank you for starting this work

@@ -2058,7 +2058,7 @@ mod tests {
.await
.unwrap_err();

assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");
assert_eq!(results.to_string(), "Error during planning: The function Sum do not support the Timestamp(Nanosecond, None).");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add the valid signatures into this error message? The new wording is more readable, but we did lose some information about what type signatures are valid

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some SQL systems, if we input the incompatible datatype, they just throw the error or except and don't give the compatible data type.
We can refine this later if the supported data type is necessary for the user.

match signature.type_signature {
TypeSignature::Uniform(agg_count, _) | TypeSignature::Any(agg_count) => {
if input_types.len() != agg_count {
return Err(DataFusionError::Plan(format!("The function {:?} expect argument number is {:?}, but the input argument number is {:?}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Err(DataFusionError::Plan(format!("The function {:?} expect argument number is {:?}, but the input argument number is {:?}",
return Err(DataFusionError::Plan(format!("The function {:?} expects {:?} arguments, but {:?} were provided",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion.
Done

use std::ops::Deref;
use std::sync::Arc;

pub fn coerce_types(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some docstrings would help here (perhaps a pointer to the module level documentation as is done in type_coercion.rs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have added the doc for these two functions.

// TODO add checker, if the value type is complex data type
Ok(vec![dict_value_type.deref().clone()])
}
// TODO add checker for datatype which min and max supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is a good TODO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add the checker in the follow-up pull request.
In this pull request, we just make consistent with before logic.

let input_types = vec![
vec![DataType::Int32],
vec![DataType::Float32],
// vec![DataType::Decimal(20, 3)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, here I think should work now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will support the decimal data type for SUM/AVG function in the feature #1408.
I have added the TODO to mark it.

// specific language governing permissions and limitations
// under the License.

//! define the coercion rule for different Expr type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! define the coercion rule for different Expr type
//! define coercion rules for aggregate functions

@@ -62,6 +62,24 @@ pub fn avg_return_type(arg_type: &DataType) -> Result<DataType> {
}
}

pub(crate) fn is_avg_support_arg_type(arg_type: &DataType) -> bool {
// TODO support the interval
// TODO: do we need to support the unsigned data type?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by the unsigned data type? If you mean "do we support AVG(UInt8)" I think the answer is yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

@@ -60,6 +60,7 @@ pub mod helpers {

pub use approx_distinct::ApproxDistinct;
pub use array_agg::ArrayAgg;
pub(crate) use average::is_avg_support_arg_type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to export this function (aka no need to declare it pub crate in this function). Likewise below

Copy link
Contributor Author

@liukun4515 liukun4515 Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_eq!(err.to_string(), "Error during planning: Invalid or wrong number of arguments passed to aggregate: 'COUNT(DISTINCT )'");
let logical_plan = ctx.create_logical_plan(sql);
let err = logical_plan.unwrap_err();
assert_eq!(err.to_string(), DataFusionError::Plan("The function Count expect argument number is 1, but the input argument number is 0".to_string()).to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a nicer error message for sure 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be changed to Error during planning: The function Count expects 1 arguments, but 0 were provided
After this comments https://github.com/apache/arrow-datafusion/pull/1387/files#r760575015

@liukun4515
Copy link
Contributor Author

@alamb I wish this refactor can involved many community developer or committer, and other developer can know this context.
I can remain this pull request for a while and don't merge it quickly.
What about your opinion?

@liukun4515
Copy link
Contributor Author

Thank you for your review and grammar suggestions.
I will address them later.

@liukun4515
Copy link
Contributor Author

but overall I think this is looking almost ready to merge

Thanks @liukun4515 I really like where this is heading and it makes sense to me to have aggregate coercion rules that are separate from coercion rules for other functions.

I am curious how you see the overall code structure evolving. Is your idea that there will be several coercion modules, for each different Expr type, for example?

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs
datafusion/src/physical_plan/coercion_rule/function.rs
datafusion/src/physical_plan/coercion_rule/operators.rs

Yes, this is the future result and diff type exprs have diff coercion rule in diff file.
The code structure will look like the example your provided.

I had a few suggestions / comments, but overall I think this is looking almost ready to merge

Thank you for starting this work

@alamb
Copy link
Contributor

alamb commented Dec 2, 2021

I can remain this pull request for a while and don't merge it quickly.

I think that is a good idea @liukun4515 -- though the danger of doing so is that we will end up with merge conflicts

Perhaps @Dandandan @houqp @rdettai or @jimexist would like to weigh in on the structure and direction as well

@xudong963
Copy link
Member

Thanks for your work! @liukun4515. If possible(the ticket doesn't be merged), I'll take a look at the weekend, because these days I am so busy with my job demands 🤦‍♂️, work 12 + hours per day ......

@liukun4515
Copy link
Contributor Author

@xudong963 I will begin to address the comments from next week.
This pull request is not urgent.

@alamb
Copy link
Contributor

alamb commented Dec 6, 2021

@xudong963 and @liukun4515 what is the status of this PR? Shall we merge it? Is it waiting on more review / work?

@liukun4515
Copy link
Contributor Author

@xudong963 and @liukun4515 what is the status of this PR? Shall we merge it? Is it waiting on more review / work?

@alamb
I'm waiting for review comments these days.
But from today, I will address all comments.
After comments are addressed, I will ask someone to re-review.

@liukun4515
Copy link
Contributor Author

liukun4515 commented Dec 7, 2021

@alamb PTAL
It can be merged if it is looking good to you.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @liukun4515 I think this is good to go.

It would be nice to clean up the tests but we can always do that as a follow on PR

Comment on lines +420 to +442
let mut expect_type = data_type.clone();
if matches!(
data_type,
DataType::UInt8
| DataType::UInt16
| DataType::UInt32
| DataType::UInt64
) {
expect_type = DataType::UInt64;
} else if matches!(
data_type,
DataType::Int8
| DataType::Int16
| DataType::Int32
| DataType::Int64
) {
expect_type = DataType::Int64;
} else if matches!(
data_type,
DataType::Float32 | DataType::Float64
) {
expect_type = data_type.clone();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI you can write this kind of logic in a more concise way with something like (untested and abbreviated)

let expect_type = match (data_type) {
  DataType::UInt8 | .... => DataType::UInt64,
  DataType::Int8 | .... => DataType::Int64,
  _ => data_type.clone()
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1416 <-- PR wth proposed cleanup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good style and suggestion!

@alamb alamb merged commit 415c5e1 into apache:master Dec 7, 2021
@alamb
Copy link
Contributor

alamb commented Dec 7, 2021

Thanks also to @xudong963 for the review

@alamb alamb mentioned this pull request Dec 7, 2021
@houqp
Copy link
Member

houqp commented Dec 8, 2021

Thank you @liukun4515 for your contribution on this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants