-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup type coercion (#3419) #6778
Conversation
// utf8 = utf8 | ||
col("c1").eq(col("c1")), | ||
// u8 AND u8 | ||
col("c3").and(col("c3")), | ||
// utf8 = bool | ||
col("c1").eq(bool_expr.clone()), | ||
// u32 AND bool | ||
col("c2").and(bool_expr), | ||
col("c3").bitand(col("c3")), | ||
// utf8 = u8 | ||
col("c1").eq(col("c3")), | ||
// bool AND bool | ||
bool_expr.clone().and(bool_expr), | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure what this test was supposed to be testing, but a lot of these expression are ill-formed and now return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps @izveigor remembers 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with this test, but it really seems redundant.
Operator::Multiply | | ||
Operator::Divide| | ||
Operator::Modulo => { | ||
// TODO: this logic would be easier to follow if the functions were inlined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to do this a follow up PR, but wanted to avoid conflating too many things in one PR
// date - date | ||
(Date32, Date32) => Some(Interval(DayTime)), | ||
(Date64, Date64) => Some(Interval(MonthDayNano)), | ||
(Date32, Date64) | (Date64, Date32) => Some(Interval(MonthDayNano)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not actually supported, so I'm not sure why it was here
// date - timestamp, timestamp - date | ||
(Date32, Timestamp(_, _)) | ||
| (Timestamp(_, _), Date32) | ||
| (Date64, Timestamp(_, _)) | ||
| (Timestamp(_, _), Date64) => { | ||
// TODO: make get_result_type must after coerce type. | ||
// if type isn't coerced, we need get common type, and then get result type. | ||
let common_type = temporal_coercion(lhs_type, rhs_type); | ||
common_type.and_then(|t| mathematics_temporal_result_type(&t, &t)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formulation using signature avoids this, this function is only fed types after coercion
} | ||
|
||
/// Coercion rules for mathematics operators between decimal and non-decimal types. | ||
pub fn math_decimal_coercion( | ||
lhs_type: &DataType, | ||
rhs_type: &DataType, | ||
) -> (Option<DataType>, Option<DataType>) { | ||
) -> Option<(DataType, DataType)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this method to be more consistent with the signature of other coercion functions, i.e. returning None
to indicate a coercion failure, instead of don't coerce
@@ -743,9 +714,6 @@ fn temporal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataTyp | |||
use arrow::datatypes::IntervalUnit::*; | |||
use arrow::datatypes::TimeUnit::*; | |||
|
|||
if lhs_type == rhs_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was incorrect if the types are not temporal, rather than adding a check, I just removed it as we explicitly test the types below
.to_string(); | ||
assert_contains!(&err, "Date32 + Date64 can't be evaluated because there isn't a common type to coerce the types to"); | ||
let (lhs, rhs) = | ||
get_input_types(&DataType::Date32, &Operator::Plus, &DataType::Date64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I think it was a bug that this wasn't supported, there is no good reason that I can see as to why we wouldn't coerce to the larger representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one point I think the arrow cast kernels didn't support the coercion so perhaps that is why it was not done automatically
op, | ||
ref right, | ||
}) => { | ||
// this is a workaround for https://github.com/apache/arrow-datafusion/issues/3419 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -499,6 +492,7 @@ pub fn coercion_decimal_mathematics_type( | |||
left_decimal_type: &DataType, | |||
right_decimal_type: &DataType, | |||
) -> Option<DataType> { | |||
// TODO: Move this logic into kernel implementations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intend to this as a follow on, it is peculiar that BinaryExpr is performing a precision/scale coercion prior to calling kernels, which then in the case of divide then do their own scale coercion. The kernels should just be smart enough to perform the arithmetic correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @viirya
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, if the coercion rule that actually comes from Spark is considered to be general enough for all downstreams.
@@ -432,11 +432,15 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp; | |||
|
|||
# Exected error: interval (scalar) - date / timestamp (scalar) | |||
|
|||
query error DataFusion error: type_coercion\ncaused by\nError during planning: Interval\(MonthDayNano\) \- Date32 can't be evaluated because there isn't a common type to coerce the types to | |||
query D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this work wasn't actually intentional, but I don't see a reason why they shouldn't work
@@ -38,14 +173,6 @@ fn mathematics_temporal_result_type( | |||
use arrow::datatypes::IntervalUnit::*; | |||
use arrow::datatypes::TimeUnit::*; | |||
|
|||
if !is_interval(lhs_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is unnecessary given the match block that follows
match (lhs_type, rhs_type) { | ||
// interval +/- | ||
(Interval(_), Interval(_)) => Some(Interval(MonthDayNano)), | ||
(Date64, Date32) | (Date32, Date64) => Some(Date64), | ||
(Utf8, Date32) | (Date32, Utf8) => Some(Date32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was split out into a separate string_temporal_coercion as it cannot be used for arithmetic
} | ||
_ => (None, None), | ||
// Unlike with comparison we don't coerce to a decimal in the case of floating point | ||
// numbers, instead falling back to floating point arithmetic instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this behaviour was intentional or accidental, but I opted to preserve it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👨🍳 👌 -- this is a very nice cleanup -- thank you very much @tustvold
I had some stylistic nit-picks but nothing that should prevent this from merging in my mind.
Thank you for cleaning this up
cc @jackwener and @liukun4515 as I think this is a topic you have previously expressed interest in as well
// utf8 = utf8 | ||
col("c1").eq(col("c1")), | ||
// u8 AND u8 | ||
col("c3").and(col("c3")), | ||
// utf8 = bool | ||
col("c1").eq(bool_expr.clone()), | ||
// u32 AND bool | ||
col("c2").and(bool_expr), | ||
col("c3").bitand(col("c3")), | ||
// utf8 = u8 | ||
col("c1").eq(col("c3")), | ||
// bool AND bool | ||
bool_expr.clone().and(bool_expr), | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps @izveigor remembers 🤔
@@ -432,11 +432,15 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp; | |||
|
|||
# Exected error: interval (scalar) - date / timestamp (scalar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Exected error: interval (scalar) - date / timestamp (scalar) |
op, | ||
ref right, | ||
}) => { | ||
// this is a workaround for https://github.com/apache/arrow-datafusion/issues/3419 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
} | ||
Expr::BinaryExpr(BinaryExpr { left, op, right }) => { | ||
let lhs = left.get_type(&self.schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the names left_type
and right_type
I think would be more consistent with the rest of the code in this file and more readable. The convention in the rest of the code seems to be lhs
and rhs
refer to the left and right argument Expr
themselves
The same comment applies to the other changes in this file too
use crate::Operator; | ||
|
||
/// The type signature of an instantiation of binary expression | ||
struct Signature { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can use a different name for this struture as there is already a Signature
in https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/struct.Signature.html
Perhaps OpSignature
or BinarySignature
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is module private I personally think this is fine, I generally dislike adding additional qualifiers to types, we can always change down the line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I really dislike is overloading names of things that are different -- my feeble mind can't mentally alias / keep the two separate
} | ||
} | ||
|
||
/// Returns a [`Signature`] for applying `op` to arguments of type `lhs` and `rhs` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ This is perfect. Thank you
/// Returns the output type of applying numeric operations such as `=` | ||
/// to arguments `lhs_type` and `rhs_type` if both are numeric | ||
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation | ||
/// where one is numeric and one is `Utf8`/`LargeUtf8`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// where one is numeric and one is `Utf8`/`LargeUtf8`. | |
/// where one is temporal and one is `Utf8`/`LargeUtf8`. |
@@ -499,6 +492,7 @@ pub fn coercion_decimal_mathematics_type( | |||
left_decimal_type: &DataType, | |||
right_decimal_type: &DataType, | |||
) -> Option<DataType> { | |||
// TODO: Move this logic into kernel implementations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @viirya
assert_eq!(e, "Float32 + Utf8 can't be evaluated because there isn't a common type to coerce the types to"); | ||
assert_eq!( | ||
e, | ||
"Cannot coerce arithmetic expression Float32 + Utf8 to valid types" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new messages are easier to understand 👍
.to_string(); | ||
assert_contains!(&err, "Date32 + Date64 can't be evaluated because there isn't a common type to coerce the types to"); | ||
let (lhs, rhs) = | ||
get_input_types(&DataType::Date32, &Operator::Plus, &DataType::Date64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one point I think the arrow cast kernels didn't support the coercion so perhaps that is why it was not done automatically
Which issue does this PR close?
Part of #3419
Rationale for this change
This is the first of what will be a series of PRs to consolidate the type coercion logic and make it easier to follow. In particular this PR adds the ability to express type coercion of heterogeneous types, as occurs in arithmetic. This in turn feeds into the effort to move more of this logic upstream apache/arrow-rs#3999
What changes are included in this PR?
Replaces
coerce_types
withget_input_types
that returns separate types for the left and right expressions. It also consolidates the coercion logic in a single privatesignature
function to avoid inconsistency and reduce repetition.Are these changes tested?
Are there any user-facing changes?