optimizer: add framework for the rule of pre-add cast to the literal in comparison binary #3185
Conversation
Force-pushed from d5ee16b to 5fe6e26 (Compare)
@alamb @andygrove PTAL
@@ -271,8 +271,8 @@ async fn csv_explain_plans() {
let expected = vec![
"Explain [plan_type:Utf8, plan:Utf8]",
" Projection: #aggregate_test_100.c1 [c1:Utf8]",
" Filter: #aggregate_test_100.c2 > Int64(10) [c1:Utf8, c2:Int32]",
" TableScan: aggregate_test_100 projection=[c1, c2], partial_filters=[#aggregate_test_100.c2 > Int64(10)] [c1:Utf8, c2:Int32]",
" Filter: #aggregate_test_100.c2 > Int32(10) [c1:Utf8, c2:Int32]",
After optimization, the Int64(10) literal is cast to Int32(10), because the type of the left side is Int32.
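The rule moves the cast onto the literal side: with c2: Int32, the predicate c2 > Int64(10) becomes c2 > Int32(10) because the literal fits in Int32. A minimal sketch of that range check in plain Rust (names are illustrative, not the PR's API):

// Can an Int64 literal be losslessly narrowed to Int32, so the comparison
// can run directly on the Int32 column without casting every row?
fn fits_in_i32(lit_value: i64) -> bool {
    lit_value >= i32::MIN as i64 && lit_value <= i32::MAX as i64
}

fn main() {
    // c2 > Int64(10) with c2: Int32  ->  rewrite to c2 > Int32(10)
    assert!(fits_in_i32(10));
    // c2 > Int64(3_000_000_000) cannot be narrowed this way
    assert!(!fits_in_i32(3_000_000_000));
}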
}
// TODO: optimize in list
// Expr::InList { .. } => {}
// TODO: handle other expr type and dfs visit them
We need to support other types of Expr.
Is the plan to add more types in this PR or as a follow-on? If the latter, could you file an issue and reference it here?
The sub-tasks for this optimizer are tracked in the issue list: #3031 (comment)
Codecov Report
@@ Coverage Diff @@
## master #3185 +/- ##
==========================================
- Coverage 85.85% 85.81% -0.04%
==========================================
Files 291 292 +1
Lines 52786 53111 +325
==========================================
+ Hits 45320 45579 +259
- Misses 7466 7532 +66
Force-pushed from 5fe6e26 to 6f34a4a (Compare)
Force-pushed from 6f34a4a to d3a38d1 (Compare)
@@ -146,7 +147,20 @@ impl TableProvider for CustomProvider {
match &filters[0] {
Expr::BinaryExpr { right, .. } => {
let int_value = match &**right {
Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,
I think you might want to avoid doing this for NULL (aka None) values. Something like:
- Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,
+ Expr::Literal(ScalarValue::Int8(Some(i))) => i as i64,
Thanks for your comments. I just followed the original implementation, but I will apply your suggestion.
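For reference, a minimal null-safe sketch along the lines suggested above (the helper name is hypothetical, and import paths may differ between DataFusion versions):

use datafusion::logical_plan::Expr;
use datafusion::scalar::ScalarValue;

// Pull an i64 out of a non-null integer literal. Matching `Some(..)` inside
// the ScalarValue means a NULL literal falls through to `None` instead of
// panicking on `unwrap`.
fn literal_as_i64(expr: &Expr) -> Option<i64> {
    match expr {
        Expr::Literal(ScalarValue::Int8(Some(i))) => Some(*i as i64),
        Expr::Literal(ScalarValue::Int16(Some(i))) => Some(*i as i64),
        Expr::Literal(ScalarValue::Int32(Some(i))) => Some(*i as i64),
        Expr::Literal(ScalarValue::Int64(Some(i))) => Some(*i),
        _ => None,
    }
}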
Force-pushed from 17d871b to eae5133 (Compare)
@alamb @andygrove are there any comments on this PR?
@liukun4515 I will start reviewing this PR today. I am also adding a type coercion rule in #3101 and I plan on adding others. Do you think it makes sense to have one type coercion rule or multiple? It might be more efficient to have one rule that can handle all the cases?
ScalarValue::Int32(Some(v)) => *v as i64,
ScalarValue::Int64(Some(v)) => *v as i64,
other_type => {
panic!("Invalid type and value {:?}", other_type);
I created a PR against this PR to remove these panics and use DataFusionError::Internal instead. I quite often hit panics in the code that should not be possible, so I think we should use Result where possible.
I check the data type before this method. The data type can't hit this panic, which is guaranteed by is_support_data_type:
fn is_support_data_type(data_type: &DataType) -> bool {
// TODO support decimal with other data type
matches!(
data_type,
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
)
}
But I will change the return type to Result<T> and clean up the code that panics.
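A rough sketch of that cleanup, returning DataFusionError::Internal instead of panicking (the helper name is illustrative, not the PR's actual function):

use datafusion::error::{DataFusionError, Result};
use datafusion::scalar::ScalarValue;

// Widen a supported integer ScalarValue to i64. Unsupported types become an
// Internal error rather than a panic; `is_support_data_type` should already
// have filtered them out upstream, so hitting this arm signals a bug.
fn scalar_to_i64(value: &ScalarValue) -> Result<i64> {
    match value {
        ScalarValue::Int8(Some(v)) => Ok(*v as i64),
        ScalarValue::Int16(Some(v)) => Ok(*v as i64),
        ScalarValue::Int32(Some(v)) => Ok(*v as i64),
        ScalarValue::Int64(Some(v)) => Ok(*v),
        other => Err(DataFusionError::Internal(format!(
            "Unsupported literal for the unwrap-cast rule: {:?}",
            other
        ))),
    }
}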
ScalarValue::Int16(Some(v)) => *v as i64,
ScalarValue::Int32(Some(v)) => *v as i64,
ScalarValue::Int64(Some(v)) => *v,
_ => unimplemented!(),
This method returns Result, so we should return errors here rather than panic.
Done
Love it! I'm very excited to see datafusion doing this for new code 😃
if lit_value >= target_min && lit_value <= target_max {
return true;
}
false
- if lit_value >= target_min && lit_value <= target_max {
-     return true;
- }
- false
+ lit_value >= target_min && lit_value <= target_max
👍
I went through your #3101, which adds an optimizer rule to do type coercion. I want to investigate other systems and find out where they do type coercion for input expressions.
Force-pushed from 6bb9c53 to 8455e3c (Compare)
I fixed the CI.
let left_type = left_type.unwrap();
let right_type = right_type.unwrap();
I know the unwrap calls here are safe due to the previous check, but we can still use ? instead.
- let left_type = left_type.unwrap();
- let right_type = right_type.unwrap();
+ let left_type = left_type?;
+ let right_type = right_type?;
@andygrove got it, I will refine this in the next PR 👍
)
.unwrap() =>
- )
- .unwrap() =>
+ )? =>
if origin_value.is_null() {
// if the origin value is null, just convert it to a null value of the target type
// The target type must satisfy `is_support_data_type`, so we can unwrap safely
return Ok(lit(ScalarValue::try_from(target_type).unwrap()));
- return Ok(lit(ScalarValue::try_from(target_type).unwrap()));
+ return Ok(lit(ScalarValue::try_from(target_type)?));
LGTM. Thanks @liukun4515. I think it is better to use ? rather than unwrap even when we think unwrap is safe, so I left some suggestions, but these are not blockers, so I will go ahead and merge.
Benchmark runs are scheduled for baseline = eedc787 and contender = 9ecf277. 9ecf277 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Thanks for your time and review.
Sorry for my delayed response. In general, I think the coercion logic could use some serious ❤️ in DataFusion, as I find it confusing. However, it seems to be working and we are making progress, so my plan is to just "go with it" until I hit some issue that gives me an excuse to improve things.
/// which data type is `target_type`.
/// If this is false, do nothing.
///
/// This is inspired by the optimizer rule `UnwrapCastInBinaryComparison` of Spark.
Thank you for these comments 👍
It should be done by me. Good docs are the start of a good project and good code. @alamb
It really helps me review code when there are comments that explain the rationale.
Which issue does this PR close?
Part of #3031.
Rationale for this change
TODO:
- support other numeric types, for example decimal
- support InList
- simplify the binary comparison, e.g. INT32(C1) > INT64(INT32.MAX) (see the sketch after this list)
- other features in the Spark rule that can be used in DataFusion: https://github.com/sunchao/spark/blob/1f496fbea688c7082bad7e6280c8a949fbfd31b7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala#L30
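As a worked example of the simplification TODO above: when the Int64 literal lies at or beyond the range of the Int32 column, the comparison has a constant answer for non-null inputs and no cast is needed at all. A minimal sketch of that idea in plain Rust (not the PR's code):

// `c1 > literal` where c1 is Int32 and the literal is an Int64 constant.
// Returns Some(answer) when the result is constant, None when the literal
// fits in Int32 and the cast-rewrite path should be used instead.
fn fold_int32_gt(lit_value: i64) -> Option<bool> {
    if lit_value >= i32::MAX as i64 {
        // e.g. INT32(C1) > INT64(INT32.MAX): no Int32 value is strictly greater
        Some(false)
    } else if lit_value < i32::MIN as i64 {
        // every non-null Int32 value is greater than such a literal
        Some(true)
    } else {
        None
    }
}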
What changes are included in this PR?
Are there any user-facing changes?