-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Support Dict types in in_list
physical plans
#10031
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb and @jayzhan211 PTAL when you have time.
I think the dict is already flattened, we can just relax the type check and we are good to go.
@@ -540,7 +553,7 @@ mod tests { | |||
&schema | |||
); | |||
|
|||
// expression: "a not in ("a", "b")" | |||
// expression: "a in ("a", "b")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong, just go ahead and fix it.
Thank you @advancedxy -- would it be possible to add some end to end tests as well in https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest? Perhaps we could extend https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt |
Of course, let me try to add some E2E tests. |
Fixed, please let me if you have any other comments. |
@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr { | |||
} | |||
} | |||
|
|||
/// Checks if two types are logically equal, dictionary types are compared by their value types. | |||
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, I think we have a similar method in https://github.com/apache/arrow-datafusion/blob/f55c1d90215614ce531a4103c7dbebf318de1cfd/datafusion/common/src/dfschema.rs#L643 basically the schema struct has a lot of equal checks between datatypes. may want to move them into separate utils struct as they mostly working with datatypes rather than schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we could use datatype_is_logically_equal
instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me promote and use datatype_is_logically_equal
in a follow up PR.
I noticed that method earlier, but thought it might be a bit overkill for cases here in the in_list
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests don't seem to cover the new code.
When I reverted the code change locally like this:
diff --git a/datafusion/physical-expr/src/expressions/in_list.rs b/datafusion/physical-expr/src/expressions/in_list.rs
index ad2da57b5..5d9fb19af 100644
--- a/datafusion/physical-expr/src/expressions/in_list.rs
+++ b/datafusion/physical-expr/src/expressions/in_list.rs
@@ -415,18 +415,6 @@ impl PartialEq<dyn Any> for InListExpr {
}
}
-/// Checks if two types are logically equal, dictionary types are compared by their value types.
-fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
- match (lhs, rhs) {
- (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
- v1.as_ref().eq(v2.as_ref())
- }
- (DataType::Dictionary(_, l), _) => l.as_ref().eq(rhs),
- (_, DataType::Dictionary(_, r)) => lhs.eq(r.as_ref()),
- _ => lhs.eq(rhs),
- }
-}
-
/// Creates a unary expression InList
pub fn in_list(
expr: Arc<dyn PhysicalExpr>,
@@ -438,7 +426,7 @@ pub fn in_list(
let expr_data_type = expr.data_type(schema)?;
for list_expr in list.iter() {
let list_expr_data_type = list_expr.data_type(schema)?;
- if !is_logically_eq(&expr_data_type, &list_expr_data_type) {
+ if !expr_data_type.eq(&list_expr_data_type) {
return internal_err!(
"The data type inlist should be same, the value type is {expr_data_type}, one of list expr type is {list_expr_data_type}"
);
The tests still pass 🤔
$ cargo test -p datafusion-physical-expr -- in_list_utf8_with_dict_types
Finished test [unoptimized + debuginfo] target(s) in 0.09s
Running unittests src/lib.rs (target/debug/deps/datafusion_physical_expr-671316f98d0b9705)
running 1 test
test expressions::in_list::tests::in_list_utf8_with_dict_types ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1574 filtered out; finished in 0.00s
I worry that this means we might break this feature in the future but not know...
@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr { | |||
} | |||
} | |||
|
|||
/// Checks if two types are logically equal, dictionary types are compared by their value types. | |||
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we could use datatype_is_logically_equal
instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PR
Nice catch, and thanks for your careful review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @advancedxy -- I still think there is something not right about this PR
I don't think this PR will change any external behavior (see my comments)
Also the sqllogictest
s still pass without the code change
I am beginning to wonder if we really understand the problem or not.
I would suggest starting at this problem from the other end -- can we write a test that causes the error seen in comet (aka do whatever comet is doing). Once we have that then we can change to code to fix it
], | ||
]; | ||
for list in lists.iter() { | ||
in_list_raw!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this change -- the tests now call in_list_raw
directly. But there is no way that in_list_raw
will be called outside of this module 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there is no way that in_list_raw will be called outside of this module
The function in_list
is public, so it's possible that in_list
is called directly without any type coercion, which is exactly what Comet tries to do in the first place to leverage the static filter optimization. Since there's no type coercion, it's possible the value type and the list type are mixed with dictionary types. I think this test simulates exactly the issue I encountered in the Comet's case.
For the sqllogictest part, jayzhan11 has already pointed out in #9530 (comment), the type coercion is happened in the optimization phase.
In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.
I think that's why the E2E tests already pass without this PR's fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list
(instead they calling the in_list_raw!
macro)
What I was expected to see was a test that mirrors what comet does: call in_list
with a Dictionary column but string literals (that haven't ben type cerced).
Given this case current errors, we have no test coverage, even if the in_list
implementation does actually support it.
Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list (instead they calling the in_list_raw! macro)
There might be some misunderstanding here. in_list
and in_list_raw
are both test macros, the main difference is that the former performs type coercion first and the latter does not. These two macros both call the in_list
method though. The comet case calls the in_list
method(not the test macro) directly without type coercion, which is exactly the in_list_utf8_with_dict_types
test trying to simulate by calling the in_list_raw
test macro.
Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important
No worries. I think it's exactly the spirit we need towards a better software engineering practice. And totally agree that it's important to make sure we get it right and won't break it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not.
Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.
Thanks for your feedback and suggestions. I replied to your comments, which I think should explain your concerns and the current test in this PR should already do what Comet is trying to do. Please let me know if you have any further questions ior concerns. Thanks again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @advancedxy -- yes I was confused about the macro. Thanks for bearing with me !
], | ||
]; | ||
for list in lists.iter() { | ||
in_list_raw!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not.
Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.
in_list
physical plans
Which issue does this PR close?
Closes #9530.
Rationale for this change
Dictionary types are already supported in the
InListExpr
, however, thein_list
func doesn't handle types beingmixed in with Dictionary types.
What changes are included in this PR?
Relax type check in the
in_list
function.Are these changes tested?
Added new test
Are there any user-facing changes?
No