Support multi column IN lists like (c1, c2) IN ((c1, c2), ,,,)
#6635
Comments
I also prefer the |
I think we could also use the existing The trick in that case would be implementing |
Hi @alamb @mingmwang, I am interested in this issue, but I'm not sure about some of the questions.
|
Perhaps it could be represented like |
Sorry, maybe I don't get your point. I think |
I think the idea is that
So a query like Rather than actually implementing a second dimension in
Where the struct function does this:
This would likely be a much simpler implementation and would be faster than the chained OR shown above |
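To make the performance argument concrete, here is a minimal sketch in plain Rust (standard library only, not the DataFusion or arrow-rs API). It contrasts a chained-OR evaluation of a two-column IN list with a struct-style evaluation that packs each row into one value and probes a hash set built once. The function names and the use of `i64` columns are assumptions for illustration, and SQL NULL semantics are ignored.

```rust
use std::collections::HashSet;

/// Chained-OR evaluation:
/// (c1 = x1 AND c2 = y1) OR (c1 = x2 AND c2 = y2) OR ...
/// Every row is compared against every list entry.
fn eval_chained_or(c1: &[i64], c2: &[i64], list: &[(i64, i64)]) -> Vec<bool> {
    c1.iter()
        .zip(c2.iter())
        .map(|(&a, &b)| list.iter().any(|&(x, y)| a == x && b == y))
        .collect()
}

/// Struct-style evaluation: treat (c1, c2) as a single value,
/// build the set of allowed values once, then probe it per row.
fn eval_struct_in_list(c1: &[i64], c2: &[i64], list: &[(i64, i64)]) -> Vec<bool> {
    let set: HashSet<(i64, i64)> = list.iter().copied().collect();
    c1.iter()
        .zip(c2.iter())
        .map(|(&a, &b)| set.contains(&(a, b)))
        .collect()
}
```

Both functions return the same mask; the difference is that the first does work proportional to rows times list length, while the second does roughly one hash probe per row.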
Thanks for your patience @alamb. After your explanation, I feel like I understand it a little bit more clearly.
And the in_list code is generalized. But StructArray is not comparable in arrow-rs since it is nested. Should we implement the comparison in DataFusion or upstream? For example,
|
I see your point. The limitation appears to be in arrow itself: https://github.com/apache/arrow-rs/blob/7e289134a8d9f46a92a2759a7b2488b17993fd5b/arrow-ord/src/cmp.rs#L202-L204
I think it might make sense to support equality of nested types upstream, though I am not sure about ordering comparisons. Perhaps it is worth filing a ticket in arrow-rs and proposing adding |
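As a rough illustration of what equality on a nested (struct) type could mean, the sketch below compares two toy struct columns row by row: a row is equal when every child column agrees at that row. This is plain Rust over vectors, not the arrow-rs API; `ToyStructColumn` and `struct_eq` are hypothetical names, and nulls are ignored entirely.

```rust
/// A toy struct column: each child is a column of i64 values,
/// and all children are expected to have the same length.
struct ToyStructColumn {
    children: Vec<Vec<i64>>,
}

/// Row-wise equality of two toy struct columns.
/// Returns None if the two columns do not have the same shape.
fn struct_eq(left: &ToyStructColumn, right: &ToyStructColumn) -> Option<Vec<bool>> {
    if left.children.len() != right.children.len() {
        return None;
    }
    let rows = left.children.first().map_or(0, |c| c.len());
    let same_len = left
        .children
        .iter()
        .chain(right.children.iter())
        .all(|c| c.len() == rows);
    if !same_len {
        return None;
    }
    // Start with "all rows equal" and AND in each child comparison.
    let mut out = vec![true; rows];
    for (l, r) in left.children.iter().zip(right.children.iter()) {
        for (flag, (a, b)) in out.iter_mut().zip(l.iter().zip(r.iter())) {
            *flag = *flag && a == b;
        }
    }
    Some(out)
}
```

Equality decomposes cleanly into per-field equality like this, whereas an ordering over structs would need an agreed field-by-field convention, which may be why ordering is the less obvious part.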
https://github.com/apache/arrow-rs/blob/9a1e8b572d11078e813fffe3d5c7106b6953d58c/arrow-cast/src/cast.rs#L163-L164 |
Is your feature request related to a problem or challenge?
A user on slack was using datafusion to query parquet files from S3: https://the-asf.slack.com/archives/C01QUFS30TD/p1686147411917959
They reported that the following predicate got 1000x slower when it had 100,000 distinct filter values:
However, when it was rewritten as an inlist it went much faster:
However, this rewrite is not general (for example, it breaks if `code`, `type`, or `operation` contain `_` characters).
SQL supports this type of predicate natively with "multi-column inlists" that look like:
Substrait supports this kind of predicate too, which I take as some evidence it is widely used
https://substrait.io/expressions/specialized_record_expressions/#or-list-equality-expression
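The reason the single-column rewrite is not general can be shown with a small sketch. It assumes the rewrite joins the three columns with `_` into one string key (the exact expression from the original report is not reproduced here), and the function names are hypothetical. Two different rows can then produce the same key, while comparing the columns as a tuple keeps them apart.

```rust
use std::collections::HashSet;

/// Build the concatenated key assumed for the single-column rewrite,
/// e.g. ("a", "b", "c") becomes "a_b_c".
fn concat_key(code: &str, kind: &str, operation: &str) -> String {
    format!("{}_{}_{}", code, kind, operation)
}

/// Membership test via the concatenated-string rewrite.
fn matches_concat(row: (&str, &str, &str), allowed: &[(&str, &str, &str)]) -> bool {
    let keys: HashSet<String> = allowed
        .iter()
        .map(|&(c, t, o)| concat_key(c, t, o))
        .collect();
    keys.contains(&concat_key(row.0, row.1, row.2))
}

/// Membership test that compares the three columns as one tuple,
/// which is what a multi-column IN list expresses.
fn matches_tuple<'a>(
    row: (&'a str, &'a str, &'a str),
    allowed: &[(&'a str, &'a str, &'a str)],
) -> bool {
    let set: HashSet<(&str, &str, &str)> = allowed.iter().copied().collect();
    set.contains(&row)
}
```

With the allowed tuple `("a", "b_c", "d")`, the row `("a_b", "c", "d")` also concatenates to `a_b_c_d`, so the string rewrite reports a match that the tuple comparison correctly rejects.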
Describe the solution you'd like
I would like multi-column predicates to work in DataFusion.
Today, they result in an "unimplemented" error:
Here is an example showing this feature working in Postgres:
Describe alternatives you've considered
The existing InList structure looks like this: https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/expr/struct.InList.html
I am not sure how best to implement this. One idea is to simply special case multi-inputs, something like
However, my preferred approach would be to support `StructArray`s in `InList` and then implement a rewrite from the multi-column form into an IN list over structs.
While likely more complicated, this approach would then support structs in IN lists directly, which I think will be more and more valuable over time.
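A minimal sketch of the rewrite idea, using a toy expression tree rather than DataFusion's real `Expr` type. `ToyExpr` and `rewrite_multi_column_in_list` are hypothetical names, and the `InList` variant is only loosely modeled on the structure linked above. The rewrite turns `(c1, c2) IN ((v11, v12), ...)` into `struct(c1, c2) IN (struct(v11, v12), ...)`, so only single-expression IN lists need to be evaluated.

```rust
/// A toy expression tree standing in for a planner's expression type.
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    /// struct(e1, e2, ...): packs several expressions into one value.
    Struct(Vec<ToyExpr>),
    /// expr [NOT] IN (list...)
    InList {
        expr: Box<ToyExpr>,
        list: Vec<ToyExpr>,
        negated: bool,
    },
}

/// Rewrite a multi-column IN list into a single-expression IN list
/// over struct values. Returns None when any tuple's arity differs
/// from the number of columns on the left-hand side.
fn rewrite_multi_column_in_list(
    columns: Vec<ToyExpr>,
    tuples: Vec<Vec<ToyExpr>>,
    negated: bool,
) -> Option<ToyExpr> {
    let arity = columns.len();
    if tuples.iter().any(|t| t.len() != arity) {
        return None;
    }
    Some(ToyExpr::InList {
        expr: Box::new(ToyExpr::Struct(columns)),
        list: tuples.into_iter().map(ToyExpr::Struct).collect(),
        negated,
    })
}
```

The appeal of this shape is that the multi-column case needs no new expression variant; the cost is that the evaluator must be able to compare struct values.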
Additional context
No response