Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` #6635

alamb · 2023-06-11T13:11:05Z

Is your feature request related to a problem or challenge?

A user on slack was using datafusion to query parquet files from S3: https://the-asf.slack.com/archives/C01QUFS30TD/p1686147411917959

They reported that the following predicate got 1000x slower when it had 100,000 distinct filter values:

WHERE
(rate.code = 'SC1' and rate.type = 'UT1' and rate.operation = 'RO1') 
OR 
(rate.code = 'SC2' and rate.type = 'UT2' and rate.operation = 'RO2')
...

However, when it was rewritten as an inlist it went much faster:

WHERE (rate.code || '_' || rate.type || '_' || rate.operation) IN ('SC1_UT1_RO1', 'SC2_UT2_RO2', ...)

However, this rewrite is not general (for example, if code, type or operation contain _ characters.

SQL supports this type of predicate natively with "multi-column inlists" that look like;

WHERE (rate.code,  rate.type, rate.operation) IN (('SC1', 'UT1', 'RO1'), ('SC2', 'UT2', 'RO2') ...)

Substrait supports this kind of predicate too, which I take as some evidence it is widely used

https://substrait.io/expressions/specialized_record_expressions/#or-list-equality-expression

> A specialized structure that is often used is a large list of possible values. In SQL, these are typically large IN lists. They can be composed from one or more fields. There are two common patterns, single value and multi value. In pseudocode they are represented as:

Describe the solution you'd like

I would like multi-column predicates to work in DataFusion.

today, they result in an "unimplemented" error:

❯ create table foo (x int, y varchar);
0 rows in set. Query took 0.001 seconds.
❯ insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.002 seconds.
❯ select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
This feature is not implemented: Unsupported ast node in sqltorel: Tuple([Value(Number("1", false)), Value(SingleQuotedString("a"))])

Here is an example showing this feature working in posgres:

postgres=# create table foo (x int, y varchar);
CREATE TABLE
postgres=# insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
INSERT 0 3
postgres=# select * from foo;
 x | y
---+---
 1 | a
 2 | b
 3 | c
(3 rows)
postgres=# select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
 x | y
---+---
 1 | a
 2 | b
(2 rows)

Describe alternatives you've considered

The existing InList structure looks like this: https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/expr/struct.InList.html

pub struct InList {
    pub expr: Box<Expr>,
    pub list: Vec<Expr>,
    pub negated: bool,
}

I am not sure how best to implement this. One idea is to simply special case multi-inputs, something like

struct MultIInList {
    pub exprs: Vec<Expr>,
    pub lists: Vec<Vec<Expr>>,
    pub negated: bool,
}

However, my preferred approach would be to support StructArrays in InList and then implement a rewrite from

(col1, col2, ...) in (('a1', 'a2', ..), ('b1', 'b2', ..), ...)

into

struct(col1, col2) In ({ col1: 'a', col2: 'a2', ..}, {col1: 'b1', col2: 'b2', ..}, ...)

While likely more complicated this approach would then support structs in INLISTs directly which I think will be more and more valuable over time

Additional context

No response

The text was updated successfully, but these errors were encountered:

mingmwang · 2023-06-13T17:02:36Z

I also prefer the StructArrays approach. We can add a new type of Expr called StructArrays

alamb · 2023-06-13T20:47:32Z

We can add a new type of Expr called StructArrays

I think we could also use the existing ScalarValue like this : https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#variant.Struct

The trick in that case would be implementing StructArray support in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/expressions/in_list.rs -- @tustvold said on slack "it would be straightforward" 😆 which it might be for him, I don't think it would be a good first issue

my-vegetable-has-exploded · 2023-12-03T15:04:28Z

Hi @alamb @mingmwang ,I am interesting in this issue. But I'm not sure about some of the questions.

('a1', 'a2', ..) will be resolved as Tuple([Value('a'), Value('b)], ....)(which can be handled as ScalarValue(struct)), and (col1, col2, ...) will be Tuple([Identifier("col1"), Identifier("col1"), ......]), I don't know how represent Tuple(Vec<Column>) in logical_expr? Do we need to add a new type of Expr for it?

alamb · 2023-12-06T21:16:40Z

Tuple(Vec)

Perhaps it could be represented like ScalarValue::List(..) 🤔

my-vegetable-has-exploded · 2023-12-07T04:57:28Z

Perhaps it could be represented like ScalarValue::List(..) 🤔

Sorry, maybe I don't get your point. I think (col1, col2, ...) likes multi-column projection from table rather than scalarvalue, so I don't think it can be be represented like ScalarValue::List(..).

alamb · 2023-12-07T21:07:34Z

I think the idea is that

col1	col2
A	1
B	2
C	3

So a query like SELECT * FROM t WHERE (col1, col2) IN (('A', 1), ('B', 2))

Rather than actually implementing a second dimension in InExpr we could potentially keep things simpler with a rewrite like this:

SELECT * 
FROM t 
WHERE (struct(col1, col2)) IN (struct('A', 1)), struct ('B', 2))`

Where the struct function does this:

DataFusion CLI v33.0.0
❯ select struct('A', 1);
+----------------------------+
| struct(Utf8("A"),Int64(1)) |
+----------------------------+
| {c0: A, c1: 1}             |
+----------------------------+

This would likely be a much simpler implementation and would be faster than the chained or shown above

my-vegetable-has-exploded · 2023-12-08T08:59:07Z

Thanks for your patience @alamb. After your explanation, I feel like I understand it a little bit more clearly.

Tuple([Value('a'), Value('b)], ....) can be handled as ScalarValue(struct)， struct(col1, col2) will be BuiltinScalarFunction::Struct whose args are col1 and col2 ...

And in_list code is generalized.

https://github.com/apache/arrow-datafusion/blob/c0c9e8888878c5d7f2586cf605702430c94ea425/datafusion/physical-expr/src/expressions/in_list.rs#L360-L363

But StructArray is not comparable in arrow-rs since it is nested. Should we implement compare in datafusion or upstream?

For example,

❯ CREATE TABLE colors (
    color_id INT PRIMARY KEY,
    color_name VARCHAR(50)
);
INSERT INTO colors (color_id, color_name) VALUES (1, 'Red'), (2, 'Blue');

❯ SELECT * FROM colors WHERE struct(color_id) IN (struct(arrow_cast(1,'Int32')));
Arrow error: Invalid argument error: Invalid comparison operation: Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) == Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])

alamb · 2023-12-08T18:40:55Z

But StructArray is not comparable in arrow-rs since it is nested. Should we implement compare in datafusion or upstream?

I see your point.

The limitation appears to be in arrow itself: https://github.com/apache/arrow-rs/blob/7e289134a8d9f46a92a2759a7b2488b17993fd5b/arrow-ord/src/cmp.rs#L202-L204

I think it might make sense to support equality of nested types upstream, though I am not sure about ordering comparsions (<, <=, etc)

Perhaps it is worth filing a ticket in arrow-rs and proposing adding StructArray equality comparison support to eq and not_eq 🤔

my-vegetable-has-exploded · 2023-12-17T15:10:54Z

https://github.com/apache/arrow-rs/blob/9a1e8b572d11078e813fffe3d5c7106b6953d58c/arrow-cast/src/cast.rs#L163-L164
The cast between structarrays also needs to support, I think these are the inherent problems of nested types. I'd like complete this ticket when these are all done in upstream.

alamb added the enhancement New feature or request label Jun 11, 2023

my-vegetable-has-exploded mentioned this issue Dec 11, 2023

Support equality of StructArray apache/arrow-rs#5199

Closed

This was referenced Dec 18, 2023

Support cast between StructArray. apache/arrow-rs#5219

Closed

Support multi column IN lists #8590

Closed

samuelcolvin mentioned this issue Aug 8, 2024

Support tuples as types #11896

Merged

alamb closed this as completed in #11896 Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` #6635

Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` #6635

alamb commented Jun 11, 2023

mingmwang commented Jun 13, 2023

alamb commented Jun 13, 2023

my-vegetable-has-exploded commented Dec 3, 2023 •

edited

Loading

alamb commented Dec 6, 2023

my-vegetable-has-exploded commented Dec 7, 2023

alamb commented Dec 7, 2023

my-vegetable-has-exploded commented Dec 8, 2023

alamb commented Dec 8, 2023

my-vegetable-has-exploded commented Dec 17, 2023

Support multi column IN lists like (c1, c2) IN ((c1, c2), ,,,) #6635

Support multi column IN lists like (c1, c2) IN ((c1, c2), ,,,) #6635

Comments

alamb commented Jun 11, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

mingmwang commented Jun 13, 2023

alamb commented Jun 13, 2023

my-vegetable-has-exploded commented Dec 3, 2023 • edited Loading

alamb commented Dec 6, 2023

my-vegetable-has-exploded commented Dec 7, 2023

alamb commented Dec 7, 2023

my-vegetable-has-exploded commented Dec 8, 2023

alamb commented Dec 8, 2023

my-vegetable-has-exploded commented Dec 17, 2023

Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` #6635

Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` #6635

my-vegetable-has-exploded commented Dec 3, 2023 •

edited

Loading