Improve performance of unnest even more #6961

alamb · 2023-07-13T18:03:03Z

Basically the Unnest exec plan could be made faster if we reduced some copies. Here is the basic idea in case anyone wants to do that:

    // Create an array with the unnested values of the list array, given the list
    // array:
    //
    //   [1], null, [2, 3, 4], null, [5, 6]
    //
    // the result array is:
    //
    //   1, null, 2, 3, 4, null, 5, 6
    //
    let unnested_array = unnest_array(list_array)?;

This looks very much the same to me as calling list_array.values() to get access to the underlying values: https://docs.rs/arrow/latest/arrow/array/struct.GenericListArray.html#method.values

In this case the values array would be more like

[1, 2, 3, 4, 5, 6]

And the offsets of the list array would be like (I think):

[0, 1, 1, 4, 4, 6]

With a null mask showing that the second and fourth elements are null.

So I was thinking you could calculate the take indices directly from the offsets / nulls without having to copy all the values out of the underlying array
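A minimal sketch of that idea, using plain slices as stand-ins for the Arrow offsets and validity buffers (the function name `take_indices` and the slice-based model are assumptions for illustration, not existing DataFusion or arrow-rs code). It follows the semantics in the code comment above, where a null list produces a single null output row; `None` marks such a row, on the understanding that the arrow `take` kernel yields a null output for a null index.

```rust
/// Hypothetical helper: derive take indices into the child values array
/// from the list offsets and the list-level validity mask.
/// `None` stands for "emit one null output row" (a null list entry).
fn take_indices(offsets: &[usize], valid: &[bool]) -> Vec<Option<usize>> {
    // There is one more offset than there are list rows.
    assert_eq!(offsets.len(), valid.len() + 1);
    let mut indices = Vec::new();
    for (row, window) in offsets.windows(2).enumerate() {
        if valid[row] {
            // Valid list: select every child position in [start, end).
            indices.extend((window[0]..window[1]).map(Some));
        } else {
            // Null list: one null output row, nothing read from the values.
            indices.push(None);
        }
    }
    indices
}

/// Stand-in for the take kernel on an i32 values array.
fn take(values: &[i32], indices: &[Option<usize>]) -> Vec<Option<i32>> {
    indices.iter().map(|&i| i.map(|j| values[j])).collect()
}
```

For the example `[1], null, [2, 3, 4], null, [5, 6]`, calling `take_indices(&[0, 1, 1, 4, 4, 6], &[true, false, true, false, true])` and passing the result to `take` with values `[1, 2, 3, 4, 5, 6]` gives `1, null, 2, 3, 4, null, 5, 6` without first building a flattened copy of the values.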

Originally posted by @alamb in #6903 (comment)


jhorstmann · 2023-07-14T12:05:36Z

One interesting special case is that if the list array does not have any nulls at all, then list_array.values() could be returned directly, without any take indices.
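A rough sketch of this fast path, again with plain slices standing in for Arrow buffers (`unnest_fast_path` is a hypothetical name). It assumes that empty lists contribute no output rows, and it restricts the result to the range covered by the first and last offsets, on the assumption that a sliced list array may still reference a larger child array.

```rust
/// Hypothetical fast path: when no list entry is null, the unnested output is
/// just the child values covered by the offsets, so no take is needed.
/// Returns `None` when the caller has to fall back to computing take indices.
fn unnest_fast_path<'a>(
    values: &'a [i32],
    offsets: &[usize],
    valid: &[bool],
) -> Option<&'a [i32]> {
    if valid.iter().all(|&v| v) {
        // Only the region between the first and last offset belongs to this
        // (possibly sliced) list array.
        let start = *offsets.first()?;
        let end = *offsets.last()?;
        Some(&values[start..end])
    } else {
        None
    }
}
```

The result borrows from `values`, which mirrors returning the existing child array (or a zero-copy slice of it) rather than materialising a new one.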

smiklos · 2023-07-16T08:58:22Z

Is this a good first issue? Would it be possible to take it?

alamb · 2023-07-16T09:53:46Z

> Is this a good first issue? Would it be possible to take it?

Hi @smiklos -- thanks!

I think this would be a reasonable first issue if you are willing to learn more about how Arrow ListArrays work. The current code is well documented and tested, so I think it would be good.

This work would effectively be to change how the take indices are calculated.

alamb · 2023-07-31T20:29:52Z

FWIW I was making some ascii art that I wanted to share on this ticket:

Given this setup:

                                        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
                                                                ┌ ─ ─ ─ ─ ─ ─ ┐    │
 ┌─────────────┐  ┌───────┐             │     ┌───┐   ┌───┐       ┌───┐ ┌───┐       
 │   [A,B,C]   │  │ (0,3) │                   │ 1 │   │ 0 │     │ │ 1 │ │ A │ │ 0  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │ [] (empty)  │  │ (3,3) │                   │ 1 │   │ 3 │     │ │ 1 │ │ B │ │ 1  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │    NULL     │  │ (3,4) │                   │ 0 │   │ 3 │     │ │ 1 │ │ C │ │ 2  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │     [D]     │  │ (4,5) │                   │ 1 │   │ 4 │     │ │ 0 │ │ ? │ │ 3  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │  [NULL, F]  │  │ (5,7) │                   │ 1 │   │ 5 │     │ │ 1 │ │ D │ │ 4  │
 └─────────────┘  └───────┘             │     └───┘   ├───┤       ├───┤ ├───┤       
                                                      │ 7 │     │ │ 0 │ │ ? │ │ 5  │
                                        │  Validity   └───┘       ├───┤ ├───┤       
    Logical       Logical                  (nulls)   Offsets    │ │ 1 │ │ F │ │ 6  │
     Values       Offsets               │                         └───┘ └───┘       
                                                                │    Values   │    │
                (offsets[i],            │   ListArray               (Array)         
               offsets[i+1])                                    └ ─ ─ ─ ─ ─ ─ ┘    │
                                        └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

We could compute the output of unnest by computing take indices from the offsets and then calling take on list_array.values(), without any need for an intermediate / flattened values array:

A
B
C
D
F

smiklos · 2023-08-01T13:24:53Z

This makes sense, but note that FixedSizeListArray works differently. Also, we'll need different take indices for the column being flattened and for the rest of the columns, which get expanded.
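To illustrate the two points, here is a small sketch with plain slices in place of Arrow buffers. The names `unnest_indices` and `fixed_size_offsets` are hypothetical, and the sketch assumes that null and empty lists produce no output rows, which is only one of the possible semantics. The first function returns one index set for the flattened column and one for the columns that are repeated; the second synthesises offsets for a fixed-size list, which has no offsets buffer because row `i` always covers child positions `i * size` up to `(i + 1) * size`.

```rust
/// Hypothetical helper returning two index sets:
/// - `value_idx`: positions to take from the list's child values
/// - `row_idx`: input row to repeat for every other column
/// Assumption: null and empty lists are skipped entirely.
fn unnest_indices(offsets: &[usize], valid: &[bool]) -> (Vec<usize>, Vec<usize>) {
    assert_eq!(offsets.len(), valid.len() + 1);
    let mut value_idx = Vec::new();
    let mut row_idx = Vec::new();
    for (row, window) in offsets.windows(2).enumerate() {
        if !valid[row] {
            continue; // null list: its child range is never read
        }
        for i in window[0]..window[1] {
            value_idx.push(i);
            row_idx.push(row);
        }
    }
    (value_idx, row_idx)
}

/// Offsets implied by a fixed-size list with `len` rows of `size` elements each.
fn fixed_size_offsets(len: usize, size: usize) -> Vec<usize> {
    (0..=len).map(|i| i * size).collect()
}
```

With the offsets `[0, 3, 3, 4, 5, 7]` and validity `[1, 1, 0, 1, 1]` from the diagram above, `unnest_indices` yields value indices `[0, 1, 2, 4, 5, 6]` and row indices `[0, 0, 0, 3, 4, 4]`. Position 3 of the values array, which sits under the null list, is never selected.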

alamb mentioned this issue Jul 13, 2023

Improve unnest_column performance #6903

Merged

alamb added the performance Make DataFusion faster label Jul 13, 2023

Dandandan changed the title ~~Improve performance of unest even more~~ Improve performance of unnest even more Jul 13, 2023

izveigor mentioned this issue Jul 17, 2023

General ticket for Array/List data type #6863

Open

smiklos mentioned this issue Jul 17, 2023

avoid copying listarray in unset exec #7002

Closed

smiklos mentioned this issue Aug 1, 2023

Make unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs #7088

Closed

smiklos mentioned this issue Aug 22, 2023

Optimize Unnest and implement skip_nulls=true if specified #7371

Merged

1 task

alamb closed this as completed in #7371 Aug 24, 2023
