Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of unnest even more #6961

Closed
alamb opened this issue Jul 13, 2023 · 5 comments · Fixed by #7371
Closed

Improve performance of unnest even more #6961

alamb opened this issue Jul 13, 2023 · 5 comments · Fixed by #7371
Labels
performance Make DataFusion faster

Comments

@alamb
Copy link
Contributor

alamb commented Jul 13, 2023

Basically the Unest exec plan could be made faster if we reduced some copies. Here is the basic idea in case anyone wants to do that

    // Create an array with the unnested values of the list array, given the list
    // array:
    //
    //   [1], null, [2, 3, 4], null, [5, 6]
    //
    // the result array is:
    //
    //   1, null, 2, 3, 4, null, 5, 6
    //
    let unnested_array = unnest_array(list_array)?;

This looks very much the same to me as calling list_array.values() to get access to the underlying values: https://docs.rs/arrow/latest/arrow/array/struct.GenericListArray.html#method.values

In this case the values array would be more like

[1, 2, 3, 4, 5, 6]

And the offsets of the list array would be would be like (I think):

[0, 1, 1, 3, 3, 6]

With a null mask showing the second and fourth element are null

So I was thinking you could calculate the take indices directly from the offsets / nulls without having to copy all the values out of the underlying array

Originally posted by @alamb in #6903 (comment)

@alamb alamb added the performance Make DataFusion faster label Jul 13, 2023
@Dandandan Dandandan changed the title Improve performance of unest even more Improve performance of unnest even more Jul 13, 2023
@jhorstmann
Copy link
Contributor

One interesting special case is that if the list array does not have any nulls at all, then list_array.values() could be returned directly, without any take indices.

@smiklos
Copy link
Contributor

smiklos commented Jul 16, 2023

Is this a good first issue? Would it be possible to take it?

@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2023

Is this a good first issue? Would it be possible to take it?

Hi @smiklos -- thanks!

I think this would be a reasonable first issue if you are willing to learn more about how arrow ListArrays work. THe current code is well documented and tested so I think it would be good.

This work would effectively be to change the calculation of the take offsets

@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2023

FWIW I was making some ascii art that I wanted to share on this ticket:

Given this setup:

                                        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
                                                                ┌ ─ ─ ─ ─ ─ ─ ┐    │
 ┌─────────────┐  ┌───────┐             │     ┌───┐   ┌───┐       ┌───┐ ┌───┐       
 │   [A,B,C]   │  │ (0,3) │                   │ 1 │   │ 0 │     │ │ 1 │ │ A │ │ 0  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │ [] (empty)  │  │ (3,3) │                   │ 1 │   │ 3 │     │ │ 1 │ │ B │ │ 1  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │    NULL     │  │ (3,4) │                   │ 0 │   │ 3 │     │ │ 1 │ │ C │ │ 2  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │     [D]     │  │ (4,5) │                   │ 1 │   │ 4 │     │ │ 0 │ │ ? │ │ 3  │
 ├─────────────┤  ├───────┤             │     ├───┤   ├───┤       ├───┤ ├───┤       
 │  [NULL, F]  │  │ (5,7) │                   │ 1 │   │ 5 │     │ │ 1 │ │ D │ │ 4  │
 └─────────────┘  └───────┘             │     └───┘   ├───┤       ├───┤ ├───┤       
                                                      │ 7 │     │ │ 0 │ │ ? │ │ 5  │
                                        │  Validity   └───┘       ├───┤ ├───┤       
    Logical       Logical                  (nulls)   Offsets    │ │ 1 │ │ F │ │ 6  │
     Values       Offsets               │                         └───┘ └───┘       
                                                                │    Values   │    │
                (offsets[i],            │   ListArray               (Array)         
               offsets[i+1])                                    └ ─ ─ ─ ─ ─ ─ ┘    │
                                        └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 

We could compute the output of unnest by computing offsets

0
1
2
4
6

And then calling take on the list_array.values() without any need for an intermediate / flattened values array

A
B
C
D
F

@smiklos
Copy link
Contributor

smiklos commented Aug 1, 2023

This make sense but note that FixedSizeListArray works differently. Also, we'll need different take indices for the column getting flattened and the rest that gets expanded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Make DataFusion faster
Projects
None yet
3 participants