
avoid copying listarray in unset exec #7002

Closed
wants to merge 4 commits

Conversation

@smiklos (Contributor) commented Jul 17, 2023:

Which issue does this PR close?

Closes #6961.

Rationale for this change

To avoid unnecessary copying of arrays, improving performance.

To be honest, I'm not sure these changes make this perform better (I can work on some benchmarks); in the end we need the copied/modified array to create the unnested RecordBatch, and the take indices were already not using the copied data.

There's also the extra optimization mentioned in the comments regarding the case when list_array has no null values.

What changes are included in this PR?

UnnestExec is changed such that it uses the take kernel instead of copying values manually + using concat.
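The core idea can be sketched in plain Rust, without Arrow: from the per-row list lengths, build the index vector handed to the take kernel so that row i of every other column is repeated once per element of its list. The function name and shape below are illustrative, not the PR's actual code:

```rust
/// Sketch (not the actual DataFusion code): given the list length of each
/// row, build the indices passed to the `take` kernel so that row `i` of
/// every non-unnested column is repeated once per element of its list.
fn take_indices(list_lengths: &[usize]) -> Vec<usize> {
    let mut indices = Vec::with_capacity(list_lengths.iter().sum());
    for (row, &len) in list_lengths.iter().enumerate() {
        // Repeat the row index once per list element; `take` then expands
        // each column in a single vectorized pass instead of a manual
        // copy-per-value followed by `concat`.
        indices.extend(std::iter::repeat(row).take(len));
    }
    indices
}

fn main() {
    // Rows with lists of lengths 2, 0 and 3 expand into 5 output rows:
    // row 0 twice, row 1 not at all, row 2 three times.
    assert_eq!(take_indices(&[2, 0, 3]), vec![0, 0, 2, 2, 2]);
}
```

A single take call per column with these indices replaces the per-row copy + concat of the old implementation.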

Are these changes tested?

I've added a unit test that verifies the calculation of the new take indices (specific to list_array, not the other columns in the RecordBatch). The rest should be covered by the existing physical plan tests for unnest.

Are there any user-facing changes?

Nope

@github-actions github-actions bot added the core Core DataFusion crate label Jul 17, 2023
@@ -236,21 +236,21 @@ fn build_batch(
     match list_array.data_type() {
         DataType::List(_) => {
             let list_array = list_array.as_any().downcast_ref::<ListArray>().unwrap();
-            unnest_batch(batch, schema, column, &list_array)
+            unnest_batch(batch, schema, column, &list_array, list_array.values())
Contributor Author (@smiklos):

.values seems to not have a common trait, so it needs to be passed down here, where we have the concrete types.

Contributor:

Yes, this makes sense

{
-    let elem_type = match list_array.data_type() {
+    let _elem_type = match list_array.data_type() {
@jackwener (Member) commented Jul 18, 2023:

Looks like it's unused; we can remove it.

Contributor Author (@smiklos):

It does some validation, but perhaps it's already checked before the exec runs?

Contributor Author (@smiklos):

Removed it.

Member:

> It does some validation, but perhaps it's already checked before the exec runs?

Yes, it was verified in fn list_lengths<T>(list_array: &T)

@smiklos (Contributor Author) commented Jul 18, 2023:

I've created a benchmark and ran it against main and this branch, and the results look great: this branch performs a lot better. Should I commit the benchmark to this PR or make a separate one?

@jackwener (Member):
cc @tustvold @alamb @izveigor

@smiklos (Contributor Author) commented Jul 18, 2023:

I see one of the tests actually doesn't pass (anymore?); I'll take a look.

@smiklos (Contributor Author) commented Jul 18, 2023:

Tests should pass now. I was surprised by the values array behaving differently for FixedSizeList...
The code could be improved, but I'm waiting for feedback first.

@alamb (Contributor) commented Jul 19, 2023:

Thank you @smiklos -- I plan to review this carefully tomorrow

@alamb (Contributor) commented Jul 20, 2023:

> I've created a benchmark and ran it against main and this branch, and the results look great: this branch performs a lot better. Should I commit the benchmark to this PR or make a separate one?

I recommend a separate PR

@alamb (Contributor) left a comment:

Thank you very much @smiklos -- this looks neat, but I don't really understand how it is faster, as it still calls take twice.

It would be great if you could share your benchmark and result with us. If it is faster then I think this PR is good to go.

The test case in dataframe.rs for FixedSizeList is 👨‍🍳 👌

cc @vincev

let list_lengths = list_lengths(list_array)?;

// Create the indices for the take kernel and then use those indices to create
// the unnested record batch.
match list_lengths.data_type() {
    DataType::Int32 => {
        let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?;
        let unnested_array =
            unnest_array(list_array, list_array_values, list_lengths)?;
Contributor:

I am still struggling to understand the need to call unnest_array and why the take indices can't be calculated directly against the original values array.

Contributor Author (@smiklos):

Can you clarify which take indices you mean?
The ones that were already on main were used to expand the rest of the columns (their values, not the column being unnested).

So unnest_array is simply needed because we need to change that column (physically unnest it).
The take indices for the rest of the columns could be calculated based on the values array; it was perhaps prettier to have this intermediate state initially. I can try to get rid of list_sizes altogether.
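To make the distinction concrete, here is a plain-Rust illustration (offsets and validity handling are simplified relative to Arrow's actual ListArray, and the function is hypothetical): physically unnesting the list column means selecting, from the flattened values array, the ranges that belong to valid lists.

```rust
/// Illustrative sketch, not the PR's code: compute indices into the
/// flattened `values` array for the unnested list column itself.
/// `offsets[i]..offsets[i + 1]` is the value range of row `i`; rows whose
/// list is null contribute nothing to the output.
fn value_indices(offsets: &[usize], valid: &[bool]) -> Vec<usize> {
    let mut indices = Vec::new();
    for (row, win) in offsets.windows(2).enumerate() {
        if valid[row] {
            // Keep every value of this (valid) list.
            indices.extend(win[0]..win[1]);
        }
    }
    indices
}

fn main() {
    // Lists [a, b], [] and [c, d, e]: all valid, so all 5 values survive.
    assert_eq!(
        value_indices(&[0, 2, 2, 5], &[true, true, true]),
        vec![0, 1, 2, 3, 4]
    );
}
```

When every list is valid the result is simply 0..values.len(), which is why the null-free case mentioned earlier could skip the take on this column entirely.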

@vincev (Contributor) commented Jul 20, 2023:

I ran the benchmark I used before; with this PR it is about ~10% faster.

This is main (run a few times; this is closer to the mean value):

./unnest-main
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+
Elapsed: 6.602

with this PR:

./unnest-pr  
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+
Elapsed: 6.039

This unnests 5M rows into 250M rows.

@vincev (Contributor) commented Jul 20, 2023:

Here are the benchmark and the generator for the data so you can run it and see if you can get the same numbers.

@smiklos (Contributor Author) commented Jul 20, 2023:

Here is my branch with the benchmark. https://github.com/smiklos/arrow-datafusion/blob/unnest-benchmark/datafusion/core/benches/unnest_query.rs

@vincev it seems you also measure the time it takes to read the parquet file. In my bench I create the values in-memory and saw a lot of speedup (from ~50ms to ~10ms for larger batches).

@smiklos (Contributor Author) commented Jul 20, 2023:

> Thank you very much @smiklos -- this looks neat, but I don't really understand how it is faster, as it still calls take twice.
>
> It would be great if you could share your benchmark and result with us. If it is faster then I think this PR is good to go.
>
> The test case in dataframe.rs for FixedSizeList is 👨‍🍳 👌
>
> cc @vincev

It calls take for each column. It may skip calling take for the column being unnested if there are no null values (that is a special case, but it can speed up certain queries even more).

There could be another special case if fixed size arrays have only one value per array and no nulls, since in that case there's no need to transform the data at all. Otherwise, for most cases I don't see how we can avoid take.
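The special cases described above can be summarized as a strategy choice. The enum and helper below are hypothetical (they do not exist in the PR), just to make the cases concrete; slicing of the values buffer is ignored for simplicity:

```rust
/// Hypothetical strategy enum, only to illustrate the special cases
/// discussed in the comment above (not part of the PR).
#[derive(Debug, PartialEq)]
enum UnnestStrategy {
    /// FixedSizeList of size 1 with no nulls: one value per row, so the
    /// values array can be returned unchanged.
    Identity,
    /// No nulls: the flattened values are already the unnested column
    /// (assuming the values buffer is not sliced).
    ReuseValues,
    /// General case: `take` must select the values of the valid lists.
    Take,
}

/// Picks the cheapest path based on the list's fixed size (if any) and
/// its null count.
fn pick_strategy(fixed_size: Option<usize>, null_count: usize) -> UnnestStrategy {
    match (fixed_size, null_count) {
        (Some(1), 0) => UnnestStrategy::Identity,
        (_, 0) => UnnestStrategy::ReuseValues,
        _ => UnnestStrategy::Take,
    }
}

fn main() {
    assert_eq!(pick_strategy(Some(1), 0), UnnestStrategy::Identity);
    assert_eq!(pick_strategy(None, 3), UnnestStrategy::Take);
}
```

Only the last arm pays for a take on the unnested column; the other columns always need one to be expanded.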

@smiklos (Contributor Author) commented Jul 27, 2023:

Looking at the recent changes needed to unnest, it's best to wait until #7088 is resolved.

@smiklos smiklos closed this Jul 27, 2023
@alamb (Contributor) commented Jul 27, 2023:

Thank you @smiklos -- I think we'll soon get work inspired by this PR merged in. Thank you for pushing us along.

Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of unnest even more
4 participants