
avoid copying listarray in unset exec #7002

Closed
wants to merge 4 commits

Conversation

@smiklos (Contributor) commented Jul 17, 2023:

Which issue does this PR close?

Closes #6961.

Rationale for this change

To avoid unnecessary copying of arrays, improving performance.

To be honest, I'm not sure these changes make this perform better (I can work on some benchmarks); in the end we need the copied/modified array to create the unnested RecordBatch, and the take indices were already not using the copied data.

There's also the extra optimization mentioned in the comments regarding the case when list_array has no null values.

What changes are included in this PR?

UnnestExec is changed such that it uses the take kernel instead of copying values manually + using concat.
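The core idea can be sketched in plain Rust, without Arrow: from the per-row list lengths, build the index vector handed to the take kernel so that row i of every other column is repeated once per element of its list. The function name and shape below are illustrative, not the PR's actual code:

```rust
/// Sketch (not the actual DataFusion code): given the list length of each
/// row, build the indices passed to the `take` kernel so that row `i` of
/// every non-unnested column is repeated once per element of its list.
fn take_indices(list_lengths: &[usize]) -> Vec<usize> {
    let mut indices = Vec::with_capacity(list_lengths.iter().sum());
    for (row, &len) in list_lengths.iter().enumerate() {
        // Repeat the row index once per list element; `take` then expands
        // each column in a single vectorized pass instead of a manual
        // copy-per-value followed by `concat`.
        indices.extend(std::iter::repeat(row).take(len));
    }
    indices
}

fn main() {
    // Rows with lists of lengths 2, 0 and 3 expand into 5 output rows:
    // row 0 twice, row 1 not at all, row 2 three times.
    assert_eq!(take_indices(&[2, 0, 3]), vec![0, 0, 2, 2, 2]);
}
```

A single take call per column with these indices replaces the per-row copy + concat of the old implementation.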

Are these changes tested?

I've added a unit test that verifies the calculation of the new take indices (specific to list_array, not the other columns in the RecordBatch). The rest should be covered by the existing physical plan tests for unnest.

Are there any user-facing changes?

Nope

@github-actions github-actions bot added the core Core DataFusion crate label Jul 17, 2023
@@ -236,21 +236,21 @@ fn build_batch(
     match list_array.data_type() {
         DataType::List(_) => {
             let list_array = list_array.as_any().downcast_ref::<ListArray>().unwrap();
-            unnest_batch(batch, schema, column, &list_array)
+            unnest_batch(batch, schema, column, &list_array, list_array.values())
Contributor Author (@smiklos):

.values seems to not have a common trait, so it needs to be passed down here, where we have the concrete types.

Contributor:

Yes, this makes sense

{
-    let elem_type = match list_array.data_type() {
+    let _elem_type = match list_array.data_type() {
@jackwener (Member) commented Jul 18, 2023:

Looks like it's unused; we can remove it.

Contributor Author (@smiklos):

It does some validation, but perhaps it's already checked before the exec runs?

Contributor Author (@smiklos):

Removed it.

Member:

> It does some validation, but perhaps it's already checked before the exec runs?

Yes, it was verified in fn list_lengths<T>(list_array: &T)

@smiklos (Contributor Author) commented Jul 18, 2023:

I've created a benchmark and ran it against main and this branch, and the results look great: this branch performs a lot better. Should I commit the benchmark to this PR or make a separate one?

@jackwener (Member):
cc @tustvold @alamb @izveigor

@smiklos (Contributor Author) commented Jul 18, 2023:

I see one of the tests actually doesn't pass (anymore?); I'll take a look.

@smiklos (Contributor Author) commented Jul 18, 2023:

Tests should pass now. I was surprised by the values array behaving differently for FixedSizeList...
The code could be improved, but I'm waiting for feedback first.

@alamb (Contributor) commented Jul 19, 2023:

Thank you @smiklos -- I plan to review this carefully tomorrow

@alamb (Contributor) commented Jul 20, 2023:

> I've created a benchmark and ran it against main and this branch, and the results look great: this branch performs a lot better. Should I commit the benchmark to this PR or make a separate one?

I recommend a separate PR

@alamb (Contributor) left a comment:

Thank you very much @smiklos -- this looks neat, but I don't really understand how it is faster, as it still calls take twice.

It would be great if you could share your benchmark and result with us. If it is faster then I think this PR is good to go.

The test case in dataframe.rs for FixedSizeList is 👨‍🍳 👌

cc @vincev

let list_lengths = list_lengths(list_array)?;

// Create the indices for the take kernel and then use those indices to create
// the unnested record batch.
match list_lengths.data_type() {
    DataType::Int32 => {
        let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?;
        let unnested_array =
            unnest_array(list_array, list_array_values, list_lengths)?;
Contributor:

I am still struggling to understand the need to call unnest_array and why the take indices can't be calculated directly against the original values array.

Contributor Author (@smiklos):

Can you clarify which take indices you mean?
The ones that were already on main were used to expand the rest of the columns (their values, not the column being unnested).

So unnest_array is simply needed because we need to change that column (physically unnest it).
The take indices for the rest of the columns could be calculated based on the values array; it was perhaps prettier to have this intermediate state initially. I can try to get rid of list_sizes altogether.
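To make the distinction concrete, here is a plain-Rust illustration (offsets and validity handling are simplified relative to Arrow's actual ListArray, and the function is hypothetical): physically unnesting the list column means selecting, from the flattened values array, the ranges that belong to valid lists.

```rust
/// Illustrative sketch, not the PR's code: compute indices into the
/// flattened `values` array for the unnested list column itself.
/// `offsets[i]..offsets[i + 1]` is the value range of row `i`; rows whose
/// list is null contribute nothing to the output.
fn value_indices(offsets: &[usize], valid: &[bool]) -> Vec<usize> {
    let mut indices = Vec::new();
    for (row, win) in offsets.windows(2).enumerate() {
        if valid[row] {
            // Keep every value of this (valid) list.
            indices.extend(win[0]..win[1]);
        }
    }
    indices
}

fn main() {
    // Lists [a, b], [] and [c, d, e]: all valid, so all 5 values survive.
    assert_eq!(
        value_indices(&[0, 2, 2, 5], &[true, true, true]),
        vec![0, 1, 2, 3, 4]
    );
}
```

When every list is valid the result is simply 0..values.len(), which is why the null-free case mentioned earlier could skip the take on this column entirely.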

@vincev (Contributor) commented Jul 20, 2023:

I ran the benchmark I used before; with this PR it is about ~10% faster.

This is main (run a few times; this is closer to the mean value):

./unnest-main
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+
Elapsed: 6.602

with this PR:

./unnest-pr  
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+
Elapsed: 6.039

This unnests 5M rows into 250M rows.

@vincev (Contributor) commented Jul 20, 2023:

Here are the benchmark and the generator for the data so you can run it and see if you can get the same numbers.

@smiklos (Contributor Author) commented Jul 20, 2023:

Here is my branch with the benchmark. https://github.com/smiklos/arrow-datafusion/blob/unnest-benchmark/datafusion/core/benches/unnest_query.rs

@vincev it seems you also measure the time it takes to read the parquet file. In my bench I create the values in-memory and saw a lot of speedup (from ~50ms to ~10ms for larger batches).

@smiklos (Contributor Author) commented Jul 20, 2023:

> Thank you very much @smiklos -- this looks neat, but I don't really understand how it is faster, as it still calls take twice.
>
> It would be great if you could share your benchmark and result with us. If it is faster then I think this PR is good to go.
>
> The test case in dataframe.rs for FixedSizeList is 👨‍🍳 👌
>
> cc @vincev

It calls take for each column. It may skip calling take for the column being unnested if there are no null values (that is a special case, but it can speed up certain queries even more).

There could be another special case if fixed size arrays have only one value per array and no nulls, since in that case there's no need to transform the data at all. Otherwise, for most cases I don't see how we can avoid take.
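The special cases described above can be summarized as a strategy choice. The enum and helper below are hypothetical (they do not exist in the PR), just to make the cases concrete; slicing of the values buffer is ignored for simplicity:

```rust
/// Hypothetical strategy enum, only to illustrate the special cases
/// discussed in the comment above (not part of the PR).
#[derive(Debug, PartialEq)]
enum UnnestStrategy {
    /// FixedSizeList of size 1 with no nulls: one value per row, so the
    /// values array can be returned unchanged.
    Identity,
    /// No nulls: the flattened values are already the unnested column
    /// (assuming the values buffer is not sliced).
    ReuseValues,
    /// General case: `take` must select the values of the valid lists.
    Take,
}

/// Picks the cheapest path based on the list's fixed size (if any) and
/// its null count.
fn pick_strategy(fixed_size: Option<usize>, null_count: usize) -> UnnestStrategy {
    match (fixed_size, null_count) {
        (Some(1), 0) => UnnestStrategy::Identity,
        (_, 0) => UnnestStrategy::ReuseValues,
        _ => UnnestStrategy::Take,
    }
}

fn main() {
    assert_eq!(pick_strategy(Some(1), 0), UnnestStrategy::Identity);
    assert_eq!(pick_strategy(None, 3), UnnestStrategy::Take);
}
```

Only the last arm pays for a take on the unnested column; the other columns always need one to be expanded.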

@smiklos (Contributor Author) commented Jul 27, 2023:

Looking at the recent changes needed to unnest, it's best to wait until #7088 is resolved.

@smiklos smiklos closed this Jul 27, 2023
@alamb (Contributor) commented Jul 27, 2023:

Thank you @smiklos -- I think we'll soon get work inspired by this PR merged in. Thank you for pushing us along.

Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of unnest even more
4 participants