-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avoid copying listarray in unset exec #7002
Conversation
@@ -236,21 +236,21 @@ fn build_batch( | |||
match list_array.data_type() { | |||
DataType::List(_) => { | |||
let list_array = list_array.as_any().downcast_ref::<ListArray>().unwrap(); | |||
unnest_batch(batch, schema, column, &list_array) | |||
unnest_batch(batch, schema, column, &list_array, list_array.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.values
seems to not have a common trait so it needs to be passed down here where we have the concrete types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this makes sense
{ | ||
let elem_type = match list_array.data_type() { | ||
let _elem_type = match list_array.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look like it's useless, we can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does some validation but perhaps it's checked already before the exec runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does some validation but perhaps it's checked already before the exec runs?
Yes, it was verified in fn list_lengths<T>(list_array: &T)
I've created a benchmark and ran it against main and this branch and the results look great. This branch seems to perform a lot better. should I commit the benchmark to this pr or make a separate one? |
I see one of the tests actually don't pass (anymore?), I'll take a look |
Tests should pass now. Got surprised by the values array behaving differently for FixedSizedLists... |
Thank you @smiklos -- I plan to review this carefully tomorrow |
I recommend a separate PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @smiklos -- this looks neat but I don't really understand how it is faster as it still calls take
twice
It would be great if you could share your benchmark and result with us. If it is faster then I think this PR is good to go.
The test case in dataframe.rs for FixedSizeList is 👨🍳 👌
cc @vincev
let list_lengths = list_lengths(list_array)?; | ||
|
||
// Create the indices for the take kernel and then use those indices to create | ||
// the unnested record batch. | ||
match list_lengths.data_type() { | ||
DataType::Int32 => { | ||
let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?; | ||
let unnested_array = | ||
unnest_array(list_array, list_array_values, list_lengths)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still struggling to understand the need to call unnest_array
and why the take indexes can't be calculated directly against the original values array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify which take indices you mean?
The ones that were already on main were used to expand the rest of the columns (their values and not the one that is unnested).
So unnest_array is simply needed as we need to change that column (physically unnest it).
The take indices for the rest of the columns could be calculated based on the values array, it was perhaps prettier to have this intermediate state initially. I can try and get rid of list_sizes all together.
@@ -236,21 +236,21 @@ fn build_batch( | |||
match list_array.data_type() { | |||
DataType::List(_) => { | |||
let list_array = list_array.as_any().downcast_ref::<ListArray>().unwrap(); | |||
unnest_batch(batch, schema, column, &list_array) | |||
unnest_batch(batch, schema, column, &list_array, list_array.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this makes sense
I run the benchmark I used before, with this PR is about ~10% faster. This is main (run a few time this is closer to the mean value):
with this PR:
This unnest 5M rows into 250M rows. |
Here are the benchmark and the generator for the data so you can run it and see if you can get the same numbers. |
Here is my branch with the benchmark. https://github.com/smiklos/arrow-datafusion/blob/unnest-benchmark/datafusion/core/benches/unnest_query.rs @vincev it seems you also measure the time it takes to read the parquet file. in my bench I create the values in-memory and saw a lot of speedup (from ~50ms to 10ms or so for larger batches) |
It calls take for each column. It may skip calling take for the column being unnested if there are no null values. (that is a special case but can speed up certain queries even more). There could be another special case if fixed size arrays have only one value per array and no nulls as in that case there's no need for transforming the data. Otherwise for most cases I don't see how we can avoid take |
Looking at recent changes needed to unnest, it's best to wait until #7088 is resolved |
Thank you @smiklos -- I think we'll get work inspired by this PR in soon. Thank you for pushing us along |
Which issue does this PR close?
Closes #6961.
Rationale for this change
To avoid unnecessary copying of arrays, improving performance.
Tbh, I'm not sure these changes make this perform better (can work on some benchmarks), in the end we need the copied/modified array to create the unnested recordbatch and the take indices were already not using the copied data.
There's also the extra optimization mentioned in the comments regarding the case when
list_array
has no null values.What changes are included in this PR?
UnnestExec is changed such that it uses the take kernel instead of copying values manually + using concat.
Are these changes tested?
I've added a unit test to verify calculating the new take indices (specific to list_array and not the other columns in the recordbatch)
The rest should be covered by existing physical plan tests for unnest.
Are there any user-facing changes?
Nope