-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix compute_record_batch_statistics
wrong with projection
#8489
Conversation
# Conflicts: # datafusion/physical-plan/src/joins/hash_join_utils.rs
# Conflicts: # datafusion/physical-plan/src/joins/hash_join_utils.rs
.iter() | ||
.flatten() | ||
.map(|b| { | ||
b.columns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RecordBatch.project
method is not used, because there is a clone
internally, so there is no need to generate a new RecordBatch here.
num_rows: Precision::Exact(3), | ||
total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes | ||
total_byte_size: Precision::Exact(byte_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is appropriate, if you have any good suggestions please leave a message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok and a nice way to make the code less brittle to future changes in arrow's layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious as to why the previous code was Precision::Exact(464)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it happens to be the (current) size of the record batch in the test:
let batch = RecordBatch::try_new(
Arc::clone(&schema),
vec![
Arc::new(Float32Array::from(vec![1., 2., 3.])),
Arc::new(Float64Array::from(vec![9., 8., 7.])),
Arc::new(UInt64Array::from(vec![4, 5, 6])),
],
)?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me -- thank you @Asura7969 and I apologize for the delay in reviewing. cc @Dandandan
num_rows: Precision::Exact(3), | ||
total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes | ||
total_byte_size: Precision::Exact(byte_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok and a nice way to make the code less brittle to future changes in arrow's layout
--------------------CoalesceBatchesExec: target_batch_size=8192 | ||
----------------------RepartitionExec: partitioning=Hash([col0@0], 4), input_partitions=1 | ||
------------------------MemoryExec: partitions=1, partition_sizes=[3] | ||
----------------ProjectionExec: expr=[col0@2 as col0, col1@3 as col1, col2@4 as col2, col0@0 as col0, col1@1 as col1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks to me like the change is due to the fact the join inputs were reordered and this projection puts the columns back the expected way.
Same thing with the projection below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Asura7969
let total_byte_size = batches | ||
.iter() | ||
.flatten() | ||
.map(|b| { | ||
projection | ||
.iter() | ||
.map(|index| b.column(*index).get_array_memory_size()) | ||
.sum::<usize>() | ||
}) | ||
.sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
num_rows: Precision::Exact(3), | ||
total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes | ||
total_byte_size: Precision::Exact(byte_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it happens to be the (current) size of the record batch in the test:
let batch = RecordBatch::try_new(
Arc::clone(&schema),
vec![
Arc::new(Float32Array::from(vec![1., 2., 3.])),
Arc::new(Float64Array::from(vec![9., 8., 7.])),
Arc::new(UInt64Array::from(vec![4, 5, 6])),
],
)?;
Thanks again @Asura7969 @ |
Thanks @Asura7969 and @alamb @jackwener for the review |
Which issue does this PR close?
Closes #8406.
Rationale for this change
Projection size is not calculated correctly
What changes are included in this PR?
Select the correct columns, not all
Are these changes tested?
test_compute_record_batch_statistics
Are there any user-facing changes?