-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix null_count
on compute_record_batch_statistics
to report null counts across partitions
#10468
fix null_count
on compute_record_batch_statistics
to report null counts across partitions
#10468
Conversation
|
||
for partition in batches.iter() { | ||
for batch in partition { | ||
for (stat_index, col_index) in projection.iter().enumerate() { | ||
column_statistics[stat_index].null_count = | ||
Precision::Exact(batch.column(*col_index).null_count()); | ||
null_counts[stat_index] += batch.column(*col_index).null_count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would be fewer bounds checks i think if we used zip
here, rather than .enumerate()
and lookup each column
null_count
on compute_record_batch_statistics
null_count
on compute_record_batch_statistics
to report null counts across partitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @samuelcolvin
Looks like a very good improvement to me
], | ||
}; | ||
|
||
assert_eq!(actual, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified test coverage by running this test without the code in this PR and the test fails like this
assertion `left == right` failed
left: Statistics { num_rows: Exact(6), total_byte_size: Exact(368), column_statistics: [ColumnStatistics { null_count: Exact(1), max_value: Absent, min_value: Absent, distinct_count: Absent }] }
right: Statistics { num_rows: Exact(6), total_byte_size: Exact(368), column_statistics: [ColumnStatistics { null_count: Exact(3), max_value: Absent, min_value: Absent, distinct_count: Absent }] }
I took the liberty of merging up from main and runnning |
Thanks again @samuelcolvin -- sorry for the delay in merging |
…counts across partitions (apache#10468) * fix null_count on compute_record_batch_statistics * fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
Rationale for this change
Maybe I'm missing something or being dumb, but while reading
datafusion::physical_plan::common::compute_record_batch_statistics
I noticed that thenull_count
took the value from the last partition, not the sum of all partition.What changes are included in this PR?
compute_record_batch_statistics
Are these changes tested?
yes
Are there any user-facing changes?
I don't think so, perhaps some queries are now correct that were previously incorrect.