-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify parquet statistics generation #5183
Conversation
// If only computing chunk-level statistics compute them here, page-level statistics | ||
// are computed in [`Self::write_mini_batch`] and used to update chunk statistics in | ||
// [`Self::add_data_page`] | ||
if self.statistics_enabled == EnabledStatistics::Chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the major change, rather than computing the chunk statistics here, we always compute statistics per-page, unless completely disabled, and just skip writing them to the page as per the change in #5181.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defiantly simpler than #5181 IMO. Having less places that do update_min/update_max(...)
on the column's stats makes it much easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any performance implications of always computing the statistics even when they aren't written to the page?
Though it seems like if we are writing column chunk level statistics anyways we would have to calculate the min/max across all values anyways, so tracking per page shouldn't be any more expensive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially for byte arrays you will have additional allocations per page, where previously you would have them per chunk. In practice this is extremely unlikely to matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this and it makes sense to me, though I am not a super expert in this area of code
// If only computing chunk-level statistics compute them here, page-level statistics | ||
// are computed in [`Self::write_mini_batch`] and used to update chunk statistics in | ||
// [`Self::add_data_page`] | ||
if self.statistics_enabled == EnabledStatistics::Chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any performance implications of always computing the statistics even when they aren't written to the page?
Though it seems like if we are writing column chunk level statistics anyways we would have to calculate the min/max across all values anyways, so tracking per page shouldn't be any more expensive
} | ||
} | ||
}; | ||
if let Some(min) = min { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This min is only passed in when precomputed statistics are used (write_batch_with_statistics
). Is that correct? I couldn't find another callsite where min
and max
were non zero
Thank you @tustvold |
Which issue does this PR close?
Closes #.
Rationale for this change
Follow up to #5181
What changes are included in this PR?
Are there any user-facing changes?