Cache row group column stats in ORC column writer #13262
presto-orc/src/main/java/com/facebook/presto/orc/writer/DecimalColumnWriter.java
Comments addressed; thanks for the review.
LGTM
I don't know how these classes are used. Should we reset `columnStatisticsRetainedSizeInBytes` in `reset()`?
An ORC file can be large, containing thousands of row groups. In some cases, 50% of CPU time is spent computing the retained size of the column writers. Cache the row group stats to save CPU.
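As a rough illustration of the idea (not the actual Presto code), the sketch below keeps a running total of the retained size of row group statistics instead of re-summing the list on every call. The field name `columnStatisticsRetainedSizeInBytes` and the `reset()` method come from the review discussion; the class `CachedStatsWriterSketch`, the `RowGroupStats` type, and the other method names are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an ORC column writer; not the real Presto class.
final class CachedStatsWriterSketch
{
    // Hypothetical per-row-group statistics holding an illustrative size.
    static final class RowGroupStats
    {
        private final long retainedSizeInBytes;

        RowGroupStats(long retainedSizeInBytes)
        {
            this.retainedSizeInBytes = retainedSizeInBytes;
        }

        long getRetainedSizeInBytes()
        {
            return retainedSizeInBytes;
        }
    }

    private final List<RowGroupStats> rowGroupColumnStatistics = new ArrayList<>();

    // Cached running total, updated whenever a row group is finished.
    private long columnStatisticsRetainedSizeInBytes;

    void finishRowGroup(RowGroupStats stats)
    {
        rowGroupColumnStatistics.add(stats);
        columnStatisticsRetainedSizeInBytes += stats.getRetainedSizeInBytes();
    }

    // O(1): returns the cached total instead of walking every row group.
    long getRetainedBytes()
    {
        return columnStatisticsRetainedSizeInBytes;
    }

    // O(number of row groups): the uncached computation, kept for comparison.
    long computeRetainedBytesUncached()
    {
        long total = 0;
        for (RowGroupStats stats : rowGroupColumnStatistics) {
            total += stats.getRetainedSizeInBytes();
        }
        return total;
    }

    // The cached total must be cleared together with the list it summarizes.
    void reset()
    {
        rowGroupColumnStatistics.clear();
        columnStatisticsRetainedSizeInBytes = 0;
    }
}
```

With thousands of row groups and frequent retained-size checks, the uncached version repeats a full pass over the list on each call, whereas the cached version only pays one addition per finished row group. The cost is that every code path that mutates the list, including `reset()`, has to keep the cached field in sync.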
@rongrong, good catch! Fixed.
LGTM
We found a huge regression in prod caused by column stats calculation when writing large files.
Please fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.