Cache row group column stats in ORC column writer #13262

Merged: 1 commit, Aug 23, 2019

@highker highker commented Aug 21, 2019

We found a huge regression in production caused by column stats calculation when writing large files.

An ORC file can be large, containing thousands of row groups. In some
cases, 50% of CPU is spent on getting the retained size for column writers.
Cache the row group stats to save CPU.
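
For illustration, the idea described above can be sketched roughly as follows. This is a minimal sketch, not the code from this PR: `RowGroupStats` and `SketchColumnWriter` are hypothetical stand-ins, and only the field name `columnStatisticsRetainedSizeInBytes` is taken from the review discussion. The assumption is that the retained size was previously recomputed by walking every row group's stats on each call, and that keeping a running total makes the lookup constant time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the statistics kept per row group.
final class RowGroupStats {
    private final long retainedSizeInBytes;

    RowGroupStats(long retainedSizeInBytes) {
        this.retainedSizeInBytes = retainedSizeInBytes;
    }

    long getRetainedSizeInBytes() {
        return retainedSizeInBytes;
    }
}

// Hypothetical column writer that caches the total retained size of its row group stats.
final class SketchColumnWriter {
    private final List<RowGroupStats> rowGroupStats = new ArrayList<>();
    // Running total, updated whenever a row group is finished.
    private long columnStatisticsRetainedSizeInBytes;

    void finishRowGroup(RowGroupStats stats) {
        rowGroupStats.add(stats);
        columnStatisticsRetainedSizeInBytes += stats.getRetainedSizeInBytes();
    }

    // Assumed "before" behavior: O(number of row groups) on every call.
    long getRetainedBytesUncached() {
        long size = 0;
        for (RowGroupStats stats : rowGroupStats) {
            size += stats.getRetainedSizeInBytes();
        }
        return size;
    }

    // Cached variant: O(1) regardless of how many row groups exist.
    long getRetainedBytes() {
        return columnStatisticsRetainedSizeInBytes;
    }
}
```

With thousands of row groups and a retained-size check that runs frequently, the uncached loop is the kind of cost that can dominate CPU; the cached variant returns the same value without iterating.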

== RELEASE NOTES ==

Hive Changes
* Fix high CPU usage when writing ORC files with too many row groups.

highker commented Aug 21, 2019

Comments addressed; thanks for review.

@jessesleeping jessesleeping left a comment

LGTM

@rongrong rongrong left a comment

I don't know how these classes are used. Should we reset the columnStatisticsRetainedSizeInBytes in reset()?

highker commented Aug 23, 2019

@rongrong, good catch! Fixed
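
To illustrate the concern raised in review, here is a minimal sketch, assuming a writer whose `reset()` discards its accumulated row group stats. `ResettableStatsCache` and its methods are hypothetical names; only `columnStatisticsRetainedSizeInBytes` and `reset()` come from the discussion. If the cached counter were not cleared together with the stats, the reported retained size would keep growing across reuses of the writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer-like class showing why the cached size must be reset with the stats.
final class ResettableStatsCache {
    // Retained sizes of the per-row-group stats (stand-in for the stats objects themselves).
    private final List<Long> rowGroupStatsSizes = new ArrayList<>();
    private long columnStatisticsRetainedSizeInBytes;

    void addRowGroup(long retainedSizeInBytes) {
        rowGroupStatsSizes.add(retainedSizeInBytes);
        columnStatisticsRetainedSizeInBytes += retainedSizeInBytes;
    }

    long getRetainedBytes() {
        return columnStatisticsRetainedSizeInBytes;
    }

    void reset() {
        rowGroupStatsSizes.clear();
        // Without this line the cache would report stale bytes after a reset.
        columnStatisticsRetainedSizeInBytes = 0;
    }
}
```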

@highker highker requested a review from rongrong August 23, 2019 18:10

@arhimondr arhimondr left a comment

LGTM

@highker highker merged commit 6e220a1 into prestodb:master Aug 23, 2019