-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add session property to configure ORC rowGroupMaxRowCount in Iceberg #23723
Add session property to configure ORC rowGroupMaxRowCount in Iceberg #23723
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe a bit how you intend to tune this ?
I can imagine that having bigger stripe size/rowcount could be better for tables with large number of columns. But it's not clear to me what heuristic we would use to tune row-group row count. This is applied per-column, so a particular choice may be better for certain data types and worse for others. A larger value should be better for compression ratios but potentially worse for filtering (row-group min/max may be have reduced selectivity) and would increase memory requirements of the orc reader (it has to keep one decompressed row-group in memory per column in the query per split).
We have a very large table, where users often make small, targeted queries with a filter on the same one or two Our files are not sorted on these columns, so min/max for these row groups is already nearly useless for these queries – a row group of size 10,000 might have a min/max of Bloom filters are created at the row-group level, i.e. "is this element in this set of We expect that this change will allow us to significantly reduce the amount of data scanned per query, and we expect the performance gains there to outweigh any additional overhead introduced by this larger row group size. But we don't want to make this change at the catalog level, which would affect all of our writes to various tables within this catalog – instead, we'd prefer to set this as a session prop within the job that |
It sounds like sorting by those columns should help you, it has the advantage of not requiring additional tuning and potentially improves compression ratios in the file as well. |
Definitely, we have work ongoing to sort on at least one of these columns as well, although that change is a little more involved as it requires more write-time resources than creating these bloom filters does. Thanks for the review and approval! |
@cla-bot check |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
The cla-bot has been summoned, and re-checked this pull request! |
@cla-bot check |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
The cla-bot has been summoned, and re-checked this pull request! |
@cla-bot check |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
The cla-bot has been summoned, and re-checked this pull request! |
@raunaqmorarka thanks for checking the CLA again – I submitted my CLA via email on October 8, but I see the |
You don't have to do anything more, CLAs should get updated soon |
@cla-bot check |
The cla-bot has been summoned, and re-checked this pull request! |
No docs update @raunaqmorarka ? |
Description
This PR adds support for configuring max rows per "row group" in an ORC file via a session property – today this is configurable only at the catalog level. This is useful for to make it easy for different jobs in the same catalog to write files that may have different optimal row group sizes based on the shape of the underlying data they're writing.
Also, I've submitted my CLA request via email, but looks like it hasn't been approved just yet.
Additional context and related issues
Fixes #23722
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: