Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parameters to control page size in Parquet writer #10882
Add parameters to control page size in Parquet writer #10882
Changes from 9 commits
9fe14b5
7dac75f
28c5806
d316c71
a869c13
1e730ff
774e08c
a206fb2
8534232
938c3b3
6c77bfc
cf02a0d
cab058e
045e73b
42c955f
88d0c06
63b8609
46317ee
1fe3fc7
39d027e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this validate the maximum row group size against the maximum page size?
I see the comment in the tests below suggesting to test parameters on initialization, but that isn't safe if this can be altered after initialization.
Perhaps the best approach is to add validation to this method, and make the "builder" interface always set the page size before the row group size. Then the error can be raised during the builder call OR after initialization.edit: sorry, that suggestion doesn't make sense. I'll think about this some more.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I was definitely wanting some help on this logic. my only concern is with confusion when modifying both row group and page sizes; the page size would have to be changed first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it makes the most sense to validate in
build()
. Allows any order and does not delay the validation too much (e.g. like validating in the reader would).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vuule Are you concerned about users calling
set_target_page_size_bytes
and creating an invalid state after the object is built? Is that allowed by the builder interface?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, that won't always work.
We do have an equivalent interface in ORC. There, this is solved by only having a hard coded minimum for these values instead of comparing the two in the setters. We ensure that the size of the "row group" (called stripe in ORC) is not smaller than the size of "page" (called row group in ORC) in the getters instead of setters! Setters allow stripes to be set to smaller size, but then getters for the row group size return the minimum of the two. Effectively, setting stripe size low also sets the row group size to the same value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdice @vuule Should I take a stab at reproducing the ORC logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@etseidl Yes, that approach seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ORC approach is better than validating in build because we allow modifying the options struct after it's built.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this validate the maximum row group size against the maximum page size?