Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parameters to control page size in Parquet writer #10882
Add parameters to control page size in Parquet writer #10882
Changes from all commits
9fe14b5
7dac75f
28c5806
d316c71
a869c13
1e730ff
774e08c
a206fb2
8534232
938c3b3
6c77bfc
cf02a0d
cab058e
045e73b
42c955f
88d0c06
63b8609
46317ee
1fe3fc7
39d027e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You had uncompressed below, it would be good to include it here in this too as users are more likely to see it here.
Also is this a hard limit or a soft limit. i.e. what happens if a single row cannot be split, i.e. string or binary, and is larger than this is set to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. This is a soft limit, since the current implementation operates a fragment-at-a-time, allowing a single fragment to over-fill a page. I believe this is consistent with other parquet writers, such as parquet-mr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this impact lists? I assume that the data column will be split on rows too, but it would be good to have that in the docs. For row groups it is the top level number of rows, it cannot be anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with row groups, this is on row count, not value count. I originally wasn't thinking of adding the row limit, but changed my mind later to allow cudf to have something akin to the row count limit from parquet-mr. Will modify the documentation.
Question: Should I change the default to 0 and not limit on row count if that's the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuIO Parquet writer guarantees that lists won't be split across pages. Upcoming
DataPageHeaderV2
in parquet spec also guarantees this.