Fix V2 Parquet page alignment for use with zStandard compression #14841
Conversation
/ok to test
  uncompressed_size += hdr_len;
- data_len = page_g.max_data_size;
+ data_len = ck_g.is_compressed ? page_g.comp_data_size : page_g.data_size;
  // Copy page data. For V2, the level data and page data are disjoint.
disjoint because of the alignment padding?
Yes. I suppose there will be cases where they are actually contiguous, but I'm not sure it's worth testing for that.
/ok to test
trying to check if the bot works with the command at the end of the comment
/ok to test
/merge
Description
Fixes #14781
This PR changes the Parquet writer to ensure that data to be compressed is properly aligned. It also changes the `EncPage` struct to make it easier to keep its fields aligned, and to reduce confusing re-use of fields. In particular, the `max_data_size` field could hold any of a) the maximum possible size of the page data, b) the actual size of the page data after encoding, or c) the actual size of the compressed page data. The latter two now have their own fields, `data_size` and `comp_data_size`.
Checklist