Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix V2 Parquet page alignment for use with zStandard compression #14841

Merged
merged 6 commits into from
Jan 30, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jan 23, 2024

Description

Fixes #14781

This PR makes changes to the Parquet writer to ensure that data to be compressed is properly aligned. Changes have also been made to the EncPage struct to make it easier to keep fields in that struct aligned, and also to reduce confusing re-use of fields. In particular, the max_data_size field can be any of a) the maximum possible size for the page data, b) the actual size of page data after encoding, c) the actual size of compressed page data. The latter two now have their own fields, data_size and comp_data_size.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner January 23, 2024 17:49
Copy link

copy-pr-bot bot commented Jan 23, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 23, 2024
@shrshi shrshi added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 23, 2024
@shrshi
Copy link
Contributor

shrshi commented Jan 23, 2024

/ok to test

@vuule vuule self-requested a review January 24, 2024 22:15
@hyperbolic2346
Copy link
Contributor

/ok to test

uncompressed_size += hdr_len;
data_len = page_g.max_data_size;
data_len = ck_g.is_compressed ? page_g.comp_data_size : page_g.data_size;
// Copy page data. For V2, the level data and page data are disjoint.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disjoint because of the alignment padding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I suppose there will be cases where they are actually contiguous, but I don't know how worth it it is to test for that.

@hyperbolic2346
Copy link
Contributor

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 30, 2024

trying to check if the bot works with the command at the end of the comment
/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 30, 2024

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 6ed75ff into rapidsai:branch-24.04 Jan 30, 2024
68 checks passed
@etseidl etseidl deleted the v2_zstd_align branch January 30, 2024 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet V2 page headers cannot currently be used with zStandard compression
4 participants