Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer #16064

Merged: 2 commits, Jun 24, 2024

Contributor

@etseidl etseidl commented Jun 19, 2024

Description

The number of rows per fragment is off by a factor of 4 for FIXED_LEN_BYTE_ARRAY columns. This results in many more fragments than are necessary to achieve user-requested page size limits. This PR moves the determination of whether a column has fixed-width data to a location where knowledge of the schema can be used.
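To illustrate why a wrong per-row width estimate inflates the fragment count, here is a minimal sketch. It is not the libcudf implementation: `rows_per_fragment` and `num_fragments` are hypothetical helpers, the 64 KiB page limit and 16-byte column width are made-up numbers, and the sketch assumes the width was overestimated by 4x (the description only says the row count is off by a factor of 4).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a libcudf function): the largest number of rows
// whose data fits within max_page_bytes, given an estimated width per row.
// Always returns at least 1 so that a fragment is never empty.
std::size_t rows_per_fragment(std::size_t max_page_bytes, std::size_t bytes_per_row)
{
  assert(bytes_per_row > 0);
  return std::max<std::size_t>(1, max_page_bytes / bytes_per_row);
}

// Hypothetical helper: number of fragments needed to cover num_rows
// (ceiling division by the fragment size in rows).
std::size_t num_fragments(std::size_t num_rows, std::size_t frag_rows)
{
  assert(frag_rows > 0);
  return (num_rows + frag_rows - 1) / frag_rows;
}
```

With a 64 KiB page limit and one million rows, a 16-byte FIXED_LEN_BYTE_ARRAY column gives `rows_per_fragment(65536, 16) == 4096`. If the width is instead assumed to be 64 bytes, fragments shrink to 1024 rows and roughly four times as many are needed to cover the same data.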

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner June 19, 2024 17:28
@etseidl etseidl requested review from karthikeyann and ttnghia June 19, 2024 17:28
@github-actions github-actions bot added the libcudf label ("Affects libcudf (C++/CUDA) code.") Jun 19, 2024
@etseidl
Contributor Author

etseidl commented Jun 19, 2024

cc @vuule @mhaseeb123

@etseidl etseidl changed the title from "Accouont for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer" to "Account for FIXED_LEN_BYTE_ARRAY when calculating fragment sizes in Parquet writer" Jun 19, 2024
@PointKernel PointKernel added the improvement ("Improvement / enhancement to an existing function"), non-breaking ("Non-breaking change"), and cuIO ("cuIO issue") labels Jun 20, 2024
@PointKernel
Member

/ok to test

@vuule
Contributor

vuule commented Jun 24, 2024

/merge

@rapids-bot rapids-bot bot merged commit 9987410 into rapidsai:branch-24.08 Jun 24, 2024
73 checks passed
@etseidl etseidl deleted the fixed_len branch June 24, 2024 18:29