Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer #14772

Merged
merged 14 commits into from
Jan 22, 2024

Contributor

@etseidl etseidl commented Jan 17, 2024

Description

With version 2 Parquet page headers, neither the repetition nor the definition level data is compressed. The current Parquet writer achieves this by offsetting the input buffers passed to nvcomp so that they skip the level data. Doing so can pass misaligned data to nvcomp (for zstd, input currently must be aligned on a 4-byte boundary). This PR is a short-term fix that fails with an error if zStandard compression is used with V2 page headers. It also fixes an underestimation of the maximum V2 page header size.
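
To illustrate the alignment problem, here is a minimal sketch. It is not the cuDF implementation: `is_aligned` and `compressible_start` are hypothetical helpers, and the level sizes used with them are made up. Only the 4-byte requirement for zstd input comes from the description above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `ptr` sits on an `alignment`-byte boundary.
inline bool is_aligned(void const* ptr, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Hypothetical helper: start of the compressible part of a V2 page, i.e. the
// page data past the uncompressed repetition and definition level bytes.
inline std::uint8_t const* compressible_start(std::uint8_t const* page_data,
                                              std::size_t rep_level_bytes,
                                              std::size_t def_level_bytes)
{
  return page_data + rep_level_bytes + def_level_bytes;
}
```

With a 4-byte-aligned page buffer, level data totalling 8 bytes leaves the offset pointer aligned, while level data totalling 5 bytes does not. A check like `is_aligned(compressible_start(...), 4)` would flag the second case before the buffer reaches the compressor.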

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner January 17, 2024 21:59

@github-actions github-actions bot added the libcudf label ("Affects libcudf (C++/CUDA) code.") Jan 17, 2024
@etseidl etseidl changed the title from "Work around incompatabilities between V2 page header handling and zStandard compression in Parquet writer" to "Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer" Jan 17, 2024
@etseidl
Contributor Author

etseidl commented Jan 17, 2024

This PR is a temporary fix that should be in 24.02. A proper fix will be ready for 24.04.

@vuule vuule self-requested a review January 18, 2024 00:04
@vuule vuule added the bug ("Something isn't working"), non-breaking ("Non-breaking change"), and cuIO ("cuIO issue") labels Jan 18, 2024
Contributor

@vuule vuule left a comment

Thank you for the workaround!

@@ -2184,6 +2184,9 @@ writer::impl::impl(std::vector<std::unique_ptr<data_sink>> sinks,
  if (options.get_metadata()) {
    _table_meta = std::make_unique<table_input_metadata>(*options.get_metadata());
  }
  if (_write_v2_headers and _compression == Compression::ZSTD) {
    CUDF_FAIL("V2 page headers cannot be used with ZSTD compression");
Contributor

We need the same check for the chunked case as well.

Contributor Author

Oops. Or should the test (and the lines above it) go in init_state() instead? (And maybe change the name of init_state).

Contributor

Looks like I'm late with the answer :)
Side note: the current code in init_state should not be there (it breaks the strong exception guarantee).
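
The two points in this thread (the same check for the chunked case, and not breaking the strong exception guarantee) can be sketched as one shared validation step that runs before any state is modified. This is only an illustration under assumptions: `compression`, `writer_state`, `validate_options`, and `init_writer` are hypothetical names, and `std::logic_error` stands in for whatever `CUDF_FAIL` raises.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the writer's compression setting.
enum class compression { none, snappy, zstd };

// Hypothetical stand-in for the state that initialization would set up.
struct writer_state {
  bool initialized = false;
};

// Hypothetical shared check. It only reads its arguments, so a throw leaves
// the caller's state exactly as it was.
inline void validate_options(bool write_v2_headers, compression comp)
{
  if (write_v2_headers && comp == compression::zstd) {
    throw std::logic_error("V2 page headers cannot be used with ZSTD compression");
  }
}

// Both the regular and the chunked path would call this: validate first,
// and mutate state only after validation has succeeded.
inline void init_writer(writer_state& state, bool write_v2_headers, compression comp)
{
  validate_options(write_v2_headers, comp);
  state.initialized = true;
}
```

Because the check runs before the first assignment, a rejected configuration leaves `state` untouched, which is what the strong exception guarantee asks for.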

Co-authored-by: Vukasin Milovanovic <[email protected]>
@davidwendt
Contributor

/ok to test

@davidwendt
Contributor

/ok to test

@vuule
Contributor

vuule commented Jan 18, 2024

/ok to test

@davidwendt
Contributor

/ok to test

@vuule
Contributor

vuule commented Jan 22, 2024

/merge

@rapids-bot rapids-bot bot merged commit f24f0b5 into rapidsai:branch-24.02 Jan 22, 2024
67 checks passed
@etseidl etseidl deleted the v2_zstd_workaround branch January 22, 2024 18:52