-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer #14772
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer #14772
Conversation
This PR is a temporary fix that should be in 24.02. A proper fix will be ready for 24.04. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the workaround!
cpp/src/io/parquet/writer_impl.cu
Outdated
@@ -2184,6 +2184,9 @@ writer::impl::impl(std::vector<std::unique_ptr<data_sink>> sinks, | |||
if (options.get_metadata()) { | |||
_table_meta = std::make_unique<table_input_metadata>(*options.get_metadata()); | |||
} | |||
if (_write_v2_headers and _compression == Compression::ZSTD) { | |||
CUDF_FAIL("V2 page headers cannot be used with ZSTD compression"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need the same check for the chunked case as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Or should the test (and the lines above it) go in init_state()
instead? (And maybe change the name of init_state
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I'm late with the answer:)
Side note: the current code in init_state
should not be there (breaks the strong exception guarantee)
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/merge |
Description
In the version 2 Parquet page header, neither the repetition nor definition level data is compressed. The current Parquet writer achieves this by offsetting the input buffers passed to nvcomp to skip this level data. Doing so can lead to mis-aligned data being passed to nvcomp (for zstd, input currently must be aligned on a 4 byte boundary). This PR is a short-term fix that will print an error and exit if zStandard compression is used with V2 page headers. This also fixes an underestimation of the maximum V2 page header size.
Checklist