[BUG] Compressing a table with larger strings with ZSTD fails to compress #12613
Comments
Thank you for the repro file, it helps a lot!
Fragment size rears its ugly head again :( The default 5000-row fragments are around 21MB, so no zstd compression (the limit is now 16MB). Setting the fragment size to 1000 works in this instance (it gets page sizes down to 4MB) and results in a file 1099275 bytes in length, close to the 1064898 bytes from the CPU zstd. You can also lower the maximum page size. Hopefully @vuule can get time to try implementing the auto fragment size tuning we talked about last time.
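For intuition, the arithmetic behind those numbers is roughly as follows (a minimal sketch in Python; the per-row size is back-calculated from the ~21MB / 5000-row figure above, not measured from the file):

```python
# Rough arithmetic behind the fragment-size workaround described above.
avg_row_bytes = 21 * 1024 * 1024 // 5000  # ~4.4 KB/row, implied by 5000-row fragments of ~21MB
zstd_limit_bytes = 16 * 1024 * 1024       # zstd compression limit mentioned above

print(5000 * avg_row_bytes / 2**20)  # ~21 MiB per fragment -> over the limit, pages stay uncompressed
print(1000 * avg_row_bytes / 2**20)  # ~4.2 MiB per fragment -> comfortably under the limit
```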
@jbrennan333 can you test #12627 with your customer data? I've added code to set the fragment size so that a fragment with wide rows will still fit inside a page, but the possible downside is too many fragments slowing things down.
Fixes #12613

This PR adds the ability for columns to have different fragment sizes. This allows a large fragment size for narrow columns while allowing finer-grained fragments for very wide columns. This change should make wide columns fit (approximately) within page size constraints, and should help with compressors that rely on pages being under a certain size threshold (e.g., Zstandard).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #12685
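Conceptually, the fix amounts to sizing each column's fragments so that a fragment of typical rows still fits inside a page. The sketch below only illustrates that idea and is not the actual libcudf code; the function name, the 1 MiB default page size, and the 5000-row default are stand-ins:

```python
def column_fragment_size(avg_row_bytes: int,
                         max_page_size_bytes: int = 1024 * 1024,
                         default_fragment_rows: int = 5000) -> int:
    """Illustrative heuristic: pick a per-column fragment size (in rows)
    so that a fragment of average-width rows fits within one page."""
    if avg_row_bytes <= 0:
        return default_fragment_rows
    rows_per_page = max(1, max_page_size_bytes // avg_row_bytes)
    # Narrow columns keep the large default; wide columns get finer-grained fragments.
    return min(default_fragment_rows, rows_per_page)

print(column_fragment_size(avg_row_bytes=4096))  # wide string column -> 256-row fragments
print(column_fragment_size(avg_row_bytes=4))     # 32-bit int column  -> keeps the 5000-row default
```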
Describe the bug
This is a follow-up to #12249.
While current 23.02 snapshot builds do work for the case described in #12249, the customer provided us with another file that fails to compress on the GPU.
The former case had average string sizes of about 2500 characters, while this new file has an average string size of over 4K characters for one column, with some strings larger than 20,000 characters.
Steps/Code to reproduce bug
I am attaching a Parquet file with one column of semi-random strings that mimics the shape of the main column in the customer file. The original uncompressed file and the CPU/GPU zstd-compressed versions are included in the attached archive below.
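Outside of Spark, the size comparison can be reproduced with a short script along these lines (a sketch, not the exact steps used for this report; the input file name is hypothetical, and it assumes a cuDF build whose Parquet writer supports ZSTD):

```python
import os

import cudf
import pyarrow.parquet as pq

src = "col-with-large-variable-strings.parquet"  # hypothetical name for the attached repro file

# GPU write with zstd via cuDF.
cudf.read_parquet(src).to_parquet("gpu_zstd.parquet", compression="ZSTD")

# CPU write with zstd via PyArrow for comparison.
pq.write_table(pq.read_table(src), "cpu_zstd.parquet", compression="zstd")

for path in (src, "gpu_zstd.parquet", "cpu_zstd.parquet"):
    print(path, os.path.getsize(path), "bytes")
```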
Expected behavior
We should expect the GPU zstd-compressed file to be comparable in size to the CPU zstd-compressed file.
Environment overview (please complete the following information)
I tested this using the Spark shell on an Ubuntu system with an RTX 4000 GPU. I used this spark-rapids jar:
rapids-4-spark_2.12-23.02.0-20230122.152330-55-cuda11.jar
A tar archive with the uncompressed and CPU/GPU zstd-compressed versions of the test Parquet file:
col-with-large-variable-strings.tgz