[BUG] Compressing a table with larger strings with ZSTD fails to compress #12613
Comments
Thank you for the repro file, it helps a lot!
Fragment size rears its ugly head again :( The default 5000-row fragments are around 21MB, so no zstd compression (the limit is now 16MB). Setting the fragment size to 1000 works in this instance (it gets page sizes down to 4MB) and results in a file 1099275 bytes in length, close to the 1064898 bytes from the CPU zstd. You can also lower the maximum page size. Hopefully @vuule can get time to try implementing the auto fragment size tuning we talked about last time.
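For intuition, the arithmetic behind those numbers is roughly as follows (a minimal sketch in Python; the per-row size is back-calculated from the ~21MB / 5000-row figure above, not measured from the file):

```python
# Rough arithmetic behind the fragment-size workaround described above.
avg_row_bytes = 21 * 1024 * 1024 // 5000  # ~4.4 KB/row, implied by 5000-row fragments of ~21MB
zstd_limit_bytes = 16 * 1024 * 1024       # zstd compression limit mentioned above

print(5000 * avg_row_bytes / 2**20)  # ~21 MiB per fragment -> over the limit, pages stay uncompressed
print(1000 * avg_row_bytes / 2**20)  # ~4.2 MiB per fragment -> comfortably under the limit
```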
@jbrennan333 can you test #12627 with your customer data? I've added code to set the fragment size so that a fragment with wide rows will still fit inside a page, but the possible downside is too many fragments slowing things down.
Fixes #12613

This PR adds the ability for columns to have different fragment sizes. This allows a large fragment size for narrow columns while allowing finer-grained fragments for very wide columns. This change should make wide columns fit (approximately) within page size constraints, and should help with compressors that rely on pages being under a certain size threshold (e.g., Zstandard).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #12685
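Conceptually, the fix amounts to sizing each column's fragments so that a fragment of typical rows still fits inside a page. The sketch below only illustrates that idea and is not the actual libcudf code; the function name, the 1 MiB default page size, and the 5000-row default are stand-ins:

```python
def column_fragment_size(avg_row_bytes: int,
                         max_page_size_bytes: int = 1024 * 1024,
                         default_fragment_rows: int = 5000) -> int:
    """Illustrative heuristic: pick a per-column fragment size (in rows)
    so that a fragment of average-width rows fits within one page."""
    if avg_row_bytes <= 0:
        return default_fragment_rows
    rows_per_page = max(1, max_page_size_bytes // avg_row_bytes)
    # Narrow columns keep the large default; wide columns get finer-grained fragments.
    return min(default_fragment_rows, rows_per_page)

print(column_fragment_size(avg_row_bytes=4096))  # wide string column -> 256-row fragments
print(column_fragment_size(avg_row_bytes=4))     # 32-bit int column  -> keeps the 5000-row default
```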
Describe the bug
This is a follow-up to #12249.
While current 23.02 snapshot builds do work for the case described in #12249, the customer provided us with another file that fails to compress on the GPU.
The former case had average string sizes of about 2500 characters, while this new file has an average string size of over 4K characters for one column, with some strings larger than 20,000 characters.
Steps/Code to reproduce bug
I am attaching a Parquet file with one column of semi-random strings that mimics the shape of the main column in the customer file. The original uncompressed file and the CPU/GPU zstd-compressed versions are included in the attached archive below.
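Outside of Spark, the size comparison can be reproduced with a short script along these lines (a sketch, not the exact steps used for this report; the input file name is hypothetical, and it assumes a cuDF build whose Parquet writer supports ZSTD):

```python
import os

import cudf
import pyarrow.parquet as pq

src = "col-with-large-variable-strings.parquet"  # hypothetical name for the attached repro file

# GPU write with zstd via cuDF.
cudf.read_parquet(src).to_parquet("gpu_zstd.parquet", compression="ZSTD")

# CPU write with zstd via PyArrow for comparison.
pq.write_table(pq.read_table(src), "cpu_zstd.parquet", compression="zstd")

for path in (src, "gpu_zstd.parquet", "cpu_zstd.parquet"):
    print(path, os.path.getsize(path), "bytes")
```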
Expected behavior
We should expect the GPU zstd-compressed file to be comparable in size to the CPU zstd-compressed file.
Environment overview (please complete the following information)
I tested this using the Spark shell on an Ubuntu system with an RTX 4000 GPU. I used this spark-rapids jar:
rapids-4-spark_2.12-23.02.0-20230122.152330-55-cuda11.jar
A tar archive with the uncompressed and CPU/GPU zstd-compressed versions of the test Parquet file:
col-with-large-variable-strings.tgz