
[BUG] Compressing a table with larger strings with ZSTD fails to compress #12613

Closed · jbrennan333 opened this issue Jan 25, 2023 · 4 comments · Fixed by #12685
Labels: bug (Something isn't working)

Comments

jbrennan333 (Contributor)

Describe the bug
This is a follow-up to #12249. While current 23.02 snapshot builds do work for the case described in #12249, the customer provided us with another file that fails to compress on the GPU.

The former case had average string sizes of about 2,500 characters, while this new file has an average string size of over 4K characters for one column, with some strings larger than 20,000 characters.

Steps/Code to reproduce bug
I am attaching a parquet file with one column of semi-random strings that mimics the shape of the main column in the customer file. These are the sizes of the original uncompressed file and the CPU/GPU compressed versions:

jimb@jimb-ldt:/opt/data/jimb$ du -sh test-var*
71M	test-variable-strings
1.1M	test-variable-strings-cpu-zstd
71M	test-variable-strings-gpu-zstd
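
For reference, a minimal repro sketch using the cudf Python API directly instead of going through the Spark shell. The file path is a placeholder for the attached repro file, and it assumes ZSTD output compression is available in the cudf Python writer for this release:

import cudf

# Read the attached repro file (path is a placeholder for wherever it is extracted).
df = cudf.read_parquet("test-variable-strings")

# Write it back out with GPU ZSTD compression using default writer settings.
# With ~4K average strings, the default writer fragments build pages larger
# than the GPU ZSTD compressor accepts, so the data ends up effectively
# uncompressed (~71M on disk instead of ~1.1M).
df.to_parquet("test-variable-strings-gpu-zstd", compression="ZSTD")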

Expected behavior
We should expect the GPU ZSTD-compressed file to be comparable in size to the CPU ZSTD-compressed file.

Environment overview (please complete the following information)
I tested this using the Spark shell on an Ubuntu system with an RTX 4000 GPU. I used this Spark RAPIDS jar:
rapids-4-spark_2.12-23.02.0-20230122.152330-55-cuda11.jar

A tar archive with the uncompressed and CPU/GPU compressed versions of the test parquet file:
col-with-large-variable-strings.tgz

jbrennan333 added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Jan 25, 2023
jbrennan333 (Contributor, Author)

@etseidl, @vuule can you take a look? It seems we need to do more work to automatically handle compression of columns like this with potentially very large strings.

vuule (Contributor) commented Jan 25, 2023

Thank you for the repro file, it helps a lot!
I can look into this later this week.

etseidl (Contributor) commented Jan 25, 2023

Fragment size rears its ugly head again :( The default 5000-row fragments are around 21MB, so no ZSTD compression happens (the limit is now 16MB). Setting the fragment size to 1000 works in this instance (it gets page sizes down to 4MB) and results in a file 1,099,275 bytes in length, close to the 1,064,898 bytes from the CPU ZSTD file. You can also lower max_page_size_bytes to 256MB or less, which will scale the fragment size down (I'm pretty sure the fragment size setting is not yet exposed through the Java interface).
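
As a concrete illustration of the second workaround, here is a sketch of lowering the page size budget via the cudf Python writer. It assumes max_page_size_bytes is exposed on DataFrame.to_parquet in the release being tested, and the value shown is only an example, not a tuned recommendation:

import cudf

df = cudf.read_parquet("test-variable-strings")  # placeholder path

# Illustrative only: shrink the page size budget so the writer scales the
# fragment size down and the resulting pages stay under the GPU ZSTD
# input limit. The right value depends on the column's average row width.
df.to_parquet(
    "test-variable-strings-gpu-zstd",
    compression="ZSTD",
    max_page_size_bytes=1024 * 1024,  # example value only
)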

Hopefully @vuule can get time to try implementing the auto fragment size tuning we talked about last time.

etseidl (Contributor) commented Jan 26, 2023

@jbrennan333 can you test #12627 with your customer data? I've added code to set the fragment size such that a fragment with wide rows will still fit inside a page, but the possible downside is too many fragments slowing things down.
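
For intuition, here is a rough sketch of the kind of sizing heuristic being described: pick a fragment row count so that an average fragment still fits within the page size budget. The function, constants, and defaults below are illustrative, not the actual cudf implementation:

def pick_fragment_size(avg_row_bytes: int,
                       max_page_size_bytes: int = 512 * 1024,
                       default_fragment_size: int = 5000) -> int:
    """Illustrative heuristic: cap the rows per fragment so that an average
    fragment stays within the page size budget."""
    if avg_row_bytes <= 0:
        return default_fragment_size
    rows_that_fit = max(1, max_page_size_bytes // avg_row_bytes)
    return min(default_fragment_size, rows_that_fit)

# Example: ~4 KB average strings -> fragments of ~128 rows instead of 5000,
# keeping pages comfortably under the 16 MB GPU ZSTD input limit.
print(pick_fragment_size(avg_row_bytes=4096))  # prints 128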

rapids-bot pushed a commit that referenced this issue on Feb 22, 2023:
Fixes #12613

This PR adds the ability for columns to have different fragment sizes. This allows a large fragment size for narrow columns, but finer-grained fragments for very wide columns. This change should make wide columns fit (approximately) within page size constraints, and should help with compressors that rely on pages being under a certain threshold (e.g., Zstandard).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #12685
bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024