
[BUG] Compressing a table with large strings using ZSTD can result in little or no compression #12249

Closed
jbrennan333 opened this issue Nov 28, 2022 · 18 comments
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@jbrennan333 (Contributor)

Describe the bug
An internal Spark customer tried ZSTD compression on the GPU with a 22.12 snapshot release and reported that they were getting no compression, while on the CPU they were getting very good compression.
Using the first 100,000 rows of one of their tables, I got:

255M	2022-01-02-cpu-none
18M	2022-01-02-cpu-zstd
255M	2022-01-02-gpu-zstd

I was also able to repro with 100 rows, and with parquet-tools, I could see that most of the columns were uncompressed in the GPU version, in particular this one:

                ColumnChunk
                    meta_data = ColumnMetaData
                        type = 6
                        encodings = list
                            0
                            3
                        path_in_schema = list
                            data
                        num_values = 100
                        total_uncompressed_size = 250217
                        total_compressed_size = 250217

That data column contained strings of variable length, all around 2500 characters long. Each string was a JSON structure with the same set of fields but differing values, so there were a lot of common characters.
The problem is that these columns were going over the 64KB limit for ZSTD, so the Parquet writer was falling back to uncompressed.

Snappy does not appear to have this problem.
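For reference, a minimal pyarrow sketch of how the same thing can be spotted programmatically (the file name is a placeholder): a column chunk that fell back to uncompressed reports total_compressed_size equal to total_uncompressed_size, as in the dump above.

```python
import pyarrow.parquet as pq

# Placeholder path; point this at the GPU-written file being inspected.
md = pq.ParquetFile("2022-01-02-gpu-zstd.parquet").metadata

for rg in range(md.num_row_groups):
    for ci in range(md.num_columns):
        chunk = md.row_group(rg).column(ci)
        # Equal sizes mean the chunk was written without compression.
        print(chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size, chunk.total_uncompressed_size)
```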

Steps/Code to reproduce bug
I was able to reproduce this by generating a table with 32 rows of strings, where each string consisted of a random 64-character prefix followed by an 8-character string repeated 248 times. I will attach a parquet file that reproduces the problem if you read it in and then write it out with ZSTD compression.
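For illustration, a rough Python sketch of how data of this shape could be generated and written both ways; the 8-character filler string and the file names are placeholders, and the GPU write assumes a cuDF build with ZSTD Parquet writing enabled (the original repro was produced through Spark rather than this script).

```python
import random
import string

import pandas as pd
import cudf  # assumes a cuDF build with ZSTD Parquet writing enabled

def make_string() -> str:
    # A random 64-character prefix followed by an 8-character string
    # repeated 248 times (2048 characters per row in total).
    prefix = "".join(random.choices(string.ascii_letters, k=64))
    return prefix + "abcd1234" * 248  # "abcd1234" is just a placeholder

pdf = pd.DataFrame({"data": [make_string() for _ in range(32)]})

# CPU reference files (pandas/pyarrow writer).
pdf.to_parquet("test-data-32-cpu-none.parquet", compression=None)
pdf.to_parquet("test-data-32-cpu-zstd.parquet", compression="zstd")

# GPU writer; with the 22.12 snapshot this comes out essentially uncompressed.
cudf.from_pandas(pdf).to_parquet("test-data-32-gpu-zstd.parquet", compression="ZSTD")
```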
These were the results I got with 32 rows:

80K	test-data-32-cpu-none
16K	test-data-32-cpu-zstd
84K	test-data-32-gpu-zstd

And this is what I got with 31 rows (which keeps it under the 64KB limit):

76K	test-data-31-cpu-none
16K	test-data-31-cpu-zstd
20K	test-data-31-gpu-zstd

Expected behavior
When you compress a file with zstd using the gpu, it should provide some compression, ideally comparable to the CPU.

Environment overview (please complete the following information)
I tested this with Spark using a snapshot of the Spark-rapids plugin running on a 22.12 cuDF snapshot.

@jbrennan333 added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Nov 28, 2022
@jbrennan333 (Contributor Author)

repro-data.tgz

Uploaded a tar file with the uncompressed data, the cpu zstd compressed data, and the gpu zstd compressed data.
This is the example data that I generated that shows the problem.

@jbrennan333 (Contributor Author)

@vuule can you please take a look at this?

@vuule added the cuIO (cuIO issue) label on Nov 29, 2022
@vuule removed the Needs Triage (Need team to review and classify) label on Nov 29, 2022
@vuule (Contributor) commented Nov 29, 2022

@etseidl opened a PR that's potentially a fix for this issue: #12182

@etseidl (Contributor) commented Nov 29, 2022

Looking at the files, the issue is the fragment size. There are only 32 rows, but the data exceeds the 64k zstd buffer size by a few bytes (each row holds 64 + 8×248 = 2048 characters, so 32 rows already total 64 KiB of character data before any per-value overhead). The fragment size would need to be set to 30 or less, and the fix from #12182 applied, for compression to happen.

FWIW, I'm currently testing a prerelease of nvcomp that has a much larger buffer size, and I do get slightly better compression when using it with #12211.

@jbrennan333 (Contributor Author)

I have tested this with #12211, and it did not resolve it. In the customer case, the files are much larger, and the resulting row groups are much bigger as well.

@jbrennan333 (Contributor Author)

Oh, and I did verify that setting the rows to 31 does allow compression to happen:

76K	test-data-31-cpu-none
16K	test-data-31-cpu-zstd
20K	test-data-31-gpu-zstd

@etseidl (Contributor) commented Nov 29, 2022

I guess we'll have to wait on nvcomp 2.5 then (although there's the risk of out-of-memory errors...I've been having to do a lot of parameter tuning to get optimal compression and batch size). I also think the page boundary calculation needs to be reworked; lists and large binary values cause so much grief.

@etseidl (Contributor) commented Nov 29, 2022

@jbrennan333 I've modified #12211 to add a max_page_fragment_size parameter to the writer. Setting this to 20 yielded a compressed file comparable in size to the cpu-zstd file. I'm curious to see if this will work with your larger data sets (and also wondering what effect there will be on performance).

@jbrennan333 (Contributor Author)

> I've modified #12211 to add a max_page_fragment_size parameter to the writer. Setting this to 20 yielded a compressed file comparable in size to the cpu-zstd file. I'm curious to see if this will work with your larger data sets (and also wondering what effect there will be on performance).

I built with that patch and hard-coded max_page_fragment_size=20. This did result in more comparable compression results:

255M	2022-01-02-cpu-none
18M	2022-01-02-cpu-zstd-12211
25M	2022-01-02-gpu-zstd-12211

I don't have good performance numbers on this, since I'm just reading/writing these files on my desktop from the Spark shell. That said, I did time these writes: CPU: 1381 ms, GPU: 831 ms.

Are we considering using a very small fragment size like this, or possibly calculating it in some way?

@vuule (Contributor) commented Nov 29, 2022

The idea is to derive fragment size per column, based on the column data size. But it will take a bit to implement :)
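To make that idea concrete, here is a rough sketch of what such a per-column heuristic might look like; the names, the 5000-row default, and the 10% headroom are illustrative assumptions, not the eventual libcudf implementation.

```python
ZSTD_MAX_PAGE_BYTES = 64 * 1024   # nvcomp (< 2.5) compression buffer limit
DEFAULT_FRAGMENT_ROWS = 5000      # illustrative default fragment size

def fragment_size_for(column_bytes: int, num_rows: int) -> int:
    """Pick a fragment size so one fragment's worth of data stays under the buffer limit."""
    if num_rows == 0 or column_bytes == 0:
        return DEFAULT_FRAGMENT_ROWS
    avg_row_bytes = column_bytes / num_rows
    # Leave ~10% headroom for per-value overhead (offsets, repetition/definition levels).
    rows_that_fit = int((ZSTD_MAX_PAGE_BYTES * 0.9) // avg_row_bytes)
    return max(1, min(DEFAULT_FRAGMENT_ROWS, rows_that_fit))

# The repro column: ~2048 characters per row -> fragments of roughly 28 rows.
print(fragment_size_for(column_bytes=32 * 2048, num_rows=32))
```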

@etseidl (Contributor) commented Nov 29, 2022

> The idea is to derive fragment size per column, based on the column data size. But it will take a bit to implement :)

This. What I've done so far is a workaround until what @vuule suggested can be implemented. It would require exposing this through the Java interface, and then adding a way for Spark users to set it. It would definitely be a power-user kind of thing. But it's at least nice to see that the small fragments a) worked, and b) the GPU was faster than the CPU.

The trouble with ultra-small fragment sizes is that they add a lot of memory burden on the writer, so we wouldn't want this to be the default. But for deeply nested data or huge strings, I don't think it can be avoided.

Also, some of this will be fixed once the zstd compressor supports larger buffer sizes, but there might still be a need for parameter tuning.

@jbrennan333 (Contributor Author)

@vuule, @etseidl do we have a sense of whether #12211 with nvcomp-2.5 or 2.5.1 will help with this case?

@etseidl (Contributor) commented Jan 6, 2023

> @vuule, @etseidl do we have a sense of whether #12211 with nvcomp-2.5 or 2.5.1 will help with this case?

nvcomp-2.5 will help since it raises the compression buffer size to 16MB (from 64KB), so the fragment size issues should go away. The flip side, however, is that the temporary memory requirements are pretty severe. I've added code to dial back how many pages are compressed in each batch, which should help with that. nvcomp-2.5.1 will have a more accurate temp space calculation, which should help further. Maybe this is a good test case to resolve this discussion.

Can you upload a larger data set to experiment with?
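To illustrate the batching idea mentioned above: the function, the per-page temp requirement, and the scratch budget below are hypothetical names and numbers, not the actual cuDF code. The point is just that capping the number of pages per batch bounds the temporary memory the compressor needs at once.

```python
def pages_per_batch(num_pages: int, temp_bytes_per_page: int, budget_bytes: int) -> int:
    """How many pages can be handed to the compressor at once within a scratch budget."""
    if temp_bytes_per_page <= 0:
        return num_pages
    return max(1, min(num_pages, budget_bytes // temp_bytes_per_page))

# Hypothetical figures: 64 pages, ~48 MiB of temp space per page, 1 GiB budget.
print(pages_per_batch(num_pages=64,
                      temp_bytes_per_page=48 * 1024 * 1024,
                      budget_bytes=1024 * 1024 * 1024))  # -> 21 pages per batch
```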

@jbrennan333 (Contributor Author)

I generated a similar sample with 10000 rows. Sizes are:

 20M	cpu-10000-data.parquet
504K	cpu-10000-data.zstd.parquet
 20M	gpu-10000-data.zstd.parquet

repro-data-10000-rows.tgz

@etseidl (Contributor) commented Jan 6, 2023

Using #12211 with nvcomp 2.5 I get

20M	cpu-10000-data.parquet
508K	cpu-10000-data.zstd.parquet
524K	gpu-10000-64k.parquet
20M	gpu-10000-data.zstd.parquet
524K	gpu-10000.parquet

gpu-10000.parquet uses all default values, and gpu-10000-64k.parquet uses 64k page sizes. The execution time is 0.3 s for the former and 0.08 s for the latter. There are 2 pages for the default and 16 pages for 64k, so more opportunity for parallelism. This only used 98 MB of temp buffer, so it didn't really need any of the changes I made; this is all just the larger nvcomp buffer.

@jbrennan333 (Contributor Author)

That's good news! Thanks!

@vuule (Contributor) commented Jan 18, 2023

@jbrennan333 libcudf has made the required changes to utilize nvCOMP 2.6 and switched the build to use it as well. Can we close this issue now?

@jbrennan333 (Contributor Author)

@razajafri has also verified that this is fixed with nvcomp-2.6.
