
[BUG] Compressing a table with large strings using ZSTD can result in little or no compression #12249

Closed
jbrennan333 opened this issue Nov 28, 2022 · 18 comments
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@jbrennan333 (Contributor)

Describe the bug
An internal Spark customer tried ZSTD compression on the GPU with a 22.12 snapshot release and reported that they were getting no compression, while on the CPU they were getting very good compression.
Using the first 100,000 rows of one of their tables, I got:

255M	2022-01-02-cpu-none
18M	2022-01-02-cpu-zstd
255M	2022-01-02-gpu-zstd

I was also able to repro with 100 rows, and with parquet-tools, I could see that most of the columns were uncompressed in the GPU version, in particular this one:

                ColumnChunk
                    meta_data = ColumnMetaData
                        type = 6
                        encodings = list
                            0
                            3
                        path_in_schema = list
                            data
                        num_values = 100
                        total_uncompressed_size = 250217
                        total_compressed_size = 250217

That data column contained strings of variable length, all around 2500 characters long. Each string was a JSON structure with the same set of fields but differing values, so there were a lot of common characters.
The problem is that these columns were going over the 64KB limit for ZSTD, so the Parquet writer was falling back to uncompressed.

Snappy does not appear to have this problem.
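For reference, a minimal pyarrow sketch of how the same thing can be spotted programmatically (the file name is a placeholder): a column chunk that fell back to uncompressed reports total_compressed_size equal to total_uncompressed_size, as in the dump above.

```python
import pyarrow.parquet as pq

# Placeholder path; point this at the GPU-written file being inspected.
md = pq.ParquetFile("2022-01-02-gpu-zstd.parquet").metadata

for rg in range(md.num_row_groups):
    for ci in range(md.num_columns):
        chunk = md.row_group(rg).column(ci)
        # Equal sizes mean the chunk was written without compression.
        print(chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size, chunk.total_uncompressed_size)
```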

Steps/Code to reproduce bug
I was able to reproduce this by generating a table with 32 rows of strings, where each string consisted of a random 64-character prefix followed by an 8-character string repeated 248 times. I will attach a parquet file that reproduces the problem if you read it in and then write it out with ZSTD compression.
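For illustration, a rough Python sketch of how data of this shape could be generated and written both ways; the 8-character filler string and the file names are placeholders, and the GPU write assumes a cuDF build with ZSTD Parquet writing enabled (the original repro was produced through Spark rather than this script).

```python
import random
import string

import pandas as pd
import cudf  # assumes a cuDF build with ZSTD Parquet writing enabled

def make_string() -> str:
    # A random 64-character prefix followed by an 8-character string
    # repeated 248 times (2048 characters per row in total).
    prefix = "".join(random.choices(string.ascii_letters, k=64))
    return prefix + "abcd1234" * 248  # "abcd1234" is just a placeholder

pdf = pd.DataFrame({"data": [make_string() for _ in range(32)]})

# CPU reference files (pandas/pyarrow writer).
pdf.to_parquet("test-data-32-cpu-none.parquet", compression=None)
pdf.to_parquet("test-data-32-cpu-zstd.parquet", compression="zstd")

# GPU writer; with the 22.12 snapshot this comes out essentially uncompressed.
cudf.from_pandas(pdf).to_parquet("test-data-32-gpu-zstd.parquet", compression="ZSTD")
```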
These were the results I got with 32 rows:

80K	test-data-32-cpu-none
16K	test-data-32-cpu-zstd
84K	test-data-32-gpu-zstd

And this is what I got with 31 rows (which keeps it under the 64KB limit):

76K	test-data-31-cpu-none
16K	test-data-31-cpu-zstd
20K	test-data-31-gpu-zstd

Expected behavior
When you compress a file with zstd using the gpu, it should provide some compression, ideally comparable to the CPU.

Environment overview (please complete the following information)
I tested this with Spark using a snapshot of the Spark-rapids plugin running on a 22.12 cuDF snapshot.

@jbrennan333 added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Nov 28, 2022
@jbrennan333 (Contributor Author)

repro-data.tgz

Uploaded a tar file with the uncompressed data, the cpu zstd compressed data, and the gpu zstd compressed data.
This is the example data that I generated that shows the problem.

@jbrennan333 (Contributor Author)

@vuule can you please take a look at this?

@vuule added the cuIO (cuIO issue) label on Nov 29, 2022
@vuule removed the Needs Triage (Need team to review and classify) label on Nov 29, 2022
@vuule (Contributor) commented Nov 29, 2022

@etseidl opened a PR that's potentially a fix for this issue: #12182

@etseidl (Contributor) commented Nov 29, 2022

Looking at the files, the issue is the fragment size. There are only 32 rows, but the data exceeds the 64k zstd buffer size by a few bytes (each row holds 64 + 8×248 = 2048 characters, so 32 rows already total 64 KiB of character data before any per-value overhead). The fragment size would need to be set to 30 or less, and the fix from #12182 applied, for compression to happen.

FWIW, I'm currently testing a prerelease of nvcomp that has a much larger buffer size, and I do get slightly better compression when using it with #12211.

@jbrennan333 (Contributor Author)

I have tested this with #12211, and it did not resolve it. In the customer case, the files are much larger, and the resulting row groups are much bigger as well.

@jbrennan333 (Contributor Author)

Oh, and I did verify that setting the rows to 31 does allow compression to happen:

76K	test-data-31-cpu-none
16K	test-data-31-cpu-zstd
20K	test-data-31-gpu-zstd

@etseidl (Contributor) commented Nov 29, 2022

I guess we'll have to wait on nvcomp 2.5 then (although there's the risk of out-of-memory errors...I've been having to do a lot of parameter tuning to get optimal compression and batch size). I also think the page boundary calculation needs to be reworked; lists and large binary values cause so much grief.

@etseidl (Contributor) commented Nov 29, 2022

@jbrennan333 I've modified #12211 to add a max_page_fragment_size parameter to the writer. Setting this to 20 yielded a compressed file comparable in size to the cpu-zstd file. I'm curious to see if this will work with your larger data sets (and also wondering what effect there will be on performance).

@jbrennan333 (Contributor Author)

> I've modified #12211 to add a max_page_fragment_size parameter to the writer. Setting this to 20 yielded a compressed file comparable in size to the cpu-zstd file. I'm curious to see if this will work with your larger data sets (and also wondering what effect there will be on performance).

I built with that patch and hard-coded max_page_fragment_size=20. This did result in more comparable compression results:

255M	2022-01-02-cpu-none
18M	2022-01-02-cpu-zstd-12211
25M	2022-01-02-gpu-zstd-12211

I don't have good performance numbers on this, since I'm just reading/writing these files on my desktop from the Spark shell. That said, I did time these writes: CPU: 1381 ms, GPU: 831 ms.

Are we considering using a very small fragment size like this, or possibly calculating it in some way?

@vuule (Contributor) commented Nov 29, 2022

The idea is to derive fragment size per column, based on the column data size. But it will take a bit to implement :)
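To make that idea concrete, here is a rough sketch of what such a per-column heuristic might look like; the names, the 5000-row default, and the 10% headroom are illustrative assumptions, not the eventual libcudf implementation.

```python
ZSTD_MAX_PAGE_BYTES = 64 * 1024   # nvcomp (< 2.5) compression buffer limit
DEFAULT_FRAGMENT_ROWS = 5000      # illustrative default fragment size

def fragment_size_for(column_bytes: int, num_rows: int) -> int:
    """Pick a fragment size so one fragment's worth of data stays under the buffer limit."""
    if num_rows == 0 or column_bytes == 0:
        return DEFAULT_FRAGMENT_ROWS
    avg_row_bytes = column_bytes / num_rows
    # Leave ~10% headroom for per-value overhead (offsets, repetition/definition levels).
    rows_that_fit = int((ZSTD_MAX_PAGE_BYTES * 0.9) // avg_row_bytes)
    return max(1, min(DEFAULT_FRAGMENT_ROWS, rows_that_fit))

# The repro column: ~2048 characters per row -> fragments of roughly 28 rows.
print(fragment_size_for(column_bytes=32 * 2048, num_rows=32))
```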

@etseidl (Contributor) commented Nov 29, 2022

> The idea is to derive fragment size per column, based on the column data size. But it will take a bit to implement :)

This. What I've done so far is a workaround until what @vuule suggested can be implemented. It would require exposing this through the Java interface, and then adding a way for Spark users to set it. It would definitely be a power-user kind of thing. But it's at least nice to see that the small fragments a) worked, and b) the GPU was faster than the CPU.

The trouble with ultra-small fragment sizes is that they add a lot of memory burden on the writer, so we wouldn't want this to be the default. But for deeply nested data or huge strings, I don't think it can be avoided.

Also, some of this will be fixed once the zstd compressor supports larger buffer sizes, but there might still be a need for parameter tuning.

@jbrennan333 (Contributor Author)

@vuule, @etseidl do we have a sense of whether #12211 with nvcomp-2.5 or 2.5.1 will help with this case?

@etseidl (Contributor) commented Jan 6, 2023

> @vuule, @etseidl do we have a sense of whether #12211 with nvcomp-2.5 or 2.5.1 will help with this case?

nvcomp-2.5 will help since it raises the compression buffer size to 16MB (from 64KB), so the fragment size issues should go away. The flip side, however, is that the temporary memory requirements are pretty severe. I've added code to dial back how many pages are compressed in each batch, which should help with that. nvcomp-2.5.1 will have a more accurate temp space calculation, which should help further. Maybe this is a good test case to resolve this discussion.

Can you upload a larger data set to experiment with?
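To illustrate the batching idea mentioned above: the function, the per-page temp requirement, and the scratch budget below are hypothetical names and numbers, not the actual cuDF code. The point is just that capping the number of pages per batch bounds the temporary memory the compressor needs at once.

```python
def pages_per_batch(num_pages: int, temp_bytes_per_page: int, budget_bytes: int) -> int:
    """How many pages can be handed to the compressor at once within a scratch budget."""
    if temp_bytes_per_page <= 0:
        return num_pages
    return max(1, min(num_pages, budget_bytes // temp_bytes_per_page))

# Hypothetical figures: 64 pages, ~48 MiB of temp space per page, 1 GiB budget.
print(pages_per_batch(num_pages=64,
                      temp_bytes_per_page=48 * 1024 * 1024,
                      budget_bytes=1024 * 1024 * 1024))  # -> 21 pages per batch
```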

@jbrennan333 (Contributor Author)

I generated a similar sample with 10000 rows. Sizes are:

 20M	cpu-10000-data.parquet
504K	cpu-10000-data.zstd.parquet
 20M	gpu-10000-data.zstd.parquet

repro-data-10000-rows.tgz

@etseidl (Contributor) commented Jan 6, 2023

Using #12211 with nvcomp 2.5 I get

20M	cpu-10000-data.parquet
508K	cpu-10000-data.zstd.parquet
524K	gpu-10000-64k.parquet
20M	gpu-10000-data.zstd.parquet
524K	gpu-10000.parquet

gpu-10000.parquet uses all default values, and gpu-10000-64k.parquet uses 64k page sizes. The execution time is 0.3 s for the former and 0.08 s for the latter. There are 2 pages for the default and 16 pages for 64k, so more opportunity for parallelism. This only used 98 MB of temp buffer, so it didn't really need any of the changes I made; this is all just the larger nvcomp buffer.

@jbrennan333 (Contributor Author)

That's good news! Thanks!

@vuule (Contributor) commented Jan 18, 2023

@jbrennan333 libcudf has made the required changes to utilize nvCOMP 2.6 and switched the build to use it as well. Can we close this issue now?

@jbrennan333 (Contributor Author)

@razajafri has also verified that this is fixed with nvcomp-2.6.
