[BUG] Compressing a table with large strings using ZSTD can result in little or no compression #12249
Comments
Uploaded a tar file with the uncompressed data, the CPU zstd-compressed data, and the GPU zstd-compressed data.
@vuule can you please take a look at this?
Looking at the files, the issue is the fragment size. There are only 32 rows, but the data exceeds the 64 KB zstd buffer size by a few bytes. The fragment size would need to be set to 30 or less, and the fix from #12182 applied, for compression to happen. FWIW, I'm currently testing a prerelease of nvcomp that has a much larger buffer size, and I do get slightly better compression when using that with #12211.
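For context, a back-of-the-envelope check using the row size from the repro below (roughly 2 KiB of string data per row) shows why 32 rows push past the old 64 KiB buffer while 30-31 rows squeeze under it; the per-string overhead here is an illustrative guess, not the writer's actual accounting.

```python
# Rough size check, assuming ~2 KiB of string data per row as in the repro below;
# OVERHEAD is an illustrative guess at per-string bookkeeping, not the real math.
ROW_BYTES = 64 + 8 * 248          # 2048 bytes of characters per row
OVERHEAD = 4                      # assumed per-string overhead (illustrative)
ZSTD_BUFFER = 64 * 1024           # 64 KiB zstd input buffer limit at the time

for rows in (30, 31, 32):
    total = rows * (ROW_BYTES + OVERHEAD)
    print(f"{rows} rows -> {total} bytes, fits: {total <= ZSTD_BUFFER}")
```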
I have tested this with #12211, and it did not resolve it. In the customer case, the files are much larger, and the resulting row groups are much bigger as well.
Oh, and I did verify that setting the rows to 31 does allow compression to happen:
I guess we'll have to wait on nvcomp 2.5 then (although there's the risk of out-of-memory errors... I've been having to do a lot of parameter tuning to get optimal compression and batch size). I also think the page boundary calculation needs to be reworked; lists and large binary values cause so much grief.
@jbrennan333 I've modified #12211 to add a
I built with that patch and hard-coded
I don't have good performance numbers on this, since I'm just reading/writing these files on my desktop from the spark shell. That said, I did time these writes. Are we considering using a very small fragment size like this, or possibly calculating it in some way?
The idea is to derive fragment size per column, based on the column data size. But it will take a bit to implement :)
This. What I've done so far is a workaround until what @vuule suggested can be implemented. It would require exposing this through the Java interface, and then adding a way for Spark users to set it (see the hypothetical sketch below). It would definitely be a power-user kind of thing. But it's at least nice to see that the small fragments a) worked, and b) the GPU was faster than the CPU. The trouble with ultra-small fragment sizes is that they add a lot of memory burden on the writer, so we wouldn't want it to be the default. But for deeply nested data or huge strings, I don't think it can be avoided. Also, some of this will be fixed once the zstd compressor supports larger buffer sizes, but there might still be a need for parameter tuning.
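To make that concrete, here is a purely hypothetical sketch of what such a user-facing knob could look like from PySpark; the config key below does not exist in spark-rapids and is only meant to illustrate the kind of power-user setting being discussed.

```python
from pyspark.sql import SparkSession

# Hypothetical config key -- not an existing spark-rapids option; shown only to
# illustrate how a per-job fragment-size override could be surfaced to Spark users.
spark = SparkSession.builder.appName("fragment-size-demo").getOrCreate()
spark.conf.set("spark.rapids.sql.format.parquet.writer.maxPageFragmentSize", "20")
```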
nvcomp-2.5 will help since it raises the compression buffer size to 16 MB (from 64 KB), so the fragment size issues should go away. The flip side of that, however, is that the temporary memory requirements are pretty severe. I've added code to dial back how many pages are compressed in each batch, which should help with that. nvcomp-2.5.1 will have a more accurate temp space calculation, which should help further. Maybe this is a good test case to resolve this discussion. Can you upload a larger data set to experiment with?
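The batching idea is roughly the following; this is a minimal sketch, not the libcudf code, and `temp_space` is an assumed per-page estimator standing in for whatever the compression library actually reports.

```python
# Minimal sketch of splitting pages into compression batches so that the
# compressor's temporary memory stays under a budget. `temp_space` is an assumed
# callback, not a real nvcomp API.
def batch_pages(page_sizes, temp_budget_bytes, temp_space):
    batches, current, current_temp = [], [], 0
    for size in page_sizes:
        need = temp_space(size)
        if current and current_temp + need > temp_budget_bytes:
            batches.append(current)       # close the batch before it overflows
            current, current_temp = [], 0
        current.append(size)
        current_temp += need
    if current:
        batches.append(current)
    return batches

# Example: assume temp space is ~2x the page size and the budget is 256 MiB.
pages = [16 << 20] * 40                                      # forty 16 MiB pages
print(len(batch_pages(pages, 256 << 20, lambda s: 2 * s)))   # -> 5 batches
```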
I generated a similar sample with 10000 rows. Sizes are:
Using #12211 with nvcomp 2.5 I get
That's good news! Thanks!
@jbrennan333 libcudf has made the required changes to utilize nvCOMP 2.6, and switched the build to use it as well. Can we close this issue now?
@razajafri has also verified that this is fixed with nvcomp-2.6.
Describe the bug
A Spark internal customer tried using zstd compression on the GPU in a 22.12 snapshot release and reported that they were getting no compression, while on the CPU they were getting very good compression.
Using the first 100,000 rows of one of their tables, I got:
I was also able to repro with 100 rows, and with parquet-tools, I could see that most of the columns were uncompressed in the GPU version, in particular this one:
That data column contained strings of variable length that were all around 2,500 characters long. Each string was a JSON structure with the same set of fields but differing values, so there were a lot of common characters.
The problem is that these columns were going over the 64 KB limit for zstd, so the parquet writer was falling back to uncompressed.
Snappy does not appear to have this problem.
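For anyone reproducing this, the same per-column check that parquet-tools gives can be done with pyarrow; the file name below is just a placeholder.

```python
# Inspect the per-column-chunk codec and sizes in a Parquet file; a column that
# fell back to uncompressed shows a compressed size roughly equal to its
# uncompressed size in the metadata.
import pyarrow.parquet as pq

meta = pq.ParquetFile("gpu_zstd_output.parquet").metadata   # placeholder path
for rg in range(meta.num_row_groups):
    for ci in range(meta.num_columns):
        col = meta.row_group(rg).column(ci)
        print(col.path_in_schema, col.compression,
              col.total_compressed_size, col.total_uncompressed_size)
```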
Steps/Code to reproduce bug
I was able to reproduce this by generating a table with 32 rows of strings, where each string consists of a random 64-character prefix followed by an 8-character string repeated 248 times. I will attach a parquet file that reproduces the problem if you read it in and then write it out with zstd compression.
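A sketch of how such a file can be generated with pyarrow; the 8-character repeat string "ABCDEFGH" is an arbitrary stand-in, since the original value isn't given here. Reading the resulting file back in and writing it out with zstd on the GPU should then show whether the column compresses.

```python
# Generate 32 rows where each string is a random 64-character prefix followed by
# an 8-character string repeated 248 times (64 + 8*248 = 2048 characters per row),
# then write the table with zstd compression.
import random, string
import pyarrow as pa
import pyarrow.parquet as pq

def make_row():
    prefix = "".join(random.choices(string.ascii_letters + string.digits, k=64))
    return prefix + "ABCDEFGH" * 248      # arbitrary 8-character repeat string

table = pa.table({"data": [make_row() for _ in range(32)]})
pq.write_table(table, "repro_32_rows.parquet", compression="zstd")
```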
These were the results I got with 32 rows:
And this is what I got with 31 rows (which keeps it under the 64 KB limit):
Expected behavior
When you compress a file with zstd using the GPU, it should provide some compression, ideally comparable to the CPU.
Environment overview (please complete the following information)
I tested this with Spark using a snapshot of the Spark-rapids plugin running on a 22.12 cuDF snapshot.