[BUG] ORC compression of nested strings is worse than CPU and GPU parquet #13326
Thank you @revans2 for sharing this issue. It looks like the ~40 MiB files are the result of poor encoding and good compression, whereas the ~12 MiB files are benefiting from good encoding. I'm not sure why CPU-Spark supports better encoding for ORC but not Parquet, while cuDF supports better encoding for Parquet but not ORC. It appears that Arrow (pyarrow) doesn't support this encoding for either file type. I took a look at the example file and I see these data types in the columns:
At least the pyarrow ORC writer and cudf ORC writer perform similarly here. With ORC, the data doesn't benefit much from encoding, but compression is very important. I serialized the columns out as JSON to assess the raw size, and then as uncompressed and snappy-compressed ORC files.
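For reference, a minimal sketch of that size comparison using pyarrow. This is an assumption about the workflow, not the exact commands used; `example.orc` and the output paths are placeholders for the attached sample file.

```python
# Hedged sketch: serialize the rows as JSON for a raw-size baseline, then
# rewrite the same table as uncompressed and snappy-compressed ORC.
import json
import os
import pyarrow.orc as orc

table = orc.read_table("example.orc")  # placeholder for the attached file

# Raw-size baseline: one JSON document per row.
with open("raw.json", "w") as f:
    for row in table.to_pylist():
        f.write(json.dumps(row, default=str) + "\n")

# Re-write the same table with and without compression.
orc.write_table(table, "out_uncompressed.orc", compression="uncompressed")
orc.write_table(table, "out_snappy.orc", compression="snappy")

for path in ("raw.json", "out_uncompressed.orc", "out_snappy.orc"):
    print(path, os.path.getsize(path), "bytes")
```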
Looking at the same comparison with Parquet, and in this case cuDF provides very good encoding for
What is this encoding difference? Is this really a request to support dict-encoding for string children in Looking through
Perhaps the issue is that the 171k values are not always dict-encoded, based on dictionary size limits.
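For comparison, pyarrow's Parquet writer exposes the knobs that decide whether values stay dict-encoded. A hedged sketch follows; the column contents and limits are made up for illustration, and `dictionary_pagesize_limit` requires a reasonably recent pyarrow.

```python
# Illustration of how a dictionary size limit can decide whether a string
# column stays dictionary-encoded in Parquet (pyarrow writer).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": ["some repeated string"] * 171_000})

# Generous dictionary page size limit: values stay dict-encoded.
pq.write_table(table, "dict_large_limit.parquet",
               use_dictionary=True,
               dictionary_pagesize_limit=32 * 1024 * 1024)

# Tiny limit: the writer falls back to plain encoding once the dictionary
# page fills up, which is the kind of behavior speculated about above.
pq.write_table(table, "dict_small_limit.parquet",
               use_dictionary=True,
               dictionary_pagesize_limit=1024)
```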
@GregoryKimball the goal of this is to get the size of the files written by CUDF ORC to be on par with what Spark does on the CPU for ORC. How we do it is up for debate. I had Spark write the data both compressed (snappy) and uncompressed.
So snappy is getting about a 10.8x compression ratio on the CPU file, and a 15.1x compression ratio on the GPU file. It also appears that the encoding for Spark is about 5.6x better than the encoding for CUDF.
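A sketch of that experiment with PySpark, using placeholder paths; the 10.8x/15.1x/5.6x figures quoted above come from the actual files, not from this snippet.

```python
# Write the same data as snappy-compressed and uncompressed ORC, then compute
# the compression ratio from the on-disk sizes.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.orc("input.orc")  # placeholder input path

df.write.mode("overwrite").option("compression", "snappy").orc("orc_snappy")
df.write.mode("overwrite").option("compression", "none").orc("orc_uncompressed")

def dir_size(path):
    """Total bytes of all files under a local output directory."""
    return sum(os.path.getsize(os.path.join(root, name))
               for root, _, names in os.walk(path) for name in names)

# Compression ratio = uncompressed size / compressed size.
ratio = dir_size("orc_uncompressed") / dir_size("orc_snappy")
print(f"snappy compression ratio: {ratio:.1f}x")
```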
Looking at the file footer using the ORC tool
CPU_no_compress.txt
I can see a number of differences, but first let's make sure we are on the same page for the file, what the types are, and how they map to the files. The schema of the file is
Because CUDF does not support a
Now we can get into the details of the encodings and the size differences, broken down by ORC column.
I think from this it is very clear that each time the size of the data is much larger in the CUDF-generated file, the CPU selected to do dictionary encoding of the string column or child column, but CUDF did not.
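To make the size impact concrete, here is a back-of-the-envelope Python illustration of dictionary versus direct encoding for a repetitive string column. This is not cuDF or ORC internals; all values below are hypothetical.

```python
# Hypothetical low-cardinality string column: 150k rows, 3 distinct values.
values = ["status=OK", "status=RETRY", "status=FAIL"] * 50_000

# Direct encoding: every string is stored in full.
direct_size = sum(len(v) for v in values)

# Dictionary encoding: each distinct string stored once, plus a small
# integer index per row (assume 2 bytes per index here).
dictionary = sorted(set(values))
dict_size = sum(len(v) for v in dictionary) + 2 * len(values)

print(f"direct    : {direct_size:,} bytes")
print(f"dictionary: {dict_size:,} bytes")
print(f"ratio     : {direct_size / dict_size:.1f}x smaller with a dictionary")
```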
No. CUDF does support this. I guess my original comment didn't make that clear enough. The problem appears to be around limitations in CUDF when calculating the dictionary.
If any stripe would have more than 2^16 entries, dictionary encoding is not attempted. I speculate that if we get dictionary encoding working for columns of Strings with more than 2^16 entries per stripe, that would close most of this gap. I still am really curious about
While looking at the 2^16 limitation I found an error in dictionary cost computation. I fixed one part of it, but there seems to be more.
Because parquet-mr has a 1MB limit on the dictionary. The full dictionary for column_a.key_value.value is 32MB. If you bump up the limit (
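The sentence above is cut off, but the 1MB figure matches parquet-mr's dictionary page size. Assuming that is the limit being referred to (my assumption, since the property name is missing from the comment), one way to raise it from PySpark is via the `spark.hadoop.` config prefix, which copies settings into the Hadoop configuration used by the Parquet writer.

```python
# Sketch: raise the parquet-mr dictionary page size so a ~32 MB dictionary
# still fits. The property choice is an assumption based on the 1 MB default.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.parquet.dictionary.page.size",
                 str(64 * 1024 * 1024))  # default is 1 MiB
         .getOrCreate())

df = spark.read.orc("input.orc")  # placeholder input path
df.write.mode("overwrite").parquet("parquet_big_dict")
```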
…3580) Issue #13326, #10495 This PR reimplements creation of stripe dictionaries in ORC writer to eliminate row group size limitations. New implementation uses `cuco::static_map` in a way that's very similar to the Parquet writer. PR brings large performance gains because per-column X per-stripe sorting that invoked hundreds of thrust calls is now removed. Also verified that the original row group size limit (2^16) for dictionary encoding is removed, allowing dictionaries to be applicable to large lists of strings. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #13580
Closed by #13580
Describe the bug
When a customer was trying to write out data that is similar to this, they saw that the output size of the ORC data written with CUDF (Spark) was more than 2x that of the same data written on the CPU.
For this particular customer we are talking about TiB of difference. Not only was it more expensive to store the data, the size difference slowed down later jobs that read the data to the point that they could not win against the CPU in performance.
Looking at the footers for the files in question it looks like the GPU is not doing dictionary encoding whereas the CPU is. Looking at the GPU code it is clear that we don't try to do dictionary encoding for ORC if there are more rows than would fit in a uint16_t:

cudf/cpp/src/io/orc/writer_impl.cu, line 2112 in ac158da
The default stripe size (at the top level) is 100,000 rows, which should allow dictionary encoding for all columns, but if a nested column with a LIST or a MAP in it has on average more than 7 entries it will not even be considered for dictionary encoding. I think it has something to do with wanting to compute the dictionary chunks in a single kernel call, saving memory while doing it, and not needing to compute 32-bit indexes as temporary values when trying to get to 16-bit indexes.
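The arithmetic behind that "more than 7 entries" observation, assuming the uint16_t check applies to the number of child entries in a chunk of roughly 10,000 rows (ORC's default row index stride); the exact granularity is decided by the linked code, so treat these numbers as illustrative.

```python
# Worked arithmetic for the uint16_t dictionary-index limit.
UINT16_MAX = 2**16 - 1          # 65,535: largest index a uint16_t can hold

rows_per_chunk = 10_000         # assumed row chunk size for the check
max_avg_list_len = UINT16_MAX / rows_per_chunk
print(f"avg LIST/MAP length above which dictionaries are skipped: "
      f"{max_avg_list_len:.2f}")   # ~6.6, i.e. "more than ~7 entries"

# Example: a MAP column averaging 10 key/value pairs per row.
child_entries = rows_per_chunk * 10
print(child_entries > UINT16_MAX)  # True -> dictionary encoding not attempted
```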
Whatever the reason, it results in much larger files. I would be willing to accept a slightly slower run time for ORC compression and slightly more memory if it would allow us to compress some of these columns in these different cases.
Perhaps we could have an alternative path for string columns that have too many entries in them, instead of just skipping them altogether.
Steps/Code to reproduce bug
Take the attached file and rewrite the data using the GPU. For Spark it is just a normal transcode, but for CUDF and the Python API I am not 100% sure. I will try to make it happen, but I wanted to file this sooner rather than later.
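For the cudf Python API, a possible transcode sketch (untested here; `repro.orc` stands in for the attached file, and the data may still need Spark to produce the exact schema described above):

```python
# Hedged reproduction sketch: read the attached ORC file with cudf and
# rewrite it on the GPU, then compare output sizes against the CPU file.
import cudf

df = cudf.read_orc("repro.orc")          # placeholder for the attached file
df.to_orc("repro_gpu.orc", compression="snappy")
```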
Expected behavior
The ORC files produced would be much closer in size to the files produced by the CPU. I know that we might not be smaller, but much closer in size would be good. It would be nice if we could get them close to the size of the same files written with Parquet on the GPU, which is much better than the CPU/Spark implementation, so I know we should be able to do something.