[BUG] ORC compression of nested strings is worse than CPU and GPU parquet #13326

Closed

revans2 opened this issue May 10, 2023 · 5 comments

Labels: 2 - In Progress, bug, cuIO, libcudf, Spark

revans2 (Contributor) commented May 10, 2023

Describe the bug
A customer writing out data similar to the attached sample saw that the output size of the ORC data written with CUDF (Spark) was more than 2x that of the same data written on the CPU.

For this particular customer we are talking about TiBs of difference. Not only was the data more expensive to store, the size difference slowed down later jobs that read the data enough that they could no longer beat the CPU on performance.

Looking at the footers for the files in question, it looks like the GPU is not doing dictionary encoding whereas the CPU is. Looking at the GPU code, it is clear that we don't even try dictionary encoding for ORC if there are more rows than would fit in a uint16_t:

return rowgroup_bounds[rg_idx][col_idx].size() < std::numeric_limits<uint16_t>::max();

The default stripe size (at the top level) is 100,000 rows, which should allow dictionary encoding for all columns, but if a nested column with a LIST or a MAP in it has on average more than 7 entries per row it will not even be considered for dictionary encoding. I think this has something to do with wanting to compute the dictionary chunks in a single kernel call, saving memory while doing it, and not needing to compute 32-bit indices as temporary values when the final goal is 16-bit indices.

Whatever the reason, it results in much larger files. I would be willing to accept a slightly slower runtime for ORC compression, and slightly more memory use, if it would allow us to compress some of these columns in cases like this.

Perhaps we could have an alternative path for string columns that have too many entries instead of just skipping them altogether.

Steps/Code to reproduce bug
Take the attached file and rewrite the data using the GPU. For Spark it is just a normal transcode, but for CUDF and the Python API I am not 100% sure. I will try to make it happen, but I wanted to file this sooner rather than later.
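
For reference, a minimal transcode sketch with the cuDF Python API might look like the following (the `repro.orc` path is a placeholder for the attached file, not a confirmed repro script):

```python
import cudf

# placeholder paths; "repro.orc" stands in for the attached sample data
df = cudf.read_orc("repro.orc")                    # decode on the GPU
df.to_orc("repro_gpu.orc", compression="snappy")   # re-encode with the cuDF ORC writer
```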

Expected behavior
The ORC files produced would be much closer in size to the files produced by the CPU. I know that we might not end up smaller, but much closer would be good. It would be nice if we could get them close to the size of the same files written with Parquet on the GPU, which is much better than the CPU/Spark implementation, so I know we should be able to do something.

|         | GPU    | CPU (Spark) |
| ------- | ------ | ----------- |
| ORC     | 46 MiB | 12 MiB      |
| Parquet | 15 MiB | 42 MiB      |
revans2 added the bug, Needs Triage, and Spark labels on May 10, 2023
GregoryKimball (Contributor) commented May 10, 2023

Thank you @revans2 for sharing this issue. It looks like the ~40 MiB files are the result of poor encoding and good compression, whereas the ~12 MiB files are benefiting from good encoding. I'm not sure why CPU-Spark supports better encoding for ORC but not Parquet, while cuDF supports better encoding for Parquet but not ORC. It appears that Arrow (pyarrow) doesn't support this encoding for either file type.

I took a look at the example file and I see these data types in the columns:

  • column_a: ListDtype(StructDtype({'key': dtype('str'), 'value': dtype('str')}))
  • column_b: dtype('str')
  • column_c: ListDtype(StructDtype({'key': dtype('str'), 'value': dtype('str')}))
  • column_d: ListDtype(StructDtype({'key': dtype('str'), 'value': ListDtype('str')}))
  • column_e: ListDtype(StructDtype({'key': dtype('str'), 'value': ListDtype('str')}))

At least the pyarrow ORC writer and cudf ORC writer perform similarly here. With ORC, the data doesn't benefit much from encoding, but compression is very important. I serialized the columns out as JSON to assess the raw size, and then as uncompressed and snappy-compressed ORC files.

cudf (json) column_a 306.62695 MB
cudf (json) column_b 1.129174 MB
cudf (json) column_c 86.475776 MB
cudf (json) column_d 340.906389 MB
cudf (json) column_e 72.524789 MB

cudf (orc, comp=none) column_a 273.194111 MB
cudf (orc, comp=none) column_b 0.723437 MB
cudf (orc, comp=none) column_c 65.912865 MB
cudf (orc, comp=none) column_d 325.735563 MB
cudf (orc, comp=none) column_e 61.016666 MB

arrow (orc, comp=none) column_a 272.064602 MB
arrow (orc, comp=none) column_b 0.723048 MB
arrow (orc, comp=none) column_c 64.935149 MB
arrow (orc, comp=none) column_d 325.526691 MB
arrow (orc, comp=none) column_e 60.646472 MB

cudf (orc, comp=snappy) column_a 19.937584 MB
cudf (orc, comp=snappy) column_b 0.090547 MB
cudf (orc, comp=snappy) column_c 6.171918 MB
cudf (orc, comp=snappy) column_d 18.118038 MB
cudf (orc, comp=snappy) column_e 3.684152 MB

arrow (orc, comp=snappy) column_a 18.49074 MB
arrow (orc, comp=snappy) column_b 0.080907 MB
arrow (orc, comp=snappy) column_c 5.880804 MB
arrow (orc, comp=snappy) column_d 17.635911 MB
arrow (orc, comp=snappy) column_e 3.181155 MB

Looking at the same comparison with Parquet, cuDF provides very good encoding for column_a, column_d, and column_e.

cudf (pq, comp=none) column_a 39.80402 MB
cudf (pq, comp=none) column_b 0.041326 MB
cudf (pq, comp=none) column_c 37.752079 MB
cudf (pq, comp=none) column_d 44.747784 MB
cudf (pq, comp=none) column_e 4.86588 MB

arrow (pq, comp=none) column_a 259.281885 MB
arrow (pq, comp=none) column_b 0.046871 MB
arrow (pq, comp=none) column_c 64.271626 MB
arrow (pq, comp=none) column_d 313.719652 MB
arrow (pq, comp=none) column_e 40.107961 MB

cudf (pq, comp=snappy) column_a 6.707211 MB
cudf (pq, comp=snappy) column_b 0.031987 MB
cudf (pq, comp=snappy) column_c 4.989441 MB
cudf (pq, comp=snappy) column_d 2.961035 MB
cudf (pq, comp=snappy) column_e 0.445479 MB

arrow (pq, comp=snappy) column_a 18.90587 MB
arrow (pq, comp=snappy) column_b 0.032212 MB
arrow (pq, comp=snappy) column_c 5.880203 MB
arrow (pq, comp=snappy) column_d 17.192449 MB
arrow (pq, comp=snappy) column_e 2.20101 MB
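
A rough sketch of how this per-column size comparison could be scripted with cudf and pyarrow (file names are made up, and passing `compression` to `pyarrow.orc.write_table` assumes a reasonably recent pyarrow):

```python
import os

import cudf
import pyarrow.orc as orc
import pyarrow.parquet as pq

df = cudf.read_orc("repro.orc")  # placeholder path for the attached data

for name in df.columns:
    col = df[[name]]

    # raw size as JSON, then ORC/Parquet through the cudf writers
    col.to_json(f"{name}.json")
    col.to_orc(f"{name}.cudf.orc", compression=None)
    col.to_orc(f"{name}.cudf.snappy.orc", compression="snappy")
    col.to_parquet(f"{name}.cudf.snappy.parquet", compression="snappy")

    # the same column through the pyarrow writers
    tbl = col.to_arrow()
    orc.write_table(tbl, f"{name}.arrow.orc", compression="uncompressed")
    pq.write_table(tbl, f"{name}.arrow.snappy.parquet", compression="snappy")

    for path in (f"{name}.json", f"{name}.cudf.orc", f"{name}.cudf.snappy.orc",
                 f"{name}.cudf.snappy.parquet", f"{name}.arrow.orc",
                 f"{name}.arrow.snappy.parquet"):
        print(path, os.path.getsize(path) / 1e6, "MB")
```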

What is this encoding difference? Is this really a request to support dict-encoding for string children in List<Struct> types in ORC?

Looking through column_a, there are:

  • 38009 rows
  • 1740543 key-value pairs
  • 333 unique keys
  • 171381 unique values

Perhaps the issue is that the 171k values are not always dict-encoded, based on dictionary size limits.
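
As a sanity check, these counts can be gathered with something along these lines (a sketch, assuming the attached data is readable as `repro.orc`):

```python
import cudf

df = cudf.read_orc("repro.orc")        # placeholder path for the attached data
col_a = df["column_a"]                 # List<Struct<key: str, value: str>>

pairs = col_a.explode()                # one element per key-value pair
keys = pairs.struct.field("key")
values = pairs.struct.field("value")

print("rows:", len(col_a))
print("key-value pairs:", len(pairs.dropna()))
print("unique keys:", keys.nunique())
print("unique values:", values.nunique())
```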

revans2 (Contributor, Author) commented May 11, 2023

@GregoryKimball the goal of this is to get the size of the files written by CUDF ORC to be on par with what Spark does on the CPU for ORC. How we do it is up for debate.

I had Spark write the data both compressed (snappy) and uncompressed.

125M	CPU_NO_COMPRESS_OUTPUT
12M	CPU_SNAPPY_OUTPUT
699M	GPU_NO_COMPRESS_OUTPUT
47M	GPU_SNAPPY_OUTPUT
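
For context, the CPU writes were plain PySpark transcodes along these lines (paths are placeholders; the GPU outputs come from the same job with the RAPIDS Accelerator enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.orc("INPUT_ORC")  # placeholder input path

# uncompressed and snappy-compressed ORC outputs
df.write.option("compression", "none").orc("CPU_NO_COMPRESS_OUTPUT")
df.write.option("compression", "snappy").orc("CPU_SNAPPY_OUTPUT")
```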

So snappy is getting about a 10.8x compression ratio on the CPU file, and a 15.1x compression ratio on the GPU file. It also appears that the encoding for Spark is about 5.6x better than the encoding for CUDF.

> What is this encoding difference?

Looking at the file footer using the ORC tool

java -jar ~/src/orc/java/tools/target/orc-tools-1.8.0-uber.jar meta CPU_NO_COMPRESS_OUTPUT/*.orc
java -jar ~/src/orc/java/tools/target/orc-tools-1.8.0-uber.jar meta GPU_NO_COMPRESS_OUTPUT/*.orc

CPU_no_compress.txt
GPU_no_compress.txt

I can see a number of differences, but first let's make sure we are on the same page about the file: what the types are and how they map to the ORC columns.

The schema of the file is

struct<
  column_a:map<string,string>,
  column_b:string,
  column_c:map<string,string>,
  column_d:map<string,array<string>>,
  column_e:map<string,array<string>>
>

Because CUDF does not support a MAP type, we represent it as a LIST<STRUCT<Key, Value>>, where Key and Value are whatever the key and value types are for the MAP.
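
In Arrow terms the two layouts look roughly like this (an illustrative sketch, not code from either writer):

```python
import pyarrow as pa

# ORC/Spark view of the column
orc_map = pa.map_(pa.string(), pa.string())

# cuDF's internal representation of the same data
cudf_repr = pa.list_(pa.struct([("key", pa.string()), ("value", pa.string())]))
```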

  • Column 0 in the ORC file is the top level struct for all of the other columns.
  • Column 1 in the ORC file corresponds to the top level MAP of column_a (ORC does have top-level support for MAP, so when it is written by Spark, or by CUDF with the proper config settings, it looks like a MAP).
  • Column 2 is the KEY of column_a
  • Column 3 is the VALUE of column_a
  • Column 4 is column_b
  • Column 5 is the top level MAP of column_c
  • Column 6 is the KEY of column_c
  • Column 7 is the VALUE of column_c
  • Column 8 is the top level MAP of column_d
  • Column 9 is the KEY of column_d
  • Column 10 is the VALUE of column_d which is also a LIST
  • Column 11 holds the strings from the LIST VALUE of column_d
  • Column 12 is the top level MAP of column_e
  • Column 13 is the KEY of column_e
  • Column 14 is the VALUE of column_e which is also a LIST
  • Column 15 holds the strings from the LIST VALUE of column_e

Now we can get into the details of the encodings and the size differences, broken down by ORC column.

| ORC Column | CPU Size | CPU Encoding | GPU Size | GPU Encoding | GPU Size / CPU Size |
| --- | --- | --- | --- | --- | --- |
| 0 | 45 | DIRECT | 16 | DIRECT | 0.36 |
| 1 | 15301 | DIRECT_V2 | 15383 | DIRECT_V2 | 1.00 |
| 2 | 2992296 | DICTIONARY_V2[312] | 13615057 | DIRECT_V2 | 4.55 |
| 3 | 38670664 | DICTIONARY_V2[91503] | 259564650 | DIRECT_V2 | 6.71 |
| 4 | 40930 | DICTIONARY_V2[500] | 722803 | DIRECT_V2 | 17.66 |
| 5 | 12228 | DIRECT_V2 | 12411 | DIRECT_V2 | 1.01 |
| 6 | 1050398 | DICTIONARY_V2[28] | 3941709 | DIRECT_V2 | 3.75 |
| 7 | 36860151 | DICTIONARY_V2[73589] | 61959751 | DIRECT_V2 | 1.68 |
| 8 | 12124 | DIRECT_V2 | 12286 | DIRECT_V2 | 1.01 |
| 9 | 707490 | DICTIONARY_V2[46] | 8409000 | DIRECT_V2 | 11.89 |
| 10 | 10463 | DIRECT_V2 | 12010 | DIRECT_V2 | 1.15 |
| 11 | 44026301 | DICTIONARY_V2[13608] | 317289877 | DIRECT_V2 | 7.21 |
| 12 | 738 | DIRECT_V2 | 844 | DIRECT_V2 | 1.14 |
| 13 | 268349 | DICTIONARY_V2[14] | 7964282 | DIRECT_V2 | 29.68 |
| 14 | 4299 | DIRECT_V2 | 5238 | DIRECT_V2 | 1.22 |
| 15 | 4591855 | DICTIONARY_V2[2981] | 53045600 | DIRECT_V2 | 11.55 |

I think it is very clear from this that every time the data is much larger in the CUDF-generated file, the CPU selected dictionary encoding for the string column or child column but CUDF did not.

> Is this really a request to support dict-encoding for string children in List types in ORC?

No. CUDF does support this. I guess my original comment didn't make that clear enough. The problem appears to be around limitations in CUDF when calculating the dictionary.

> Perhaps the issue is that the 171k values are not always dict-encoded, based on dictionary size limits.

If any stripe would have more than 2^16 string rows in any column (child or otherwise), dictionary encoding is not even tried. That is what happens for all of the columns except column_b (Column 4 in the ORC file). That one appears to be related to a check that happens later, after calculating the dictionary, where CUDF decides that the "cost" of the dictionary is higher than the "cost" without it. I did not dig into this code enough to really understand it, but with the CPU output being 17x smaller than the GPU output for that column, I suspect there is something odd with how we do dictionary encoding for ORC that makes it larger than it needs to be, or our cost calculations are off somehow.

I speculate that getting dictionary encoding working for string columns with more than 2^16 entries will fix the majority of the problem. But because of memory and computation constraints it is likely going to have to be a separate code path from the current one. The current code allocates the memory for the dictionaries up front and then appears to try to calculate them all in one kernel. For memory reasons it might be better to first do a distinct count on each large string column to see if it is worth building a dictionary for it, and if it is, then go through a kernel (possibly one column at a time) that calculates the dictionary. But that is just me speculating.
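
As a host-side illustration of that "distinct count first" idea (purely speculative, not the libcudf code path), the check could look something like this, where `dictionary_looks_worthwhile` and the 0.5 cutoff are made up for the example:

```python
import cudf

def dictionary_looks_worthwhile(s: cudf.Series, max_ratio: float = 0.5) -> bool:
    """Heuristic: a dictionary is only worth building if the column has
    relatively few distinct values (max_ratio is an arbitrary cutoff)."""
    n = len(s.dropna())
    return n > 0 and s.nunique() / n <= max_ratio

# e.g. the string leaves of column_a's values ("repro.orc" is a placeholder path)
leaf = cudf.read_orc("repro.orc")["column_a"].explode().struct.field("value")
print(dictionary_looks_worthwhile(leaf))
```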

I am still really curious about column_b, which would not be impacted by the above change. I want to understand why having 500 distinct entries in a stripe of 10,000 rows is not enough to pay for the "cost" of the dictionary.

vuule (Contributor) commented May 11, 2023

> I still am really curious about column_b which would not be impacted by the above change. I want to understand why having 500 distinct entries in a stripe of 10000 rows is not enough to pay for the "cost" of the dictionary.

While looking at the 2^16 limitation I found an error in dictionary cost computation. I fixed one part of it, but there seems to be more.

etseidl (Contributor) commented May 11, 2023

> I'm not sure why CPU-Spark supports better encoding for ORC and not Parquet

Because parquet-mr has a 1 MB limit on the dictionary. The full dictionary for column_a.key_value.value is 32 MB. If you bump up the limit (option("parquet.dictionary.page.size", "40000000")), then Spark winds up with a 15.8 MB file, comparable in size to what cudf is producing.
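
For anyone who wants to try that, the workaround looks roughly like this in PySpark (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# raise parquet-mr's dictionary page size limit before writing Parquet
spark.read.orc("INPUT_ORC") \
    .write \
    .option("parquet.dictionary.page.size", "40000000") \
    .parquet("CPU_PARQUET_BIG_DICT_OUTPUT")  # placeholder output path
```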

GregoryKimball added the 2 - In Progress, libcudf, and cuIO labels and removed the Needs Triage label on Jun 7, 2023
GregoryKimball moved this to In progress in libcudf on Jun 7, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2023
…3580)

Issue #13326, #10495

This PR reimplements the creation of stripe dictionaries in the ORC writer to eliminate row group size limitations.
The new implementation uses `cuco::static_map` in a way that's very similar to the Parquet writer.

The PR brings large performance gains because the per-column, per-stripe sorting that invoked hundreds of thrust calls is now removed.
It was also verified that the original row group size limit (2^16) for dictionary encoding is removed, allowing dictionaries to be applied to large lists of strings.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13580
GregoryKimball (Contributor) commented

Closed by #13580
