[BUG] ORC compression of nested strings is worse than CPU and GPU parquet #13326

Closed

revans2 opened this issue May 10, 2023 · 5 comments

Labels: 2 - In Progress, bug, cuIO, libcudf, Spark

revans2 (Contributor) commented May 10, 2023

Describe the bug
A customer writing out data similar to the attached sample saw that the output size of the ORC data written with CUDF (Spark) was more than 2x that of the same data written on the CPU.

For this particular customer we are talking about TiBs of difference. Not only was the data more expensive to store, the size difference slowed down later jobs that read the data enough that they could no longer beat the CPU on performance.

Looking at the footers for the files in question, it looks like the GPU is not doing dictionary encoding whereas the CPU is. Looking at the GPU code, it is clear that we don't even try dictionary encoding for ORC if there are more rows than would fit in a uint16_t:

return rowgroup_bounds[rg_idx][col_idx].size() < std::numeric_limits<uint16_t>::max();

The default stripe size (at the top level) is 100,000 rows, which should allow dictionary encoding for all columns, but if a nested column with a LIST or a MAP in it has on average more than 7 entries per row it will not even be considered for dictionary encoding. I think this has something to do with wanting to compute the dictionary chunks in a single kernel call, saving memory while doing it, and not needing to compute 32-bit indices as temporary values when the final goal is 16-bit indices.

Whatever the reason, it results in much larger files. I would be willing to accept a slightly slower runtime for ORC compression, and slightly more memory use, if it would allow us to compress some of these columns in cases like this.

Perhaps we could have an alternative path for string columns that have too many entries instead of just skipping them altogether.

Steps/Code to reproduce bug
Take the attached file and rewrite the data using the GPU. For Spark it is just a normal transcode, but for CUDF and the Python API I am not 100% sure. I will try to make it happen, but I wanted to file this sooner rather than later.
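
For reference, a minimal transcode sketch with the cuDF Python API might look like the following (the `repro.orc` path is a placeholder for the attached file, not a confirmed repro script):

```python
import cudf

# placeholder paths; "repro.orc" stands in for the attached sample data
df = cudf.read_orc("repro.orc")                    # decode on the GPU
df.to_orc("repro_gpu.orc", compression="snappy")   # re-encode with the cuDF ORC writer
```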

Expected behavior
The ORC files produced would be much closer in size to the files produced by the CPU. I know that we might not end up smaller, but much closer would be good. It would be nice if we could get them close to the size of the same files written with Parquet on the GPU, which is much better than the CPU/Spark implementation, so I know we should be able to do something.

|         | GPU    | CPU (Spark) |
| ------- | ------ | ----------- |
| ORC     | 46 MiB | 12 MiB      |
| Parquet | 15 MiB | 42 MiB      |
revans2 added the bug, Needs Triage, and Spark labels on May 10, 2023
GregoryKimball (Contributor) commented May 10, 2023

Thank you @revans2 for sharing this issue. It looks like the ~40 MiB files are the result of poor encoding and good compression, whereas the ~12 MiB files are benefiting from good encoding. I'm not sure why CPU-Spark supports better encoding for ORC but not Parquet, while cuDF supports better encoding for Parquet but not ORC. It appears that Arrow (pyarrow) doesn't support this encoding for either file type.

I took a look at the example file and I see these data types in the columns:

  • column_a: ListDtype(StructDtype({'key': dtype('str'), 'value': dtype('str')}))
  • column_b: dtype('str')
  • column_c: ListDtype(StructDtype({'key': dtype('str'), 'value': dtype('str')}))
  • column_d: ListDtype(StructDtype({'key': dtype('str'), 'value': ListDtype('str')}))
  • column_e: ListDtype(StructDtype({'key': dtype('str'), 'value': ListDtype('str')}))

At least the pyarrow ORC writer and cudf ORC writer perform similarly here. With ORC, the data doesn't benefit much from encoding, but compression is very important. I serialized the columns out as JSON to assess the raw size, and then as uncompressed and snappy-compressed ORC files.

cudf (json) column_a 306.62695 MB
cudf (json) column_b 1.129174 MB
cudf (json) column_c 86.475776 MB
cudf (json) column_d 340.906389 MB
cudf (json) column_e 72.524789 MB

cudf (orc, comp=none) column_a 273.194111 MB
cudf (orc, comp=none) column_b 0.723437 MB
cudf (orc, comp=none) column_c 65.912865 MB
cudf (orc, comp=none) column_d 325.735563 MB
cudf (orc, comp=none) column_e 61.016666 MB

arrow (orc, comp=none) column_a 272.064602 MB
arrow (orc, comp=none) column_b 0.723048 MB
arrow (orc, comp=none) column_c 64.935149 MB
arrow (orc, comp=none) column_d 325.526691 MB
arrow (orc, comp=none) column_e 60.646472 MB

cudf (orc, comp=snappy) column_a 19.937584 MB
cudf (orc, comp=snappy) column_b 0.090547 MB
cudf (orc, comp=snappy) column_c 6.171918 MB
cudf (orc, comp=snappy) column_d 18.118038 MB
cudf (orc, comp=snappy) column_e 3.684152 MB

arrow (orc, comp=snappy) column_a 18.49074 MB
arrow (orc, comp=snappy) column_b 0.080907 MB
arrow (orc, comp=snappy) column_c 5.880804 MB
arrow (orc, comp=snappy) column_d 17.635911 MB
arrow (orc, comp=snappy) column_e 3.181155 MB

Looking at the same comparison with Parquet, cuDF provides very good encoding for column_a, column_d, and column_e.

cudf (pq, comp=none) column_a 39.80402 MB
cudf (pq, comp=none) column_b 0.041326 MB
cudf (pq, comp=none) column_c 37.752079 MB
cudf (pq, comp=none) column_d 44.747784 MB
cudf (pq, comp=none) column_e 4.86588 MB

arrow (pq, comp=none) column_a 259.281885 MB
arrow (pq, comp=none) column_b 0.046871 MB
arrow (pq, comp=none) column_c 64.271626 MB
arrow (pq, comp=none) column_d 313.719652 MB
arrow (pq, comp=none) column_e 40.107961 MB

cudf (pq, comp=snappy) column_a 6.707211 MB
cudf (pq, comp=snappy) column_b 0.031987 MB
cudf (pq, comp=snappy) column_c 4.989441 MB
cudf (pq, comp=snappy) column_d 2.961035 MB
cudf (pq, comp=snappy) column_e 0.445479 MB

arrow (pq, comp=snappy) column_a 18.90587 MB
arrow (pq, comp=snappy) column_b 0.032212 MB
arrow (pq, comp=snappy) column_c 5.880203 MB
arrow (pq, comp=snappy) column_d 17.192449 MB
arrow (pq, comp=snappy) column_e 2.20101 MB
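
A rough sketch of how this per-column size comparison could be scripted with cudf and pyarrow (file names are made up, and passing `compression` to `pyarrow.orc.write_table` assumes a reasonably recent pyarrow):

```python
import os

import cudf
import pyarrow.orc as orc
import pyarrow.parquet as pq

df = cudf.read_orc("repro.orc")  # placeholder path for the attached data

for name in df.columns:
    col = df[[name]]

    # raw size as JSON, then ORC/Parquet through the cudf writers
    col.to_json(f"{name}.json")
    col.to_orc(f"{name}.cudf.orc", compression=None)
    col.to_orc(f"{name}.cudf.snappy.orc", compression="snappy")
    col.to_parquet(f"{name}.cudf.snappy.parquet", compression="snappy")

    # the same column through the pyarrow writers
    tbl = col.to_arrow()
    orc.write_table(tbl, f"{name}.arrow.orc", compression="uncompressed")
    pq.write_table(tbl, f"{name}.arrow.snappy.parquet", compression="snappy")

    for path in (f"{name}.json", f"{name}.cudf.orc", f"{name}.cudf.snappy.orc",
                 f"{name}.cudf.snappy.parquet", f"{name}.arrow.orc",
                 f"{name}.arrow.snappy.parquet"):
        print(path, os.path.getsize(path) / 1e6, "MB")
```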

What is this encoding difference? Is this really a request to support dict-encoding for string children in List<Struct> types in ORC?

Looking through column_a, there are:

  • 38009 rows
  • 1740543 key-value pairs
  • 333 unique keys
  • 171381 unique values

Perhaps the issue is that the 171k values are not always dict-encoded, based on dictionary size limits.
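
As a sanity check, these counts can be gathered with something along these lines (a sketch, assuming the attached data is readable as `repro.orc`):

```python
import cudf

df = cudf.read_orc("repro.orc")        # placeholder path for the attached data
col_a = df["column_a"]                 # List<Struct<key: str, value: str>>

pairs = col_a.explode()                # one element per key-value pair
keys = pairs.struct.field("key")
values = pairs.struct.field("value")

print("rows:", len(col_a))
print("key-value pairs:", len(pairs.dropna()))
print("unique keys:", keys.nunique())
print("unique values:", values.nunique())
```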

revans2 (Contributor, Author) commented May 11, 2023

@GregoryKimball the goal of this is to get the size of the files written by CUDF ORC to be on par with what Spark does on the CPU for ORC. How we do it is up for debate.

I had Spark write the data both compressed (snappy) and uncompressed.

125M	CPU_NO_COMPRESS_OUTPUT
12M	CPU_SNAPPY_OUTPUT
699M	GPU_NO_COMPRESS_OUTPUT
47M	GPU_SNAPPY_OUTPUT
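
For context, the CPU writes were plain PySpark transcodes along these lines (paths are placeholders; the GPU outputs come from the same job with the RAPIDS Accelerator enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.orc("INPUT_ORC")  # placeholder input path

# uncompressed and snappy-compressed ORC outputs
df.write.option("compression", "none").orc("CPU_NO_COMPRESS_OUTPUT")
df.write.option("compression", "snappy").orc("CPU_SNAPPY_OUTPUT")
```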

So snappy is getting about a 10.8x compression ratio on the CPU file, and a 15.1x compression ratio on the GPU file. It also appears that the encoding for Spark is about 5.6x better than the encoding for CUDF.

> What is this encoding difference?

Looking at the file footer using the ORC tool

java -jar ~/src/orc/java/tools/target/orc-tools-1.8.0-uber.jar meta CPU_NO_COMPRESS_OUTPUT/*.orc
java -jar ~/src/orc/java/tools/target/orc-tools-1.8.0-uber.jar meta GPU_NO_COMPRESS_OUTPUT/*.orc

CPU_no_compress.txt
GPU_no_compress.txt

I can see a number of differences, but first let's make sure we are on the same page about the file: what the types are and how they map to the ORC columns.

The schema of the file is

struct<
  column_a:map<string,string>,
  column_b:string,
  column_c:map<string,string>,
  column_d:map<string,array<string>>,
  column_e:map<string,array<string>>
>

Because CUDF does not support a MAP type, we represent it as a LIST<STRUCT<Key, Value>>, where Key and Value are whatever the key and value types are for the MAP.
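
In Arrow terms the two layouts look roughly like this (an illustrative sketch, not code from either writer):

```python
import pyarrow as pa

# ORC/Spark view of the column
orc_map = pa.map_(pa.string(), pa.string())

# cuDF's internal representation of the same data
cudf_repr = pa.list_(pa.struct([("key", pa.string()), ("value", pa.string())]))
```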

  • Column 0 in the ORC file is the top level struct for all of the other columns.
  • Column 1 in the ORC file corresponds to the top level MAP of column_a (ORC does have top-level support for MAP, so when it is written by Spark, or by CUDF with the proper config settings, it looks like a MAP).
  • Column 2 is the KEY of column_a
  • Column 3 is the VALUE of column_a
  • Column 4 is column_b
  • Column 5 is the top level MAP of column_c
  • Column 6 is the KEY of column_c
  • Column 7 is the VALUE of column_c
  • Column 8 is the top level MAP of column_d
  • Column 9 is the KEY of column_d
  • Column 10 is the VALUE of column_d which is also a LIST
  • Column 11 holds the strings from the LIST VALUE of column_d
  • Column 12 is the top level MAP of column_e
  • Column 13 is the KEY of column_e
  • Column 14 is the VALUE of column_e which is also a LIST
  • Column 15 holds the strings from the LIST VALUE of column_e

Now we can get into the details of the encodings and the size differences, broken down by ORC column.

| ORC Column | CPU Size | CPU Encoding | GPU Size | GPU Encoding | GPU Size / CPU Size |
| --- | --- | --- | --- | --- | --- |
| 0 | 45 | DIRECT | 16 | DIRECT | 0.36 |
| 1 | 15301 | DIRECT_V2 | 15383 | DIRECT_V2 | 1.00 |
| 2 | 2992296 | DICTIONARY_V2[312] | 13615057 | DIRECT_V2 | 4.55 |
| 3 | 38670664 | DICTIONARY_V2[91503] | 259564650 | DIRECT_V2 | 6.71 |
| 4 | 40930 | DICTIONARY_V2[500] | 722803 | DIRECT_V2 | 17.66 |
| 5 | 12228 | DIRECT_V2 | 12411 | DIRECT_V2 | 1.01 |
| 6 | 1050398 | DICTIONARY_V2[28] | 3941709 | DIRECT_V2 | 3.75 |
| 7 | 36860151 | DICTIONARY_V2[73589] | 61959751 | DIRECT_V2 | 1.68 |
| 8 | 12124 | DIRECT_V2 | 12286 | DIRECT_V2 | 1.01 |
| 9 | 707490 | DICTIONARY_V2[46] | 8409000 | DIRECT_V2 | 11.89 |
| 10 | 10463 | DIRECT_V2 | 12010 | DIRECT_V2 | 1.15 |
| 11 | 44026301 | DICTIONARY_V2[13608] | 317289877 | DIRECT_V2 | 7.21 |
| 12 | 738 | DIRECT_V2 | 844 | DIRECT_V2 | 1.14 |
| 13 | 268349 | DICTIONARY_V2[14] | 7964282 | DIRECT_V2 | 29.68 |
| 14 | 4299 | DIRECT_V2 | 5238 | DIRECT_V2 | 1.22 |
| 15 | 4591855 | DICTIONARY_V2[2981] | 53045600 | DIRECT_V2 | 11.55 |

I think it is very clear from this that every time the data is much larger in the CUDF-generated file, the CPU selected dictionary encoding for the string column or child column but CUDF did not.

> Is this really a request to support dict-encoding for string children in List types in ORC?

No. CUDF does support this. I guess my original comment didn't make that clear enough. The problem appears to be around limitations in CUDF when calculating the dictionary.

> Perhaps the issue is that the 171k values are not always dict-encoded, based on dictionary size limits.

If any stripe would have more than 2^16 string rows in any column (child or otherwise), dictionary encoding is not even tried. That is what happens for all of the columns except column_b (Column 4 in the ORC file). That one appears to be related to a check that happens later, after calculating the dictionary, where CUDF decides that the "cost" of the dictionary is higher than the "cost" without it. I did not dig into this code enough to really understand it, but with the CPU output being 17x smaller than the GPU output for that column, I suspect there is something odd with how we do dictionary encoding for ORC that makes it larger than it needs to be, or our cost calculations are off somehow.

I speculate that getting dictionary encoding working for string columns with more than 2^16 entries will fix the majority of the problem. But because of memory and computation constraints it is likely going to have to be a separate code path from the current one. The current code allocates the memory for the dictionaries up front and then appears to try to calculate them all in one kernel. For memory reasons it might be better to first do a distinct count on each large string column to see if it is worth building a dictionary for it, and if it is, then go through a kernel (possibly one column at a time) that calculates the dictionary. But that is just me speculating.
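
As a host-side illustration of that "distinct count first" idea (purely speculative, not the libcudf code path), the check could look something like this, where `dictionary_looks_worthwhile` and the 0.5 cutoff are made up for the example:

```python
import cudf

def dictionary_looks_worthwhile(s: cudf.Series, max_ratio: float = 0.5) -> bool:
    """Heuristic: a dictionary is only worth building if the column has
    relatively few distinct values (max_ratio is an arbitrary cutoff)."""
    n = len(s.dropna())
    return n > 0 and s.nunique() / n <= max_ratio

# e.g. the string leaves of column_a's values ("repro.orc" is a placeholder path)
leaf = cudf.read_orc("repro.orc")["column_a"].explode().struct.field("value")
print(dictionary_looks_worthwhile(leaf))
```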

I am still really curious about column_b, which would not be impacted by the above change. I want to understand why having 500 distinct entries in a stripe of 10,000 rows is not enough to pay for the "cost" of the dictionary.

vuule (Contributor) commented May 11, 2023

> I still am really curious about column_b which would not be impacted by the above change. I want to understand why having 500 distinct entries in a stripe of 10000 rows is not enough to pay for the "cost" of the dictionary.

While looking at the 2^16 limitation I found an error in dictionary cost computation. I fixed one part of it, but there seems to be more.

etseidl (Contributor) commented May 11, 2023

> I'm not sure why CPU-Spark supports better encoding for ORC and not Parquet

Because parquet-mr has a 1 MB limit on the dictionary. The full dictionary for column_a.key_value.value is 32 MB. If you bump up the limit (option("parquet.dictionary.page.size", "40000000")), then Spark winds up with a 15.8 MB file, comparable in size to what cudf is producing.
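
For anyone who wants to try that, the workaround looks roughly like this in PySpark (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# raise parquet-mr's dictionary page size limit before writing Parquet
spark.read.orc("INPUT_ORC") \
    .write \
    .option("parquet.dictionary.page.size", "40000000") \
    .parquet("CPU_PARQUET_BIG_DICT_OUTPUT")  # placeholder output path
```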

GregoryKimball added the 2 - In Progress, libcudf, and cuIO labels and removed the Needs Triage label on Jun 7, 2023
GregoryKimball moved this to In progress in libcudf on Jun 7, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2023
…3580)

Issue #13326, #10495

This PR reimplements the creation of stripe dictionaries in the ORC writer to eliminate row group size limitations.
The new implementation uses `cuco::static_map` in a way that's very similar to the Parquet writer.

The PR brings large performance gains because the per-column, per-stripe sorting that invoked hundreds of thrust calls is now removed.
It was also verified that the original row group size limit (2^16) for dictionary encoding is removed, allowing dictionaries to be applied to large lists of strings.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13580
GregoryKimball (Contributor) commented

Closed by #13580
