[FEA] Support ZSTD compression with Parquet and Orc #3037
Need to check into CUDF support.
cudf does not support ZSTD. The cudf team is in the process of migrating to nvcomp for handling codecs. nvcomp does not yet support ZSTD, but it is on their radar to investigate.
Depends on rapidsai/cudf#9056
#6362 is the WIP PR to add support for Parquet/ORC write compression in spark-rapids, but because the cudf feature is still experimental, we may want to hold off on enabling it. See #6362 (comment).
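For context, a minimal sketch of how the write side could be exercised once enabled, assuming the standard Spark codec configs apply and that the experimental cudf codec is opted into by forwarding `LIBCUDF_NVCOMP_POLICY` to the executors (the configs shown are illustrative, not taken from #6362):

```python
from pyspark.sql import SparkSession

# Minimal sketch: request zstd output through the standard Spark codec
# configs and opt into the experimental cudf codec via an executor env var.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .config("spark.sql.orc.compression.codec", "zstd")
    # Assumption: the nvcomp zstd path must be enabled explicitly on each
    # executor while it is still marked experimental in cudf.
    .config("spark.executorEnv.LIBCUDF_NVCOMP_POLICY", "ALWAYS")
    .getOrCreate()
)

# Write a small table with the configured codec to sanity-check the setup.
df = spark.range(1000)
df.write.mode("overwrite").parquet("/tmp/zstd_test_parquet")
df.write.mode("overwrite").orc("/tmp/zstd_test_orc")
```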
We discussed this in our stand-up yesterday. The consensus was that we should enable this early in the 22.12 branch and then do more rigorous testing to ensure it is stable. Among the things we need to verify:
Initial testing on desktop.
For my scale 100 (desktop) data gen tests, I regenerated the raw data using a parallel setting of 16 to produce larger source files.
These are the overall sizes for data converted from raw CSV:
Note that most of the reduction in size comes from the conversion from raw CSV data to ORC/Parquet.
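For reference, the conversion step itself looks roughly like this in PySpark (a sketch only; the paths and the pipe delimiter for the raw data are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths for one table of the scale 100 data set.
raw = spark.read.csv("/data/nds_sf100/raw_csv/store_sales", sep="|")

# Write the same data as zstd-compressed Parquet and ORC.
raw.write.mode("overwrite").option("compression", "zstd").parquet(
    "/data/nds_sf100/parquet_zstd/store_sales"
)
raw.write.mode("overwrite").option("compression", "zstd").orc(
    "/data/nds_sf100/orc_zstd/store_sales"
)
```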
When comparing output of power-runs between CPU and GPU at scale 100, I consistently see a difference in query79, regardless of compression (none, zstd, snappy). So I don't think it is a zstd issue, but it might need more investigation.
The q79 diff looks like this:
Power-run output is pretty small, but I am noticing some differences in the size of output for Parquet for CPU vs GPU:
Most of the difference seems to be in the query98 output:
Inspecting one of the partitions for query98 shows that in the GPU output, two of the columns are uncompressed:
For the CPU output, these columns are compressed, although for i_category the compressed size appears to be larger:
The line that is different is the
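As an aside, one convenient way to check per-column compression in a Parquet footer is pyarrow (this is just an illustration of the kind of inspection described above, not necessarily the tool used here; the file name is hypothetical):

```python
import pyarrow.parquet as pq

# Hypothetical partition file from the query98 output.
meta = pq.ParquetFile("part-00000-query98.parquet").metadata

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for i in range(row_group.num_columns):
        chunk = row_group.column(i)
        # Columns the writer left raw report compression == "UNCOMPRESSED".
        print(
            chunk.path_in_schema,
            chunk.compression,
            chunk.total_compressed_size,
            chunk.total_uncompressed_size,
        )
```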
Will want this fix in CUDF: rapidsai/cudf#11869
Testing on spark2a cluster at scale 3TB.
Spark2a (A100) data conversion sizes (via hadoop fs -count):
CPU raw to CPU PARQUET NONE compression ratio: 2.66
GPU raw to GPU PARQUET NONE compression ratio: 3.24
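For reference, these ratios follow directly from the `hadoop fs -count` output; a small sketch (paths are hypothetical; the third field of the output is the total content size in bytes):

```python
import subprocess

def hdfs_bytes(path: str) -> int:
    """Total content size of an HDFS path, via `hadoop fs -count`."""
    # Output format: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    out = subprocess.check_output(["hadoop", "fs", "-count", path], text=True)
    return int(out.split()[2])

# Hypothetical dataset locations.
raw = hdfs_bytes("/data/nds_sf3000/raw_csv")
parquet_none = hdfs_bytes("/data/nds_sf3000/parquet_none")

print(f"raw to PARQUET NONE compression ratio: {raw / parquet_none:.2f}")
```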
Comparing sizes of 3TB power-run outputs:
There is no significant difference in size for GPU and CPU when you run with GPU generated data vs CPU generated data. In the runs using zstd input data generated by the CPU, the GPU output data is about 1.44x larger. Similar to my scale 100 results, most of the difference is in query98 results.
I think this may be due to the UNCOMPRESSED columns seen in the scale 100 results above.
Spark2a (A100) data conversion sizes for ORC (via hadoop fs -count):
CPU raw to CPU ORC NONE compression ratio: 2.67
GPU raw to GPU ORC NONE compression ratio: 3.71
As a sanity check, I ran the NDS 2.0 power run on a Dataproc cluster with P4 GPUs, with zstd Parquet input data, and wrote the output in Parquet/zstd format.
Comparing sizes of 3TB power-run outputs for ORC:
There is no significant difference in size for GPU and CPU when you run with GPU generated data vs CPU generated data. In the runs using zstd input data generated by the CPU, the GPU output data is about 1.76x larger.
The sizes for GPU are consistently larger. Query98 is again the biggest difference, at over 2.37x the CPU version.
ORC NDS2.0 SCALE 3000 Data Conversion using CPU
ORC NDS2.0 SCALE 3000 Data Conversion using GPU
Total speedup for GPU was about 1.37x. This is just a single run, and I have not spent any significant time trying to optimize it. You can see that for the largest table, store_sales, the speedup was nearly 2x. But on the next largest table, catalog_sales, the GPU was about 1.8x slower.
PARQUET NDS2.0 SCALE 3000 Data Conversion using CPU
PARQUET NDS2.0 SCALE 3000 Data Conversion using GPU
Total overall speedup for GPU was about 1.30x. The speedup for the largest file (store_sales) was about 1.8x. The next largest file (catalog_sales) was about 1.9x slower (a speedup of 0.52).
I collected the size/time info for converting to Parquet/zstd and ORC/zstd at 3TB scale.
Overall for NDS 2.0 data conversion from raw (CSV) at 3TB scale on GPU,
When we convert from CSV to Parquet and ORC, the GPU produces smaller files (a higher compression ratio) than the CPU when compression is set to NONE:
When we compare the Parquet/ORC files with no compression to those with ZSTD compression, the CPU has a better compression ratio; in particular, ZSTD gives us no size benefit on the GPU for these ORC files compared to using no compression:
The overall compression ratio from CSV to Parquet/ORC zstd is about the same at scale 100 and scale 3000:
From rapidsai/cudf#12059: NVCOMP zstd compression was added in 22.10 but marked experimental, meaning you have to define the environment variable `LIBCUDF_NVCOMP_POLICY=ALWAYS` to enable it. After completing validation testing using the spark-rapids plugin as documented in NVIDIA/spark-rapids#3037, we believe that we can now change the zstd compression status to stable, which will enable it in cudf by default (`LIBCUDF_NVCOMP_POLICY=STABLE` is the default value).

Authors:
- Jim Brennan (https://github.com/jbrennan333)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- David Wendt (https://github.com/davidwendt)
- Vukasin Milovanovic (https://github.com/vuule)
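For illustration, exercising the codec directly from Python cudf looks roughly like this (a sketch, assuming cudf's `to_parquet`/`to_orc` accept a `ZSTD` compression argument and that the env var is read when the writer runs; file names are hypothetical):

```python
import os

# While the feature was experimental, the nvcomp zstd path had to be
# opted into explicitly; set this before cudf/libcudf is loaded.
os.environ["LIBCUDF_NVCOMP_POLICY"] = "ALWAYS"

import cudf

df = cudf.DataFrame({"key": list(range(1000)), "value": ["abc"] * 1000})

# Write zstd-compressed Parquet and ORC on the GPU.
df.to_parquet("test_zstd.parquet", compression="ZSTD")
df.to_orc("test_zstd.orc", compression="ZSTD")
```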
Is your feature request related to a problem? Please describe.
Feature request from a user asking for zstd-compressed data support for Parquet with spark-rapids.