
Enable ZSTD compression in ORC and Parquet writers #11551

Merged
merged 72 commits into from
Sep 12, 2022

Conversation

vuule
Contributor

@vuule vuule commented Aug 17, 2022

Description

Closes #9058, #9056

Expands the nvCOMP adapter to include ZSTD compression.
Adds a centralized nvCOMP policy check, `is_compression_enabled`.
Adds a centralized nvCOMP alignment utility, `compress_input_alignment_bits`.
Adds a centralized nvCOMP utility, `batched_compress_max_allowed_chunk_size`, that returns the maximum supported compression chunk size.
Encoded ORC row groups are aligned based on compression requirements.
Encoded Parquet pages are aligned based on compression requirements.
The Parquet fragment size now scales with the page size, to better fit the default page size with ZSTD compression.
Small refactoring around `decompress_status` for improved type safety and, hopefully, better naming.
Replaces `snappy_compress` in the Parquet writer with the nvCOMP adapter call.
Vectors of `compression_result` are initialized before compression to avoid random chunk skipping caused by uninitialized memory.
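The centralized policy check and alignment utility described above might look roughly like the sketch below. This is illustrative only: the function names follow the PR description, but the signatures, the codec set, and the exact alignment rules are assumptions, not libcudf's actual API.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for libcudf's compression_type enum.
enum class compression_type { SNAPPY, ZSTD, DEFLATE };

// Centralized policy: the STABLE policy enables only production-ready
// codecs; ALWAYS also enables codecs still marked experimental in nvCOMP.
bool is_compression_enabled(compression_type t)
{
  char const* env          = std::getenv("LIBCUDF_NVCOMP_POLICY");
  std::string const policy = env != nullptr ? env : "STABLE";
  switch (t) {
    case compression_type::SNAPPY: return true;  // stable codec
    case compression_type::ZSTD:
    case compression_type::DEFLATE: return policy == "ALWAYS";  // experimental
  }
  return false;
}

// Centralized alignment utility: round a chunk size up to the alignment
// the codec requires, expressed in bits (e.g. 2 bits -> 4-byte alignment).
std::size_t align_input_size(std::size_t size, std::size_t alignment_bits)
{
  std::size_t const alignment = std::size_t{1} << alignment_bits;
  return (size + alignment - 1) & ~(alignment - 1);
}
```

Encoded ORC row groups and Parquet pages would then be padded to `align_input_size(...)` boundaries before being handed to the batched nvCOMP compression call.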

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Aug 17, 2022
@vuule vuule self-assigned this Aug 17, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 17, 2022
@codecov

codecov bot commented Aug 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@dca285b). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 27ce95e differs from pull request most recent head 1f60695. Consider uploading reports for the commit 1f60695 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11551   +/-   ##
===============================================
  Coverage                ?   86.42%           
===============================================
  Files                   ?      145           
  Lines                   ?    23009           
  Branches                ?        0           
===============================================
  Hits                    ?    19885           
  Misses                  ?     3124           
  Partials                ?        0           


@github-actions github-actions bot added the CMake CMake build issue label Aug 17, 2022
@github-actions github-actions bot added the Java Affects Java cuDF API. label Aug 18, 2022
@vuule vuule changed the base branch from branch-22.10 to branch-22.08 August 18, 2022 00:06
@vuule vuule changed the base branch from branch-22.08 to branch-22.10 August 18, 2022 00:06
@github-actions github-actions bot removed the Java Affects Java cuDF API. label Aug 18, 2022
@vuule vuule changed the title Enable ZSTD compression in ORC writer Enable ZSTD compression in ORC and Parquet writers Aug 18, 2022
Contributor

@jbrennan333 jbrennan333 left a comment


A few minor comments. Also need to add ZSTD (and it looks like a few others) to `CompressionType.java`. It should match the `compression_type` enum from `types.hpp`.

Review comments (resolved): cpp/cmake/thirdparty/get_nvcomp.cmake, cpp/src/io/comp/nvcomp_adapter.cpp
Member

@jlowe jlowe left a comment


Java approval

Contributor

@jbrennan333 jbrennan333 left a comment


+1 this looks good to me. Great work cleaning this up!

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

@vuule
Contributor Author

vuule commented Sep 6, 2022

+1 this looks good to me. Great work cleaning this up!

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

It will actually fail with "unsupported compression type". This is the behavior with DEFLATE (ZLIB) as well. I'm rethinking this approach, as users already "opt-in" to the new feature by selecting the ZSTD compression when writing. Any preference on your end?

@jbrennan333
Contributor

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

It will actually fail with "unsupported compression type". This is the behavior with DEFLATE (ZLIB) as well. I'm rethinking this approach, as users already "opt-in" to the new feature by selecting the ZSTD compression when writing. Any preference on your end?

Currently, the spark-rapids plugin does not use the GPU for writing Parquet/ORC if the compression type is ZSTD. Once we enable that, any Spark job that selects ZSTD as the compressor will fail with "unsupported compression type" unless `LIBCUDF_NVCOMP_POLICY=ALWAYS` is defined.

If we change this to silently write uncompressed data, the job will succeed, but the data will be uncompressed. This seems worse, because there would be no indication that anything was wrong (other than the output size).

So I think the failure is better. The question for the spark-rapids plugin is whether we wait for this to be stable before enabling it in the plugin, or document the need to define `LIBCUDF_NVCOMP_POLICY=ALWAYS` when using ZSTD compression.
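The "fail loudly" behavior agreed on in this thread can be sketched as below. This is an editorial illustration, not libcudf's actual code path: the function name and the exception type are assumptions; only the environment-variable name and the error message come from the discussion above.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Illustrative: reject an experimental codec unless the user has opted in
// via LIBCUDF_NVCOMP_POLICY=ALWAYS, rather than silently falling back to
// uncompressed output, so the caller gets a clear signal something is off.
void check_zstd_enabled()
{
  char const* env     = std::getenv("LIBCUDF_NVCOMP_POLICY");
  bool const opted_in = env != nullptr && std::string{env} == "ALWAYS";
  if (!opted_in) { throw std::logic_error("unsupported compression type"); }
}
```

A Spark job selecting ZSTD would then fail fast at write time with this error, instead of producing an unexpectedly large, uncompressed file.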

@vuule vuule requested a review from hyperbolic2346 September 6, 2022 21:04
Contributor

@mroeschke mroeschke left a comment


Minor not-blocking comments for the Python code. LGTM

@vuule vuule added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 9, 2022
@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Sep 9, 2022
@vuule
Contributor Author

vuule commented Sep 12, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 578e65f into rapidsai:branch-22.10 Sep 12, 2022
rapids-bot bot pushed a commit that referenced this pull request Sep 13, 2022
The recently merged PR (#11551) did not include the `<optional>` header, which can cause compile errors on some systems (in particular, CUDA 11.7 + gcc 11.2):
```
error: ‘std::optional’ has not been declared
error: ‘optional’ in namespace ‘std’ does not name a template type
```

This PR adds that missing header to fix the compile issue.
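The fix is to include the header directly instead of relying on a transitive include, which some standard-library/compiler combinations do not provide. A minimal sketch (the function here is hypothetical, only the `#include` is the actual fix):

```cpp
// Including <optional> directly avoids "'std::optional' has not been
// declared" on toolchains that do not pull it in transitively.
#include <optional>

// Hypothetical example use of std::optional: return the first even
// argument, or an empty optional if neither is even.
std::optional<int> first_even(int a, int b)
{
  if (a % 2 == 0) { return a; }
  if (b % 2 == 0) { return b; }
  return std::nullopt;
}
```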

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #11697
@vuule vuule deleted the fea-nvcomp-zstd-comp branch September 19, 2022 20:17
rapids-bot bot pushed a commit that referenced this pull request Oct 20, 2022
This PR fixes an error that can occur when very small page sizes are used when writing Parquet files. #11551 changed the page fragment size from a fixed 5000 rows to a value scaled by the requested max page size. For small page sizes, the number of fragments to process can exceed 64k. The number of fragments is used as the `y` grid dimension when launching `gpuInitPageFragments`; when it exceeds 64k the kernel fails to launch, ultimately leading to an invalid memory access.
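The arithmetic behind the failure can be illustrated as follows. CUDA caps a kernel grid's `y` dimension at 65535, so a launch that uses the fragment count directly as `gridDim.y` fails once the count exceeds that limit. The helper names below are illustrative, not the actual cuDF functions:

```cpp
#include <cstddef>

// CUDA limit on gridDim.y (and gridDim.z) for all current architectures.
constexpr std::size_t max_grid_dim_y = 65535;

// Number of page fragments for a table: one fragment per fragment_size rows.
std::size_t num_fragments(std::size_t num_rows, std::size_t fragment_size)
{
  return (num_rows + fragment_size - 1) / fragment_size;
}

// With a tiny max page size the fragment size shrinks, the fragment count
// can exceed max_grid_dim_y, and the work must be split across launches.
std::size_t num_launches(std::size_t fragments)
{
  return (fragments + max_grid_dim_y - 1) / max_grid_dim_y;
}
```

For example, a million rows with a fragment size of 10 yields 100,000 fragments, which no single launch can cover in `gridDim.y`.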

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #11869
@xingwenqiang

Hi @vuule,
I see that nvCOMP supports DEFLATE GPU encoding, and ORC supports ZLIB compression. Will Parquet support ZLIB GPU compression in the future?

Successfully merging this pull request may close these issues.

[FEA] ZStandard codec support for ORC writer