Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update doc to indicate ORC and Parquet zstd read support [skip ci] #6589

Merged
merged 2 commits into from
Sep 21, 2022

Conversation

sameerz
Copy link
Collaborator

@sameerz sameerz commented Sep 21, 2022

Update compatibility.md to note read and write support for zstd for ORC and Parquet.

Closes issue #6570

@sameerz sameerz added the documentation Improvements or additions to documentation label Sep 21, 2022
@sameerz sameerz added this to the Sep 5 - Sep 23 milestone Sep 21, 2022
@sameerz sameerz requested review from tgravescs and viadea September 21, 2022 00:19
@sameerz sameerz self-assigned this Sep 21, 2022
@sameerz sameerz changed the title Update doc to indicate ORC and Parquet zstd write support Update doc to indicate ORC and Parquet zstd write support [skip ci] Sep 21, 2022
and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the
CPU when reading an unsupported compression format, and will error out in that case.
The plugin supports reading `uncompressed`, `snappy`, `zlib` and `zstd` ORC files and writing
`uncompressed`, `snappy` and `zstd` ORC files. At this point, the plugin does not have the ability
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize writing was supported but I see the cudf issue is resolved to support it, but I don't see that we have any tests for it. Perhaps we should file an issue to enable tests? Also note we still have #3037 open so we can probably close it.

Copy link
Collaborator

@tgravescs tgravescs Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is covered by https://github.com/NVIDIA/spark-rapids/pull/6362/files which sounds like still discussion if we really be enabled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, there are some unanswered questions relating to output file sizes when using the ZSTD compression support, as the zstd block size used by the GPU doesn't match the default block size used by the CPU when writing Parquet+zstd.

Copy link
Collaborator Author

@sameerz sameerz Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to document that writing zstd for Parquet and ORC is experimental, and can be enabled by setting LIBCUDF_NVCOMP_POLICY=ALWAYS ?

Setting to draft until PR #6362 is merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I would almost just do the read support in this PR and the write support should be documented with the other PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @tgravescs, I see no reason to speculate in this PR when we can (and should) update the docs in the PR that adds the functionality in question. Only reason we're adding docs in this PR is that the doc updates were missed when ZSTD read support tests went in via #5898.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to only indicate zstd read support.

@sameerz sameerz marked this pull request as draft September 21, 2022 15:10
Signed-off-by: Sameer Raheja <[email protected]>
@sameerz sameerz changed the title Update doc to indicate ORC and Parquet zstd write support [skip ci] Update doc to indicate ORC and Parquet zstd read support [skip ci] Sep 21, 2022
@sameerz sameerz marked this pull request as ready for review September 21, 2022 15:40
@sameerz sameerz linked an issue Sep 21, 2022 that may be closed by this pull request
@jlowe
Copy link
Member

jlowe commented Sep 21, 2022

build

@jbrennan333
Copy link
Contributor

Note that I have a WIP PR up to add support for parquet/orc write compression in spark-rapids, but because the feature is still experimental, we may want to hold off on enabling it. See this comment.

@sameerz sameerz merged commit 6e59830 into NVIDIA:branch-22.10 Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] zstd not documented as supported with parquet and orc
4 participants