-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update doc to indicate ORC and Parquet zstd read support [skip ci] #6589
Update doc to indicate ORC and Parquet zstd read support [skip ci] #6589
Conversation
Signed-off-by: Sameer Raheja <[email protected]>
docs/compatibility.md
Outdated
and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the | ||
CPU when reading an unsupported compression format, and will error out in that case. | ||
The plugin supports reading `uncompressed`, `snappy`, `zlib` and `zstd` ORC files and writing | ||
`uncompressed`, `snappy` and `zstd` ORC files. At this point, the plugin does not have the ability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize writing was supported but I see the cudf issue is resolved to support it, but I don't see that we have any tests for it. Perhaps we should file an issue to enable tests? Also note we still have #3037 open so we can probably close it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is covered by https://github.com/NVIDIA/spark-rapids/pull/6362/files which sounds like still discussion if we really be enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, there are some unanswered questions relating to output file sizes when using the ZSTD compression support, as the zstd block size used by the GPU doesn't match the default block size used by the CPU when writing Parquet+zstd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to document that writing zstd for Parquet and ORC is experimental, and can be enabled by setting LIBCUDF_NVCOMP_POLICY=ALWAYS ?
Setting to draft until PR #6362 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally I would almost just do the read support in this PR and the write support should be documented with the other PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @tgravescs, I see no reason to speculate in this PR when we can (and should) update the docs in the PR that adds the functionality in question. Only reason we're adding docs in this PR is that the doc updates were missed when ZSTD read support tests went in via #5898.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to only indicate zstd read support.
Signed-off-by: Sameer Raheja <[email protected]>
build |
Note that I have a WIP PR up to add support for parquet/orc write compression in spark-rapids, but because the feature is still experimental, we may want to hold off on enabling it. See this comment. |
Update compatibility.md to note read and write support for zstd for ORC and Parquet.
Closes issue #6570