Implement chunked ORC reader #10723
Signed-off-by: Nghia Truong <[email protected]>
f31fe5c to 67ec36f
# Conflicts:
#	docs/additional-functionality/advanced_configs.md
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
Looks good to me
build
I ran an NDS benchmark at scale
and
Benchmark
With ORC input format:
With Parquet input format:
Before this PR, the benchmark failed with many exceptions because the plugin could not read large files. Note that the chunked ORC reader can be optimized further to improve performance; that optimization is already planned as future work.
This PR implements the chunked ORC reader, which reads ORC files iteratively, one chunk at a time. This makes it possible to read very large files (containing more than 2B rows) while keeping device memory usage within a fixed limit.
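To illustrate the general idea of chunked reading, here is a minimal, self-contained Python sketch. It is not the plugin's or cuDF's API: `Stripe`, `read_chunks`, and `chunk_limit_bytes` are hypothetical names, and the sizes are toy values. It only shows how a reader can hand back one bounded chunk at a time so that peak memory depends on the limit rather than on the file size.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Stripe:
    """Hypothetical stand-in for one ORC stripe: row count and decoded size."""
    num_rows: int
    num_bytes: int


def read_chunks(stripes: List[Stripe], chunk_limit_bytes: int) -> Iterator[List[Stripe]]:
    """Yield groups of stripes whose combined size stays within the limit.

    A single stripe larger than the limit is emitted alone; a real reader
    could subdivide it further, which this sketch does not attempt.
    """
    if chunk_limit_bytes <= 0:
        raise ValueError("chunk_limit_bytes must be positive")
    current: List[Stripe] = []
    current_bytes = 0
    for stripe in stripes:
        # Close the current chunk before it would exceed the limit.
        if current and current_bytes + stripe.num_bytes > chunk_limit_bytes:
            yield current
            current, current_bytes = [], 0
        current.append(stripe)
        current_bytes += stripe.num_bytes
    if current:
        yield current


# Toy input: 10 stripes of 300M rows each (3B rows in total).
stripes = [Stripe(num_rows=300_000_000, num_bytes=400) for _ in range(10)]

total_rows = 0
peak_bytes = 0
num_chunks = 0
# Only one chunk is held at a time, so peak usage is bounded by the limit.
for chunk in read_chunks(stripes, chunk_limit_bytes=1000):
    num_chunks += 1
    total_rows += sum(s.num_rows for s in chunk)
    peak_bytes = max(peak_bytes, sum(s.num_bytes for s in chunk))
```

In this sketch the total row count exceeds 2B while no single chunk exceeds the configured limit, which is the property the description above refers to. How the actual reader splits stripes and decoded data is not shown here.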
Depends on:
Closes #7131.