-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ORC chunked reader #15094
Implement ORC chunked reader #15094
Conversation
c0ed5c3
to
c75daeb
Compare
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
# Conflicts: # cpp/src/io/orc/aggregate_orc_metadata.hpp # cpp/src/io/orc/reader_impl.cu # cpp/src/io/orc/reader_impl.hpp # cpp/src/io/orc/reader_impl_preprocess.cu
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is going to take me a little while to traverse. :]
// NVBENCH_BENCH_TYPES(BM_orc_read_data, | ||
// NVBENCH_TYPE_AXES(d_type_list, | ||
// nvbench::enum_type_list<cudf::io::io_type::DEVICE_BUFFER>)) | ||
// .set_name("orc_read_decode") | ||
// .set_type_axes_names({"data_type", "io"}) | ||
// .set_min_samples(4) | ||
// .add_int64_axis("cardinality", {0, 1000}) | ||
// .add_int64_axis("run_length", {1, 32}); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend to remove this, or un-comment it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I still need to clean up this and much more 😄
cpp/include/cudf/io/detail/orc.hpp
Outdated
@@ -62,7 +67,7 @@ class reader { | |||
/** | |||
* @brief Destructor explicitly declared to avoid inlining in header | |||
*/ | |||
~reader(); | |||
virtual ~reader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cpp/include/cudf/io/detail/orc.hpp
Outdated
* | ||
* @param output_size_limit Limit on total number of bytes to be returned per read, | ||
* or `0` if there is no limit | ||
* @param data_read_limit Limit on memory usage for the purposes of decompression and processing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is in bytes, data_read_limit_bytes
might be a more descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doxygen should clarify its name thus I'm inclined to keep the name shorter. I even prefer read_limit
but that seems too short and less descriptive so maybe data_read_limit
is better.
cpp/include/cudf/io/detail/orc.hpp
Outdated
* @param stream CUDA stream used for device memory operations and kernel launches | ||
* @param mr Device memory resource to use for device memory allocation | ||
*/ | ||
explicit chunked_reader(std::size_t output_size_limit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly question: Why is this ctor explicit
? This isn't a single argument ctor, so I don't know if we run the risk of accidental conversion into a chunked_reader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take my question back. From scripture, this might be to prevent construction like:
chunked_reader const my_reader = { 10, 20, std::vector{...}, orc_reader_options{...} };
Do I have this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, the explicit
keyword here is just to prevent such construction. Technically we would not have any issue with it, just the matter of code cleaness.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another set of small suggestions
Looks pretty good overall
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally finished a full pass
bunch of small questions/suggestions, nothing really blocking
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
/ok to test |
/merge |
This adds JNI implementation for chunked ORC reader, allowing to read ORC files by an iterative manner. Depends on: * #15094 Closes #12228. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #15446
This implements ORC chunked reader, to support reading ORC such that:
read_chunk()
, and has limited size which stays within a givenoutput_limit
parameter.data_read_limit
parameter, allowing to read very large ORC files without OOM.2^31
rows).Depends on:
segmented_row_bit_count
for computing row sizes by segments of rows #15169hostdevice_vector
and add more APIs #15252Partially contribute to #12228.
Benchmarks
Due to some small optimizations in ORC reader, reading ORC files all-at-once (reading the entire file into just one output table) can be a little bit faster. For example, with the benchmark
orc_read_io_compression
:When memory is limited, chunked read can help avoiding OOM but with some sort of performance trade-off. For example, for reading a table of size 500MB from file using 64MB output limits and 640 MB data read limit:
And if memory is too limited, chunked read with 8MB output limit/80MB data read limit: