Implement ORC chunked reader #15094
Changes from 118 commits
```diff
@@ -41,10 +41,15 @@ namespace orc::detail {
  * @brief Class to read ORC dataset data into columns.
  */
 class reader {
- private:
+ protected:
   class impl;
   std::unique_ptr<impl> _impl;
 
+  /**
+   * @brief Default constructor, needed for subclassing.
+   */
+  reader();
+
  public:
   /**
    * @brief Constructor from an array of datasources
```
```diff
@@ -62,7 +67,7 @@ class reader {
   /**
    * @brief Destructor explicitly declared to avoid inlining in header
    */
-  ~reader();
+  virtual ~reader();
 
   /**
    * @brief Reads the entire dataset.
```
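The diff above follows the standard recipe for making a pimpl class subclassable: the `impl` slot becomes `protected`, a protected default constructor lets a derived class install its own implementation, and the destructor becomes `virtual` so deletion through a base pointer is well-defined. A minimal self-contained sketch of that pattern — the names `base_reader` and `derived_reader` are illustrative, not cudf's:

```cpp
#include <memory>
#include <string>

// Base class using the pimpl idiom. The impl members are `protected` so a
// subclass can reuse the same _impl slot; the destructor is virtual so
// deleting a derived object through a base pointer is well-defined.
class base_reader {
 protected:
  struct impl {                    // in real pimpl code this lives in a .cpp file
    std::string state{"base"};
  };
  std::unique_ptr<impl> _impl;

  // Default constructor, needed for subclassing: it lets a derived class
  // install its own impl instead of the base one.
  base_reader() = default;

 public:
  explicit base_reader(std::string s) : _impl{std::make_unique<impl>()} {
    _impl->state = std::move(s);
  }
  virtual ~base_reader() = default;

  std::string state() const { return _impl ? _impl->state : std::string{"<empty>"}; }
};

// Subclass that supplies its own impl via the protected default constructor.
class derived_reader : private base_reader {
 public:
  derived_reader() {
    _impl = std::make_unique<impl>();
    _impl->state = "derived";
  }
  using base_reader::state;  // re-expose only the API we want
};
```

Here `impl` is defined inline so the sketch compiles on its own; in the real header only a forward declaration appears, which is exactly why the destructor must be defined out of line.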
````diff
@@ -73,6 +78,67 @@ class reader {
   table_with_metadata read(orc_reader_options const& options);
 };
 
+/**
+ * @brief The reader class that supports iterative reading of a given file.
+ *
+ * This class intentionally subclasses the `reader` class with private inheritance to hide the
+ * `reader::read()` API. As such, only chunked reading APIs are supported.
+ */
+class chunked_reader : private reader {
+ public:
+  /**
+   * @brief Constructor from size limits and an array of data sources with reader options.
+   *
+   * The typical usage should be similar to this:
+   * ```
+   * do {
+   *   auto const chunk = reader.read_chunk();
+   *   // Process chunk
+   * } while (reader.has_next());
+   * ```
+   *
+   * If `output_size_limit == 0` (i.e., no reading limit), a call to `read_chunk()` will read the
+   * whole file and return a table containing all rows.
+   *
+   * TODO: data read limit
+   *
+   * @param output_size_limit Limit on total number of bytes to be returned per read,
+   * or `0` if there is no limit
+   * @param data_read_limit Limit on memory usage for the purposes of decompression and processing
+   * of input, or `0` if there is no limit
+   * @param sources Input `datasource` objects to read the dataset from
+   * @param options Settings for controlling reading behavior
+   * @param stream CUDA stream used for device memory operations and kernel launches
+   * @param mr Device memory resource to use for device memory allocation
+   */
+  explicit chunked_reader(std::size_t output_size_limit,
+                          std::size_t data_read_limit,
+                          std::vector<std::unique_ptr<cudf::io::datasource>>&& sources,
+                          orc_reader_options const& options,
+                          rmm::cuda_stream_view stream,
+                          rmm::mr::device_memory_resource* mr);
````

> **Review thread on `data_read_limit`:** "If this is in bytes, …" — **Author:** "The doxygen should clarify its name, thus I'm inclined to keep the name shorter. I even prefer …"

> **Review thread on the `explicit` constructor:** "Silly question: why is this ctor `explicit`? … I take my question back. From scripture, this might be to prevent construction like `chunked_reader const my_reader = { 10, 20, std::vector{...}, orc_reader_options{...} };` — do I have this right?" — **Author:** "Exactly, the `explicit` …"
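The review exchange above hinges on copy-list-initialization: an `explicit` constructor can still be invoked with braces directly, but the `= { ... }` form is rejected at compile time. A toy sketch (the `widget` class is a stand-in, not cudf code):

```cpp
#include <cstddef>

// Toy class standing in for chunked_reader; only the ctor semantics matter.
struct widget {
  explicit widget(std::size_t limit) : limit_{limit} {}
  std::size_t limit_;
};

// Direct-list-initialization works with an explicit constructor:
//   widget w{10};
// Copy-list-initialization does not compile:
//   widget w = {10};   // error: chosen constructor is explicit
```

This matches the reviewer's reading: marking the constructor `explicit` forbids the `chunked_reader const my_reader = { ... };` spelling while leaving `chunked_reader my_reader{ ... };` available.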
```diff
+
+  /**
+   * @brief Destructor explicitly declared to avoid inlining in header.
+   *
+   * Since the declaration of the internal `_impl` object does not exist in this header, this
+   * destructor needs to be defined in a separate source file which can access that object's
+   * declaration.
+   */
+  ~chunked_reader();
+
+  /**
+   * @copydoc cudf::io::chunked_orc_reader::has_next
+   */
+  [[nodiscard]] bool has_next() const;
+
+  /**
+   * @copydoc cudf::io::chunked_orc_reader::read_chunk
+   */
+  [[nodiscard]] table_with_metadata read_chunk() const;
+};
+
 /**
  * @brief Class to write ORC dataset data into columns.
  */
```

> **Review thread on the `@copydoc` for `has_next`:** "Since …" — **Author:** "Thanks for reminding me of that. I'll move the docs to …"
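The `do { read_chunk(); } while (has_next());` pattern from the doxygen comment can be exercised against a mock. The `toy_chunked_reader` below is an illustrative stand-in for the cudf class, yielding fixed-size chunks over a vector and treating a limit of `0` as "return everything in one chunk", mirroring the documented `output_size_limit == 0` behavior:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for orc::detail::chunked_reader: yields the input in fixed-size
// "chunks" and reports whether more rows remain.
class toy_chunked_reader {
  std::vector<int> data_;
  std::size_t pos_{0};
  std::size_t chunk_rows_;

 public:
  toy_chunked_reader(std::vector<int> data, std::size_t chunk_rows)
    : data_{std::move(data)},
      // chunk_rows == 0 means "no limit": one chunk holding all rows.
      chunk_rows_{chunk_rows == 0 ? data_.size() : chunk_rows} {}

  bool has_next() const { return pos_ < data_.size(); }

  std::vector<int> read_chunk() {
    auto const end = std::min(pos_ + chunk_rows_, data_.size());
    std::vector<int> chunk(data_.begin() + pos_, data_.begin() + end);
    pos_ = end;
    return chunk;
  }
};

// The consumption loop exactly as documented in the header's doxygen:
inline std::size_t count_chunks(toy_chunked_reader& reader) {
  std::size_t n = 0;
  do {
    auto const chunk = reader.read_chunk();
    ++n;  // "process" the chunk
  } while (reader.has_next());
  return n;
}
```

Note the loop is a `do`/`while`, so `read_chunk()` is always called at least once; with an empty source it simply returns an empty chunk, which is consistent with the documented usage.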
```diff
@@ -133,5 +199,6 @@ class writer {
   */
  void skip_close();
 };
+
 } // namespace orc::detail
 } // namespace cudf::io
```
> **Reviewer:** Do we intend to remove this, or un-comment it?
> **Author:** Sorry, I still need to clean up this and much more 😄