Refactor Parquet reader #12046
…taining a mix of nested and non-nested types would result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types do not end up getting visited by the kernel.
…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]> # Conflicts: # cpp/src/io/parquet/page_data.cu # cpp/src/io/parquet/reader_impl.cu # cpp/src/io/parquet/reader_impl.hpp
CMake LGTM
rerun tests
Some nits.
Looks good!
Very hard to review with all moved code, but (AFAICT) looks great!
Got some small suggestions; some are maybe not the kind of code improvements that this PR aims for.
size_t const start_row;  // TODO source index
size_type const source_index;
Any idea why the TODO is still here?
Also, I don't think these members should be `const`; it only does harm by preventing moves.
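To illustrate the point about `const` members, here is a minimal sketch. The struct names and the `payload` and `label` fields are hypothetical stand-ins, not the cuDF definitions. It shows two effects: `const` members delete the implicit assignment operators, and a `const` member of class type is copied rather than moved even when the enclosing object is move-constructed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the struct under review; not the cuDF types.
struct const_members {
  std::size_t const start_row;
  int const source_index;
  std::vector<int> payload;
};

struct plain_members {
  std::size_t start_row;
  int source_index;
  std::vector<int> payload;
};

// const data members delete the implicit copy/move assignment operators.
static_assert(!std::is_move_assignable<const_members>::value,
              "const members block move assignment");
static_assert(std::is_move_assignable<plain_members>::value,
              "non-const members keep move assignment");

// Move construction still compiles, but const members of class type
// are copied, because a const object cannot be moved from.
struct const_label {
  std::string const label;
};

inline bool source_survives_move()
{
  const_label a{"row-group-0"};
  const_label b{std::move(a)};  // selects std::string's copy constructor
  return a.label == "row-group-0" && b.label == "row-group-0";
}
```

Dropping `const` from the members and exposing the object through a `const` reference where immutability is wanted keeps the type assignable and cheaply movable.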
@nvdbaranec Do you have any idea why the TODO is still there?
/**
 * @brief Function that translates Parquet datatype to cuDF type enum
 */
type_id to_type_id(SchemaElement const& schema,
Feels like this and `to_data_type` should be `SchemaElement` members, but this is probably out of scope for this PR. Also, do we really need `to_type_id` when we have `to_data_type`?
`to_type_id` is used to interpret Parquet schema information and determine the corresponding cudf type during reading. `to_data_type` takes a `type_id` and a Parquet schema to determine the eventual column logical type.
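The two-step split described above can be sketched roughly as follows. Everything here is a toy model: the `toy_`-prefixed names, the enum values, and the schema fields are hypothetical, and the scale negation is an assumption about how a fixed-point scale might be attached. It is not the actual cuDF implementation or its signatures.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: all names below are hypothetical, not the cuDF API.
enum class toy_physical_type { INT32, INT64, DOUBLE, BYTE_ARRAY };
enum class toy_type_id { INT32, INT64, FLOAT64, STRING, DECIMAL32, DECIMAL64 };

struct toy_schema_element {
  toy_physical_type type;
  bool is_decimal;
  std::int32_t decimal_scale;
};

struct toy_data_type {
  toy_type_id id;
  std::int32_t scale;  // only meaningful for decimal ids
};

// Step 1: interpret the schema element and pick a type id.
inline toy_type_id toy_to_type_id(toy_schema_element const& schema)
{
  if (schema.is_decimal) {
    if (schema.type == toy_physical_type::INT32) { return toy_type_id::DECIMAL32; }
    if (schema.type == toy_physical_type::INT64) { return toy_type_id::DECIMAL64; }
  }
  switch (schema.type) {
    case toy_physical_type::INT32: return toy_type_id::INT32;
    case toy_physical_type::INT64: return toy_type_id::INT64;
    case toy_physical_type::DOUBLE: return toy_type_id::FLOAT64;
    case toy_physical_type::BYTE_ARRAY: return toy_type_id::STRING;
  }
  return toy_type_id::STRING;  // unreachable fallback for this toy enum
}

// Step 2: combine the type id with the schema to get the full logical
// type. Assumption: decimals carry a negated scale; other types use 0.
inline toy_data_type toy_to_data_type(toy_type_id id, toy_schema_element const& schema)
{
  bool const is_decimal = id == toy_type_id::DECIMAL32 || id == toy_type_id::DECIMAL64;
  return toy_data_type{id, is_decimal ? -schema.decimal_scale : 0};
}
```

In this sketch the first function answers "which kind of column is this?" and the second adds the schema-dependent detail (here, a scale) that the type id alone cannot carry.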
Lots to process, but looking good.
@gpucibot merge
This is a non-trivial refactor of the Parquet reader; no new features or changes in algorithms were made:
Note that this merely moves the current implementation around in preparation for adding a chunked Parquet reader, which is a fairly large implementation.
This is also a blocker for: