Adding support for list and struct type in ORC Reader #8599
….read_orc(...). This allows for single calls to cudf.read_orc(...) and batching multiple read operations into a single read operation and therefore a single resulting dataframe
…t be specified multiple times
…ies multiple stripes from a single ORC file
A few more small suggestions.
cpp/src/io/orc/orc.cpp
Outdated
uint32_t parent_idx = static_cast<uint32_t>(schema_idxs[col_id].parent);
uint32_t field_idx = static_cast<uint32_t>(schema_idxs[col_id].field);
Not a big deal, but the logic here is implicit/fragile. The invalid value (`-1`) is only covered in the next line because of unsigned integer underflow.
IMO there should be a validity check for `schema_idxs[col_id].field` (and maybe `schema_idxs[col_id].parent`, not sure) before we compare against `fieldNames.size()` and potentially set the column name.
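To make the concern concrete, here is a minimal sketch of the two approaches. It is not the cuDF code: `schema_index`, `field_name_implicit`, and `field_name_checked` are hypothetical names, and the only assumption carried over from the diff is that `-1` marks an invalid `parent` or `field`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of schema_idxs; -1 means "not set".
struct schema_index {
  int32_t parent;
  int32_t field;
};

// Implicit handling: casting -1 to uint32_t wraps around to 4294967295,
// so the bounds check rejects it only as a side effect of the cast.
std::string const* field_name_implicit(schema_index idx,
                                       std::vector<std::string> const& field_names)
{
  auto const field_idx = static_cast<uint32_t>(idx.field);
  if (field_idx < field_names.size()) { return &field_names[field_idx]; }
  return nullptr;
}

// Explicit handling: reject the invalid marker first, then do the bounds check.
std::string const* field_name_checked(schema_index idx,
                                      std::vector<std::string> const& field_names)
{
  if (idx.field < 0) { return nullptr; }  // invalid marker, nothing to look up
  auto const field_idx = static_cast<std::size_t>(idx.field);
  if (field_idx >= field_names.size()) { return nullptr; }
  return &field_names[field_idx];
}
```

Both functions return the same results for any realistic number of field names, but the checked version states the intent directly and keeps working if the index type or the invalid marker changes.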
🔥 🔥 🔥
🚦 🟢
Awesome @rgsl888prabhu ! 🔥
@gpucibot merge
The fix from #8174 got reverted in rework of `orc/reader_impl.cu` in #8599. This PR reinstates the original fix to prevent an assert in debug mode in `gtests/ORC_TEST`.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Conor Hoekstra (https://github.com/codereport)

URL: #8706
This PR adds support for list and struct types in the ORC reader.
Columns are processed one nesting level at a time because, for a list column, the number of child rows per stripe and the total number of child rows must be extracted before the child columns can be processed.
For a struct, all child columns have the same number of rows, so struct children are processed along with the parent in the same level.
As a result, the PR handles the child columns of lists and structs differently.
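The level rule described above can be illustrated with a small sketch. It only mirrors the description (struct children share the parent's level, list children move to the next one) and is not taken from the reader implementation; `kind`, `column`, and `assign_levels` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified schema node; children are indices into a flat schema.
enum class kind { primitive, list, structure };

struct column {
  kind type;
  std::vector<std::size_t> children;
};

// Assign a processing level to `col` and its descendants.
// Struct children stay in the parent's level because their row counts are
// already known; list children go to the next level because their row counts
// are known only after the list column itself has been processed.
void assign_levels(std::vector<column> const& schema,
                   std::size_t col,
                   int level,
                   std::vector<int>& levels)
{
  levels[col] = level;
  int const child_level = (schema[col].type == kind::list) ? level + 1 : level;
  for (std::size_t child : schema[col].children) {
    assign_levels(schema, child, child_level, levels);
  }
}
```

For a schema such as `struct<a:int, b:list<int>>`, calling `assign_levels` on the root with level 0 places the struct, `a`, and `b` in level 0, and the list's element column in level 1.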
closes #8582