Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for list and struct type in ORC Reader #8599

Merged

Conversation

rgsl888prabhu
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu commented Jun 24, 2021

This PR adds support for lists and struct in ORC reader.

The columns are processed as per nesting level since in case of list, you need to extract number of child rows per stripe and number of child rows in total, before you can process them.

But in case of struct, all the child columns will have same number of rows, so struct children are processed along with the parent in the same level.

So, you will observe that there is a distinction on how child columns of list and struct are handled in the PR.

closes #8582

jdye64 and others added 30 commits April 25, 2021 11:02
….read_orc(...). This allows for single calls to cudf.read_orc(...) and batching multiple read operations into a single read operation and therefore a single resulting dataframe
@rgsl888prabhu rgsl888prabhu added the 2 - In Progress Currently a work in progress label Jul 2, 2021
@rgsl888prabhu rgsl888prabhu added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 3, 2021
@rgsl888prabhu rgsl888prabhu requested review from devavret and vuule July 3, 2021 21:09
@rapidsai rapidsai deleted a comment from rgsl888prabhu Jul 4, 2021
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more small suggestions.

Comment on lines 472 to 473
uint32_t parent_idx = static_cast<uint32_t>(schema_idxs[col_id].parent);
uint32_t field_idx = static_cast<uint32_t>(schema_idxs[col_id].field);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but the logic here is implicit/fragile. The invalid value (-1) is only covered in the next line because of unsigned integer underflow.
IMO there should be a validity check for schema_idxs[col_id].field (and maybe schema_idxs[col_id].parent, not sure) before we compare against fieldNames.size() and potentially set the column name.

cpp/src/io/orc/reader_impl.cu Show resolved Hide resolved
cpp/src/io/orc/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/reader_impl.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 🔥 🔥

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚦 🟢

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @rgsl888prabhu ! 🔥

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Jul 6, 2021
@vuule
Copy link
Contributor

vuule commented Jul 6, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e855eb9 into rapidsai:branch-21.08 Jul 6, 2021
rapids-bot bot pushed a commit that referenced this pull request Jul 12, 2021
The fix from #8174 got reverted in rework of `orc/reader_impl.cu` in #8599 
This PR reinstates the original fix to prevent an assert in debug mode in `gtests/ORC_TEST`

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Conor Hoekstra (https://github.com/codereport)

URL: #8706
@vuule vuule added the cuIO cuIO issue label Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support list types in ORC reader
5 participants