Adding MAP type support for ORC Reader #9132

Merged: 27 commits into rapidsai:branch-21.10 on Sep 9, 2021

rgsl888prabhu
Contributor

Since cuDF does not yet support a MAP type, a MAP column is read as a list of structs, where each struct has a key and a value as its members.
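As a rough illustration of that layout, here is a minimal pure-Python sketch of how one MAP row can be viewed as a list of key/value structs. The function name `map_to_list_of_structs` and the field names `"key"` and `"value"` are assumptions made for this example; it does not call cuDF or show the reader's actual implementation.

```python
from typing import Any, Dict, List


def map_to_list_of_structs(m: Dict[Any, Any]) -> List[Dict[str, Any]]:
    """View one MAP row as a list of structs with "key" and "value" members.

    Hypothetical helper for illustration only; the field names are assumed.
    """
    return [{"key": k, "value": v} for k, v in m.items()]


# Three MAP rows, including an empty map.
rows = [{"a": 1, "b": 2}, {}, {"c": 3}]
as_lists = [map_to_list_of_structs(r) for r in rows]
```

An empty map becomes an empty list, so the row count of the column is unchanged.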

#8826 Completes Reader part of MAP support.

rgsl888prabhu self-assigned this on Aug 27, 2021
github-actions bot added the labels Python (Affects Python cuDF API) and libcudf (Affects libcudf (C++/CUDA) code) on Aug 27, 2021
rgsl888prabhu marked this pull request as ready for review on August 27, 2021 07:26
rgsl888prabhu requested review from a team as code owners on August 27, 2021 07:26
rgsl888prabhu added the labels 4 - Needs cuIO Reviewer, cuIO (cuIO issue), feature request (New feature or request), and non-breaking (Non-breaking change), and removed the label libcudf (Affects libcudf (C++/CUDA) code) on Aug 27, 2021
rgsl888prabhu requested review from vuule and devavret and removed the request for harrism on August 27, 2021 07:27
Contributor

vuule commented Aug 27, 2021

Returning information on which columns are maps might be part of the requirements, which would mean switching to table_input_metadata. I will follow up on this.

codecov bot commented Aug 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@1935a8a).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #9132   +/-   ##
===============================================
  Coverage                ?   10.77%           
===============================================
  Files                   ?      115           
  Lines                   ?    19138           
  Branches                ?        0           
===============================================
  Hits                    ?     2062           
  Misses                  ?    17076           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1935a8a...95eec44.

Contributor

@vuule vuule left a comment

Looks good, just got a few nitpicks.

cpp/src/io/orc/reader_impl.cu (outdated, resolved)
cpp/src/io/orc/reader_impl.cu (outdated, resolved)
cpp/src/io/orc/reader_impl.cu (outdated, resolved)
cpp/src/io/orc/reader_impl.cu (resolved)
"columns",
[None, ["lvl1_map", "lvl2_struct_map"], ["lvl2_struct_map", "lvl2_map"]],
)
@pytest.mark.parametrize("num_rows", [0, 15, 1005, 10561, 100000])
Contributor

Is 100000 enough to have multiple stripes?

Contributor Author

Yes, it has 98 stripes

Contributor

I hope you mean row groups; 98 stripes would make a ~25 GB file :D

Contributor

oh, it's writing with tiny stripes, now I see :D
carry on, then :)
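The numbers in this exchange can be checked with a quick back-of-the-envelope calculation. The sketch below only restates the figures quoted above (100000 rows, 98 stripes, and the reviewer's ~25 GB estimate); the function names are hypothetical, and the per-stripe values are simply what those figures imply, not measured properties of the test file.

```python
def avg_rows_per_stripe(num_rows: int, num_stripes: int) -> float:
    """Average number of rows per stripe for a file with the given totals."""
    return num_rows / num_stripes


def implied_stripe_size_mb(total_gb: float, num_stripes: int) -> float:
    """Per-stripe size in MB implied by a total file size estimate in GB."""
    return total_gb * 1024 / num_stripes


# Figures quoted in the thread: 100000 rows written as 98 stripes.
rows_per_stripe = avg_rows_per_stripe(100000, 98)

# The reviewer's ~25 GB estimate implies a much larger stripe size.
stripe_mb = implied_stripe_size_mb(25, 98)
```

Roughly a thousand rows per stripe is far below the per-stripe size the ~25 GB estimate assumes, which is consistent with the file being written with tiny stripes.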

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 1, 2021
case orc::DECIMAL:
if (type == type_id::DECIMAL64) {
scale = -static_cast<int32_t>(_metadata->get_types()[orc_col_id].scale.value_or(0));
}
Contributor

Does this need to throw an exception if type == type_id::DECIMAL32?

Contributor Author

For orc::DECIMAL, there are cases where the caller expects us to return a double. Also, ORC supports only Decimal64 and Decimal128, if I am not mistaken.
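For readers unfamiliar with the sign flip in the snippet under discussion, here is a minimal Python sketch of the idea, assuming the ORC scale counts digits after the decimal point while the fixed-point scale is a power-of-ten exponent. The helper names `cudf_scale_from_orc` and `decode_fixed_point` are hypothetical and are not part of libcudf.

```python
from decimal import Decimal
from typing import Optional


def cudf_scale_from_orc(orc_scale: Optional[int]) -> int:
    """Negate the ORC scale, treating a missing scale as 0.

    Mirrors the `-static_cast<int32_t>(scale.value_or(0))` expression above.
    """
    return -(orc_scale or 0)


def decode_fixed_point(unscaled: int, scale: int) -> Decimal:
    """Interpret an unscaled integer as unscaled * 10**scale."""
    return Decimal(unscaled).scaleb(scale)


# An ORC decimal with two fractional digits and unscaled value 12345.
scale = cudf_scale_from_orc(2)
value = decode_fixed_point(12345, scale)
```

With an ORC scale of 2, the exponent becomes -2, so the unscaled value 12345 is read as 123.45.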


schema = po.Struct(**schema)

lvl1_map = [
Contributor

Perhaps you'd like to parametrize on the lvl_map types. Right now, the code inside and outside the test parameter generation depends on each other. Instead, you could pass tuples of ORC file buffers and expected values as separate params.

Contributor Author

I am trying to avoid generating the buffer multiple times, since we go up to 100000 rows, just to keep execution time down.
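One way to reconcile the two positions, sketched here under assumptions, is to cache the generated buffer per row count so that several column selections reuse it. The names `make_buffer` and `check_columns` are hypothetical, and `bytes(num_rows)` is only a stand-in for building a real ORC file; this is not the approach taken in the PR.

```python
from functools import lru_cache

# Counts how many times the (expensive) generator actually runs.
calls = {"count": 0}


@lru_cache(maxsize=None)
def make_buffer(num_rows: int) -> bytes:
    """Stand-in for generating an ORC file buffer with `num_rows` rows."""
    calls["count"] += 1
    return bytes(num_rows)  # placeholder payload, not real ORC data


def check_columns(num_rows, columns):
    """Hypothetical test body: fetch the cached buffer, then select columns."""
    buf = make_buffer(num_rows)
    return len(buf), columns


# Three column selections over the same row count share one buffer.
selections = [None, ["lvl1_map", "lvl2_struct_map"], ["lvl2_struct_map", "lvl2_map"]]
results = [check_columns(100000, cols) for cols in selections]
```

The generator runs once for the 100000-row case even though three parameter combinations use it.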

Contributor

@vyasr vyasr left a comment

Looks good to me. A couple very minor suggestions, but nothing worth holding up the PR over. Feel free to resolve them however you'd like and merge.

cpp/src/io/orc/reader_impl.cu (outdated, resolved)
@@ -176,7 +176,8 @@ class reader::impl {
    */
   column_buffer&& assemble_buffer(const int32_t orc_col_id,
                                   std::vector<std::vector<column_buffer>>& col_buffers,
-                                  const size_t level);
+                                  const size_t level,
+                                  rmm::cuda_stream_view stream);
Contributor

Should this have a default, or will we always be calling it internally with a stream?

Contributor

Yes, this is for internal use only.

Co-authored-by: Vyas Ramasubramani <[email protected]>
Contributor

vuule commented Sep 8, 2021

rerun tests

vuule added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the label 4 - Needs cuDF (Python) Reviewer on Sep 8, 2021
Contributor

vuule commented Sep 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 473063f into rapidsai:branch-21.10 Sep 9, 2021
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge); cuIO (cuIO issue); feature request (New feature or request); libcudf (Affects libcudf (C++/CUDA) code); non-breaking (Non-breaking change); Python (Affects Python cuDF API)
4 participants