Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding optional parquet reader schema #11524

Merged

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Aug 12, 2022

Description

Adding a schema for reading parquet files. This is useful for things like binary data reading where the default behavior of cudf is to read it as a string column, but users wish to read it as a list column instead. Using a schema allows for nested data types to be expressed completely.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 requested review from a team as code owners August 12, 2022 19:14
@hyperbolic2346 hyperbolic2346 self-assigned this Aug 12, 2022
@github-actions github-actions bot added Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 12, 2022
@hyperbolic2346 hyperbolic2346 added feature request New feature or request 3 - Ready for Review Ready for review by team cuIO cuIO issue breaking Breaking change labels Aug 12, 2022
@revans2
Copy link
Contributor

revans2 commented Aug 12, 2022

Why does a reader change fix #11506 that complains about the writer not working? Even if this does fix it too. Shouldn't they be separate PRs?

@codecov
Copy link

codecov bot commented Aug 12, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@c19c8c9). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11524   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22949           
  Branches                ?        0           
===============================================
  Hits                    ?    19820           
  Misses                  ?     3129           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks. A couple of minor changes.

I suspect the JNI side might be missing a change, though. (I should sync with @hyperbolic2346 to confirm.)

cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/types.hpp Show resolved Hide resolved
cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
cpp/src/io/utilities/column_buffer.cpp Outdated Show resolved Hide resolved
cpp/src/io/utilities/column_buffer.cpp Outdated Show resolved Hide resolved
cpp/src/io/utilities/column_buffer.cpp Outdated Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Show resolved Hide resolved
@hyperbolic2346
Copy link
Contributor Author

Why does a reader change fix #11506 that complains about the writer not working? Even if this does fix it too. Shouldn't they be separate PRs?

Moved fix to PR #11526

@vuule vuule self-requested a review August 15, 2022 23:33
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
As expected, manually creating a schema is difficult, but the current API makes it manageable. I have no reasonable ideas on how to improve it further.
Thank you for taking the time to clean up the test code! 🔥

auto expected = std::make_unique<table>(std::move(cols));
EXPECT_EQ(1, expected->num_columns());
auto expected = table_view{{col}};
EXPECT_EQ(1, expected.num_columns());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a part of the review, just taking the opportunity to mention this. This is a weird check, right? IMO it's testing table(previously)/table_view(now) constructor, rather then anything in Parquet. I would be happier without these asserts. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought they were generally in here to prevent someone coming and adding a second column and not updating the rest of the test. It is an odd thing to see though, because why would you simply add a column without even looking at the rest of the test? I'd be happy to pull them out in this PR if you desire. Not much work and not too much point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to do it in a different PR, this one looks good to go:)

Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve but can't verify if there will be compiler issue with missing <optional> header. But I believe that we can fix it quickly if there is any.

@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4178a51 into rapidsai:branch-22.10 Aug 16, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_writer_fixes branch August 16, 2022 18:57
rapids-bot bot pushed a commit that referenced this pull request Aug 16, 2022
As noticed in review of #11524 there are unnecessary asserts in the parquet tests. This removes those.

closes #11541

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11544
@ttnghia
Copy link
Contributor

ttnghia commented Aug 16, 2022

Approve but can't verify if there will be compiler issue with missing <optional> header. But I believe that we can fix it quickly if there is any.

Just verified in local build: no compiler issue 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change cuIO cuIO issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants