[REVIEW] Adding decimal32 and decimal64 support to parquet reading #6808

Merged

hyperbolic2346
Contributor

@hyperbolic2346 hyperbolic2346 commented Nov 19, 2020

This PR adds support for reading Parquet decimal columns into the cudf decimal32 and decimal64 types. A test covers these types by embedding a Parquet data file in the C++ test source. This is temporary until Python supports decimals and the tests move there.

partially closes issue #6474
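
For readers unfamiliar with how a Parquet DECIMAL maps onto 32-bit and 64-bit storage, here is a minimal sketch. It is not code from this PR and does not use the cudf API: `storage_bits_for_precision` and `decimal_to_string` are hypothetical helper names. It relies only on the facts that any 9-digit integer fits in `int32_t` and any 18-digit integer fits in `int64_t`, and that a decimal value is an unscaled integer plus a count of digits after the decimal point. cudf's fixed_point types have their own scale convention, which this sketch does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a cudf API): bits needed to store the unscaled
// integer of a DECIMAL with the given precision (total decimal digits).
// Returns 0 when the value does not fit in a 64-bit integer.
inline int storage_bits_for_precision(int precision)
{
  assert(precision >= 1);
  if (precision <= 9) return 32;   // 999,999,999 < 2^31
  if (precision <= 18) return 64;  // 10^18 - 1 < 2^63
  return 0;
}

// Hypothetical helper: render an unscaled integer that carries `scale`
// digits after the decimal point, e.g. (12345, 2) is "123.45".
inline std::string decimal_to_string(std::int64_t unscaled, int scale)
{
  assert(scale >= 0);
  bool const negative = unscaled < 0;
  // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
  std::uint64_t const magnitude = negative ? 0 - static_cast<std::uint64_t>(unscaled)
                                           : static_cast<std::uint64_t>(unscaled);
  std::string digits = std::to_string(magnitude);
  if (scale > 0) {
    auto const frac = static_cast<std::size_t>(scale);
    // Left-pad with zeros so at least one digit precedes the decimal point.
    if (digits.size() <= frac) { digits.insert(0, frac + 1 - digits.size(), '0'); }
    digits.insert(digits.size() - frac, ".");
  }
  return negative ? "-" + digits : digits;
}
```

Under these assumptions, a column declared with precision 7 and scale 2 would fit in 32-bit storage, while precision 15 would need 64 bits.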

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner November 19, 2020 19:38
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov

codecov bot commented Nov 20, 2020

Codecov Report

Merging #6808 (11e53ea) into branch-0.17 (e01ab96) will decrease coverage by 0.33%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.17    #6808      +/-   ##
===============================================
- Coverage        82.31%   81.97%   -0.34%     
===============================================
  Files               93       96       +3     
  Lines            15358    16181     +823     
===============================================
+ Hits             12642    13265     +623     
- Misses            2716     2916     +200     
Impacted Files Coverage Δ
python/cudf/cudf/benchmarks/bench_cudf_io.py 30.61% <0.00%> (-12.25%) ⬇️
python/cudf/cudf/io/orc.py 89.39% <0.00%> (-8.23%) ⬇️
python/cudf/cudf/utils/ioutils.py 79.88% <0.00%> (-6.23%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.68% <0.00%> (-0.67%) ⬇️
python/cudf/cudf/core/column/string.py 86.30% <0.00%> (-0.59%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.53% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.07% <0.00%> (-0.29%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.55% <0.00%> (-0.22%) ⬇️
python/cudf/cudf/core/tools/datetimes.py 81.60% <0.00%> (-0.15%) ⬇️
python/cudf/cudf/io/parquet.py 91.66% <0.00%> (-0.07%) ⬇️
... and 27 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codereport codereport self-requested a review November 20, 2020 15:15
@jlowe jlowe added the cuIO, libcudf, and Spark labels Nov 24, 2020
@hyperbolic2346 hyperbolic2346 changed the title [WIP] Adding decimal32 and decimal64 support to parquet reading [REVIEW] Adding decimal32 and decimal64 support to parquet reading Nov 30, 2020
@hyperbolic2346
Contributor Author

Python tests are blocked on #6715

@hyperbolic2346
Contributor Author

Current plan is to check in a ~2k test file to read in cpp until python tests are available.

Member

@harrism harrism left a comment

Mostly doc change suggestions. I think we need a cuIO reviewer for this PR.

cpp/include/cudf/fixed_point/fixed_point.hpp (outdated, resolved)
cpp/include/cudf/io/parquet.hpp (outdated, resolved)
cpp/include/cudf/io/parquet.hpp (resolved)
cpp/include/cudf_test/column_wrapper.hpp (outdated, resolved)
cpp/include/cudf_test/column_wrapper.hpp (outdated, resolved)
@harrism
Member

harrism commented Dec 1, 2020

Would this be a non-breaking API change or a breaking change? I assume the former since it's only adding functionality. We need to know to set appropriate labels for the automerger.

@harrism harrism added the feature request label Dec 1, 2020
Contributor

@devavret devavret left a comment

LGTM

cpp/include/cudf/fixed_point/fixed_point.hpp (outdated, resolved)
cpp/include/cudf/io/parquet.hpp (resolved)
@hyperbolic2346 hyperbolic2346 added the non-breaking label Dec 1, 2020
Adding suggestions from review for comment changes.

Co-authored-by: Mark Harris <[email protected]>
Contributor

@vuule vuule left a comment

Looks good!
Some suggestions related to test coverage.

cpp/tests/io/parquet_test.cpp (outdated, resolved)
cpp/tests/io/parquet_test.cpp (resolved)
…ere is no exception thrown.

Added some commented code to attempt testing the double reading code.
Implemented the great suggestions of explicit conversion to std::string for output in tests.
@harrism harrism added the 5 - Ready to Merge label and removed the 4 - Needs cuIO Reviewer label Dec 3, 2020
@harrism
Member

harrism commented Dec 3, 2020

> I didn't add any tests in cpp for this because without a parquet file stored in a directory,

@hyperbolic2346 I think the description for this PR is out of date based on the above. Can you please update it since it will be used in the merge commit message?

@sperlingxx
Contributor

I think we also need to enable set_strict_decimal_types at the JNI level. I am not sure whether this should be done in this PR or in a separate PR.

@revans2
Contributor

revans2 commented Dec 3, 2020

> I think we also need to enable set_strict_decimal_types at the JNI level. I am not sure whether this should be done in this PR or in a separate PR.

I think that needs to be a separate PR to avoid blocking this from going in.
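
For context on what a strict-decimal switch might control, here is a hedged sketch. It does not call cudf and does not claim to reproduce this PR's implementation. It assumes, based on the commit notes above about exception handling and a "double reading" path, that a decimal too wide for 64-bit storage either raises an error (strict) or is read as a double (non-strict). The names `column_kind` and `choose_column_kind` are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical result type: which kind of column a decimal would be read into.
enum class column_kind { decimal32, decimal64, float64 };

// Hypothetical model of a strict-decimal switch (an assumption, not cudf code).
// precision is the total number of decimal digits declared for the column.
inline column_kind choose_column_kind(int precision, bool strict_decimal_types)
{
  assert(precision >= 1);
  if (precision <= 9) return column_kind::decimal32;   // fits in a 32-bit integer
  if (precision <= 18) return column_kind::decimal64;  // fits in a 64-bit integer
  if (strict_decimal_types) {
    // Strict mode: refuse to lose exactness silently.
    throw std::logic_error("decimal precision exceeds 64-bit fixed-point range");
  }
  // Non-strict mode: fall back to a lossy floating-point column.
  return column_kind::float64;
}
```

A binding layer such as JNI would then only need to forward one boolean to whatever reader option controls this behavior.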

@rapids-bot rapids-bot bot merged commit b9ef96c into rapidsai:branch-0.17 Dec 4, 2020
@hyperbolic2346 hyperbolic2346 deleted the mwilson/decimal_parquet_read branch December 4, 2020 01:42
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.), non-breaking (Non-breaking change), Spark (Functionality that helps Spark RAPIDS)
10 participants