
[REVIEW] Raise temporary error for decimal128 types in parquet reader #9804

Merged: 12 commits into rapidsai:branch-22.02 on Dec 7, 2021

Conversation

@galipremsagar (Contributor)
This PR adds decimal128 type validation to the parquet reader. It is put in place to unblock the libcudf changes in #9765; the validation will be removed once the Python side of the decimal128 changes is merged (currently blocked by a libcudf from_arrow bug).
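For illustration, here is a minimal sketch of the kind of up-front schema check described above, built on the pyarrow calls mentioned later in this review (pq.read_metadata / to_arrow_schema). The helper name, the precision-18 cutoff, and the error message are assumptions for illustration, not this PR's actual code:

# Illustrative sketch only; not this PR's actual implementation.
import pyarrow as pa
import pyarrow.parquet as pq

def _validate_no_decimal128(path):
    # Read only the file metadata, not the data itself.
    schema = pq.read_metadata(path).schema.to_arrow_schema()
    for field in schema:
        # Assumption: precision above 18 no longer fits in a 64-bit
        # decimal column, so it would require decimal128 storage.
        if pa.types.is_decimal(field.type) and field.type.precision > 18:
            raise NotImplementedError(
                f"Decimal128 column {field.name!r} is not yet supported"
            )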

@galipremsagar added labels 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, and cuIO on Nov 30, 2021
@galipremsagar requested a review from vuule on November 30, 2021
@galipremsagar self-assigned this on Nov 30, 2021
@galipremsagar requested a review from a team as a code owner on November 30, 2021
@galipremsagar added labels breaking and feature request on Nov 30, 2021
@codecov
codecov bot commented Dec 1, 2021

Codecov Report

Merging #9804 (6014e1b) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is 5.24%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9804      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20447     +142     
================================================
+ Hits               2130     2133       +3     
- Misses            18175    18314     +139     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/json.py 0.00% <ø> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8002cbd...6014e1b.

@vuule (Contributor) left a comment:

Looks good, thanks for taking care of this!

@vyasr (Contributor) left a comment:

Do we know anything about the performance implications of this? I don't know how expensive functions like to_arrow_schema and pq.read_metadata are, but it does seem like significant error-checking work to be done up front. Would it be possible to try reading the file, catching whatever error we get back from that, and then checking for 128-bit columns after the fact?
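For concreteness, a sketch of the alternative being suggested here, reusing the hypothetical _validate_no_decimal128 helper from the sketch near the top of this thread; the exception type caught is also an assumption:

# Hypothetical "check only on failure" variant, not actual cuDF code.
import cudf

def read_parquet_with_late_check(path, **kwargs):
    try:
        return cudf.read_parquet(path, **kwargs)
    except RuntimeError:
        # Pay the metadata-inspection cost only on the failure path,
        # re-raising a clearer error if a decimal128 column is the cause.
        _validate_no_decimal128(path)  # hypothetical helper from above
        raise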

Review thread on python/cudf/cudf/io/parquet.py (resolved)
@galipremsagar (Contributor, Author) commented Dec 7, 2021:

> Do we know anything about the performance implications of this? I don't know how expensive functions like to_arrow_schema and pq.read_metadata are, but it does seem like significant error-checking work to be done up front. Would it be possible to try reading the file, catching whatever error we get back from that, and then checking for 128-bit columns after the fact?

Here is a benchmark of reading a 1000-column dataframe:

In [3]: df = cudf.DataFrame(columns=range(1, 1000))

In [4]: df
Out[4]:
Empty DataFrame
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ..., 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999]
Index: []

In [5]: df[1] = ['a', 'b', 'c']

In [6]: df
Out[6]: 
  1     2     3     4     5     6     7     8     9     10    11    12    13    14    15   ...   985   986   987   988   989   990   991   992   993   994   995   996   997   998   999
0   a  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
1   b  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2   c  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

[3 rows x 999 columns]

In [10]: df.columns = df.columns.astype('str')

In [11]: df.to_parquet('long.parquet')

# Before
In [12]: %timeit cudf.read_parquet("long.parquet")
151 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# This PR
In [3]: %timeit cudf.read_parquet("long.parquet")
171 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought it was OK to introduce this overhead since it is temporary and will soon be gone once Python support is enabled for Decimal128Column. But feel free to let me know your thoughts on this, as I could be wrong too.

@vuule requested a review from vyasr on December 7, 2021
@vyasr (Contributor) left a comment:

> I thought it was OK to introduce this overhead since it is temporary and will soon be gone once Python support is enabled for Decimal128Column. But feel free to let me know your thoughts on this, as I could be wrong too.

Thanks for the context, I missed that. A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary. I think that as long as we anticipate the Decimal128 issues being resolved in this release it should be fine. I wouldn't want to release 22.02 with this performance regression in it though. Do you or @vuule know what the timeline is?

Review thread on python/cudf/cudf/tests/test_parquet.py (outdated, resolved)
Review thread on python/cudf/cudf/tests/test_parquet.py (resolved)
@galipremsagar (Contributor, Author):

> I wouldn't want to release 22.02 with this performance regression in it though. Do you or @vuule know what the timeline is?

Yup, it is being targeted for 22.02

@vuule (Contributor) commented Dec 7, 2021:

> A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary

IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.

@vyasr (Contributor) left a comment:

> A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary

> IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.

It's not actually empty, right? @galipremsagar added enough data that it will actually write out a meaningful file with many columns, so it's not a trivial read. If the overhead doesn't scale with the number of rows, then I agree that it should become negligible at reasonable data sizes, though.

In any case, as long as we're targeting this for 22.02 the discussion is moot and I'm fine with this solution.

@vuule added label 5 - Ready to Merge and removed labels 3 - Ready for Review and 4 - Needs cuDF (Python) Reviewer on Dec 7, 2021
@galipremsagar (Contributor, Author):

@gpucibot merge

@rapids-bot merged commit ba3aedb into rapidsai:branch-22.02 on Dec 7, 2021.
rapids-bot pushed a commit that referenced this pull request on Dec 8, 2021:
Closes #9566
Depends on #9804

Read decimal columns as 128-bit when the input width requires it.
Write decimal128 columns as `FIXED_LEN_BYTE_ARRAY`.
Use the smallest viable decimal size when reading `FIXED_LEN_BYTE_ARRAY` (this used to default to decimal64, even when 32 bits are sufficient).
Remove the `strict_decimal_types` option from the Parquet reader; we can now always read using the exact decimal type.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Devavret Makkar (https://github.com/devavret)
  - MithunR (https://github.com/mythrocks)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - https://github.com/nvdbaranec

URL: #9765
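
As a rough illustration of the round trip that commit describes, here is a small pyarrow example (pyarrow is used for brevity; libcudf's reader and writer are what the commit actually changes, and the file name is arbitrary):

# Illustration with pyarrow, not the libcudf code the commit changes.
from decimal import Decimal
import pyarrow as pa
import pyarrow.parquet as pq

# Precision 38 cannot fit in 64 bits, so the column must round-trip
# as a 128-bit decimal (stored as FIXED_LEN_BYTE_ARRAY on disk).
table = pa.table({"x": pa.array([Decimal("1.23")], type=pa.decimal128(38, 2))})
pq.write_table(table, "dec128.parquet")
print(pq.read_table("dec128.parquet").schema)  # x: decimal128(38, 2)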
Labels: 5 - Ready to Merge, breaking, cuIO, feature request, Python
3 participants