
[REVIEW] Raise temporary error for decimal128 types in parquet reader #9804

Merged: 12 commits into rapidsai:branch-22.02 on Dec 7, 2021

Conversation

@galipremsagar (Contributor)
This PR adds decimal128 type validation to the parquet reader. It is put in place to unblock the libcudf changes in #9765; the validation will be removed once the Python side of the decimal128 changes is merged (currently blocked by a libcudf from_arrow bug).
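For illustration, here is a minimal sketch of the kind of up-front schema check described above, built on the pyarrow calls mentioned later in this review (pq.read_metadata / to_arrow_schema). The helper name, the precision-18 cutoff, and the error message are assumptions for illustration, not this PR's actual code:

# Illustrative sketch only; not this PR's actual implementation.
import pyarrow as pa
import pyarrow.parquet as pq

def _validate_no_decimal128(path):
    # Read only the file metadata, not the data itself.
    schema = pq.read_metadata(path).schema.to_arrow_schema()
    for field in schema:
        # Assumption: precision above 18 no longer fits in a 64-bit
        # decimal column, so it would require decimal128 storage.
        if pa.types.is_decimal(field.type) and field.type.precision > 18:
            raise NotImplementedError(
                f"Decimal128 column {field.name!r} is not yet supported"
            )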

@galipremsagar added labels 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, and cuIO on Nov 30, 2021
@galipremsagar requested a review from vuule on November 30, 2021
@galipremsagar self-assigned this on Nov 30, 2021
@galipremsagar requested a review from a team as a code owner on November 30, 2021
@galipremsagar added labels breaking and feature request on Nov 30, 2021
@codecov
codecov bot commented Dec 1, 2021

Codecov Report

Merging #9804 (6014e1b) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is 5.24%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9804      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20447     +142     
================================================
+ Hits               2130     2133       +3     
- Misses            18175    18314     +139     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/json.py 0.00% <ø> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8002cbd...6014e1b.

@vuule (Contributor) left a comment:

Looks good, thanks for taking care of this!

@vyasr (Contributor) left a comment:

Do we know anything about the performance implications of this? I don't know how expensive functions like to_arrow_schema and pq.read_metadata are, but it does seem like significant error-checking work to be done up front. Would it be possible to try reading the file, catching whatever error we get back from that, and then checking for 128-bit columns after the fact?
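For concreteness, a sketch of the alternative being suggested here, reusing the hypothetical _validate_no_decimal128 helper from the sketch near the top of this thread; the exception type caught is also an assumption:

# Hypothetical "check only on failure" variant, not actual cuDF code.
import cudf

def read_parquet_with_late_check(path, **kwargs):
    try:
        return cudf.read_parquet(path, **kwargs)
    except RuntimeError:
        # Pay the metadata-inspection cost only on the failure path,
        # re-raising a clearer error if a decimal128 column is the cause.
        _validate_no_decimal128(path)  # hypothetical helper from above
        raise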

Review thread on python/cudf/cudf/io/parquet.py (resolved)
@galipremsagar (Contributor, Author) commented Dec 7, 2021:

> Do we know anything about the performance implications of this? I don't know how expensive functions like to_arrow_schema and pq.read_metadata are, but it does seem like significant error-checking work to be done up front. Would it be possible to try reading the file, catching whatever error we get back from that, and then checking for 128-bit columns after the fact?

Here is a benchmark of reading a 1000-column dataframe:

In [3]: df = cudf.DataFrame(columns=range(1, 1000))

In [4]: df
Out[4]:
Empty DataFrame
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ..., 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999]
Index: []

In [5]: df[1] = ['a', 'b', 'c']

In [6]: df
Out[6]: 
  1     2     3     4     5     6     7     8     9     10    11    12    13    14    15   ...   985   986   987   988   989   990   991   992   993   994   995   996   997   998   999
0   a  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
1   b  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2   c  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  ...  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

[3 rows x 999 columns]

In [10]: df.columns = df.columns.astype('str')

In [11]: df.to_parquet('long.parquet')

# Before
In [12]: %timeit cudf.read_parquet("long.parquet")
151 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# This PR
In [3]: %timeit cudf.read_parquet("long.parquet")
171 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought it was OK to introduce this overhead since it is temporary and will soon be gone once Python support is enabled for Decimal128Column. But feel free to let me know your thoughts on this, as I could be wrong too.

@vuule requested a review from vyasr on December 7, 2021
@vyasr (Contributor) left a comment:

> I thought it was OK to introduce this overhead since it is temporary and will soon be gone once Python support is enabled for Decimal128Column. But feel free to let me know your thoughts on this, as I could be wrong too.

Thanks for the context, I missed that. A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary. I think that as long as we anticipate the Decimal128 issues being resolved in this release it should be fine. I wouldn't want to release 22.02 with this performance regression in it though. Do you or @vuule know what the timeline is?

Review thread on python/cudf/cudf/tests/test_parquet.py (outdated, resolved)
Review thread on python/cudf/cudf/tests/test_parquet.py (resolved)
@galipremsagar (Contributor, Author):

> I wouldn't want to release 22.02 with this performance regression in it though. Do you or @vuule know what the timeline is?

Yup, it is being targeted for 22.02

@vuule (Contributor) commented Dec 7, 2021:

> A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary

IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.

@vyasr (Contributor) left a comment:

> A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary

> IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.

It's not actually empty, right? @galipremsagar added enough data that it will actually write out a meaningful file with many columns, so it's not a trivial read. If the overhead doesn't scale with the number of rows, then I agree that it should become negligible at reasonable data sizes, though.

In any case, as long as we're targeting this for 22.02 the discussion is moot and I'm fine with this solution.

@vuule added label 5 - Ready to Merge and removed labels 3 - Ready for Review and 4 - Needs cuDF (Python) Reviewer on Dec 7, 2021
@galipremsagar (Contributor, Author):

@gpucibot merge

@rapids-bot merged commit ba3aedb into rapidsai:branch-22.02 on Dec 7, 2021.
rapids-bot pushed a commit that referenced this pull request on Dec 8, 2021:
Closes #9566
Depends on #9804

Read decimal columns as 128-bit when the input width requires it.
Write decimal128 columns as `FIXED_LEN_BYTE_ARRAY`.
Use the smallest viable decimal size when reading `FIXED_LEN_BYTE_ARRAY` (this used to default to decimal64, even when 32 bits are sufficient).
Remove the `strict_decimal_types` option from the Parquet reader; we can now always read using the exact decimal type.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Devavret Makkar (https://github.com/devavret)
  - MithunR (https://github.com/mythrocks)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - https://github.com/nvdbaranec

URL: #9765
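
As a rough illustration of the round trip that commit describes, here is a small pyarrow example (pyarrow is used for brevity; libcudf's reader and writer are what the commit actually changes, and the file name is arbitrary):

# Illustration with pyarrow, not the libcudf code the commit changes.
from decimal import Decimal
import pyarrow as pa
import pyarrow.parquet as pq

# Precision 38 cannot fit in 64 bits, so the column must round-trip
# as a 128-bit decimal (stored as FIXED_LEN_BYTE_ARRAY on disk).
table = pa.table({"x": pa.array([Decimal("1.23")], type=pa.decimal128(38, 2))})
pq.write_table(table, "dec128.parquet")
print(pq.read_table("dec128.parquet").schema)  # x: decimal128(38, 2)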
Labels: 5 - Ready to Merge, breaking, cuIO, feature request, Python
3 participants