[REVIEW] Raise temporary error for decimal128 types in parquet reader #9804
Conversation
Codecov Report
```diff
@@             Coverage Diff              @@
##           branch-22.02    #9804   +/-  ##
=============================================
- Coverage       10.49%     10.43%   -0.06%
=============================================
  Files             119        119
  Lines           20305      20447     +142
=============================================
+ Hits             2130       2133       +3
- Misses          18175      18314     +139
```
Continue to review full report at Codecov.
Co-authored-by: Vukasin Milovanovic <[email protected]>
Looks good, thanks for taking care of this!
Do we know anything about the performance implications of this? I don't know how expensive functions like `to_arrow_schema` and `pq.read_metadata` are, but it does seem like significant error-checking work to be done up front. Would it be possible to try reading the file, catching whatever error we get back from that, and then checking for 128-bit columns after the fact?
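For concreteness, here is a rough sketch of that try-first alternative. `read_with_libcudf` is a hypothetical stand-in for the actual read path, the caught error type is a guess, and the precision-18 cutoff assumes decimal64 is the widest decimal type supported at this point:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def read_checking_after_the_fact(path):
    try:
        return read_with_libcudf(path)  # hypothetical stand-in for the real read
    except RuntimeError:
        # Only pay the metadata-inspection cost on the failure path.
        schema = pq.read_metadata(path).schema.to_arrow_schema()
        has_decimal128 = any(
            pa.types.is_decimal(f.type) and f.type.precision > 18
            for f in schema
        )
        if has_decimal128:
            raise NotImplementedError(
                "decimal128 columns are not supported yet"
            )
        raise  # unrelated failure; re-raise the original error
```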
Here is a benchmark of reading a 1000-column dataframe:
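(The measured output of the original benchmark is not preserved in this thread. The sketch below is a hypothetical way to run that kind of measurement; the sizes and file name are illustrative.)

```python
import time
import cudf

# 1000 columns with only a few rows: read time is dominated by per-column
# metadata handling rather than by data volume.
df = cudf.DataFrame({f"col_{i}": [1, 2, 3] for i in range(1000)})
df.to_parquet("bench.parquet")

start = time.perf_counter()
cudf.read_parquet("bench.parquet")
print(f"read time: {time.perf_counter() - start:.4f} s")
```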
I thought it was OK to introduce this overhead since this is a temporary one and will soon be gone once python support is enabled for `Decimal128Column`. But feel free to let me know your thoughts on this as I could be wrong too.
> I thought it was OK to introduce this overhead since this is a temporary one and will soon be gone once python support is enabled for `Decimal128Column`. But feel free to let me know your thoughts on this as I could be wrong too.
Thanks for the context, I missed that. A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary. I think that as long as we anticipate the Decimal128 issues being resolved in this release it should be fine. I wouldn't want to release 22.02 with this performance regression in it though. Do you or @vuule know what the timeline is?
Yup, it is being targeted for 22.02.
Co-authored-by: Vyas Ramasubramani <[email protected]>
IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.
> A 10-15% performance regression is pretty undesirable, but not as big a deal if it's temporary

> IMO it's fine, as this is 15% with an empty dataframe, and the overhead does not increase with the number of rows. So it should be negligible in most cases.
It's not actually empty, right? @galipremsagar added enough data that it will actually write out a meaningful file with many columns, so it's not a trivial read. If the overhead doesn't scale with the number of rows then I agree that it should become negligible for a reasonable size though.
In any case, as long as we're targeting this for 22.02 the discussion is moot and I'm fine with this solution.
@gpucibot merge
Closes #9566
Depends on #9804

Read decimal columns as 128-bit when the input width requires it. Write decimal128 columns as `FIXED_LEN_BYTE_ARRAY`. Use the smallest viable decimal size when reading `FIXED_LEN_BYTE_ARRAY` (this used to default to decimal64, even when 32 bits are sufficient). Removes the `strict_decimal_types` option from the Parquet reader; we can now always read using the exact decimal type.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Devavret Makkar (https://github.com/devavret)
- MithunR (https://github.com/mythrocks)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- https://github.com/nvdbaranec

URL: #9765
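As a side note, the `FIXED_LEN_BYTE_ARRAY` encoding mentioned above can be observed at the pyarrow level. This small illustration uses pyarrow rather than libcudf, and the file name is illustrative:

```python
import decimal
import pyarrow as pa
import pyarrow.parquet as pq

# Precision 38 exceeds the 18 digits a 64-bit decimal can represent,
# so the column must use a 128-bit decimal type.
arr = pa.array(
    [decimal.Decimal("12345678901234567890.123456789012345678")],
    type=pa.decimal128(38, 18),
)
pq.write_table(pa.table({"d": arr}), "dec128.parquet")

# The Parquet footer records the physical type used for the column;
# for a wide decimal this should print FIXED_LEN_BYTE_ARRAY.
print(pq.read_metadata("dec128.parquet").schema.column(0).physical_type)
```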
This PR adds a `decimal128` type validation in the parquet reader. It is put in place to unblock the libcudf changes in #9765; the validation will be removed once the Python side of the `decimal128` changes is merged (currently blocked by a libcudf `from_arrow` bug).
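A minimal sketch of what such an up-front validation might look like; the actual check in the PR may differ, and the error type and message here are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def validate_no_decimal128(path):
    # Inspect only the file footer before handing off to the actual reader.
    schema = pq.read_metadata(path).schema.to_arrow_schema()
    for field in schema:
        # Decimal64 holds at most 18 digits of precision; wider decimals
        # require the not-yet-supported decimal128 representation.
        if pa.types.is_decimal(field.type) and field.type.precision > 18:
            raise NotImplementedError(
                f"Column {field.name!r} requires decimal128, "
                "which is not yet supported in the Python layer"
            )
```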