
[REVIEW] Avoid decimal type narrowing for decimal binops #10299

Merged: 11 commits merged into rapidsai:branch-22.04 on Feb 23, 2022

Conversation

galipremsagar (Contributor):

Fixes: #10282

This PR removes decimal type narrowing and also updates the tests accordingly.
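(Editorial illustration, not part of the PR description: a minimal sketch of the intended behavior, assuming cuDF's public decimal dtypes; the construction pattern and the chosen precision/scale are assumptions, not taken from the PR.)

from decimal import Decimal
import cudf

dtype = cudf.Decimal64Dtype(precision=10, scale=2)
lhs = cudf.Series([Decimal("1.23"), Decimal("4.56")]).astype(dtype)
rhs = cudf.Series([Decimal("7.89"), Decimal("0.12")]).astype(dtype)

result = lhs * rhs
# With narrowing removed, result.dtype should never be a narrower decimal
# type (e.g. Decimal32) than either operand, even when the values would fit.
print(result.dtype)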

@galipremsagar added the labels 3 - Ready for Review (Ready for review by team), Python (Affects Python cuDF API), 4 - Needs cuDF (Python) Reviewer, improvement (Improvement / enhancement to an existing function), and breaking (Breaking change) on Feb 15, 2022
@galipremsagar requested a review from shwina on February 15, 2022 17:37
@galipremsagar requested a review from a team as a code owner on February 15, 2022 17:37
@galipremsagar self-assigned this on Feb 15, 2022

codecov bot commented Feb 15, 2022

Codecov Report

Merging #10299 (56987a4) into branch-22.04 (203f7b0) will decrease coverage by 0.05%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10299      +/-   ##
================================================
- Coverage         10.67%   10.62%   -0.06%     
================================================
  Files               122      122              
  Lines             20878    20977      +99     
================================================
  Hits               2228     2228              
- Misses            18650    18749      +99     
Impacted Files                               Coverage Δ
python/cudf/cudf/core/column/decimal.py      0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py              0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py              0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py           0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py         0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py       0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py       0.00% <0.00%> (ø)
python/cudf/cudf/core/indexed_frame.py       0.00% <0.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py     0.00% <0.00%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 203f7b0...56987a4.

@@ -364,18 +364,51 @@ def _get_decimal_type(lhs_dtype, rhs_dtype, op):
else:
bdice (Contributor) commented Feb 16, 2022:
I see a comment above:

> This should at some point be hooked up to libcudf's binary_operation_fixed_point_scale

Do we only support add/sub/mul/div operations right now in Python because of limitations in this function? I know that other operations are implemented in libcudf, so piping that through might be a significant improvement.

galipremsagar (Contributor, Author) replied:

> Do we only support add/sub/mul/div operations right now in Python because of limitations in this function?

It's not just binary_operation_fixed_point_scale; I think the other binops simply aren't supported on the libcudf side either.

Looking into binary_operation_fixed_point_scale, it seems the formula for DIV is wrong? I could be wrong here, but it doesn't match what is specified here: https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql

Though libcudf doesn't take precision as input, the Python side still needs to compute it, so it's probably better to have those two computations in a single place rather than having to look in two places.

bdice (Contributor) commented Feb 18, 2022:

Support for other operators exists, e.g. MOD / PMOD / PYMOD: #10179.

I'm fine with keeping both precision/scale calculations together here. I just wanted to make a note to ask, since I saw the comment above.

There may or may not be issues with the scale/precision calculations. I think the page you referenced has different conventions than libcudf. In my understanding:

  • libcudf's scale represents powers of the radix (base 10 or base 2)
  • libcudf's precision (32, 64, 128) represents bits (powers of two) used to store the integral part

Neither value appears to correspond to the linked SQL docs. That page appears to always use powers of 10 for both scale and precision. Also the definition of scale is the negative of libcudf's definition. It does not surprise me that these different conventions would result in different expressions. I spent an hour looking into this but I have no idea how to make the two definitions mathematically correspond.
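
(Editorial illustration, not part of the original comment: in libcudf, a fixed_point value is stored as an integer significand scaled by a power of the radix, which is what "scale represents powers of the radix" means in practice.)

significand = 128   # the integer actually stored
scale = -2          # libcudf-style scale, in powers of 10
value = significand * 10 ** scale
print(value)        # 1.28, the quotient that appears in the worked example below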

Working through an example calculation here, for the SQL docs:

e1 = 4.096
p1 = 4
s1 = 3
e2 = 3.2
p2 = 2
s2 = 1
s = max(6, s1 + p2 + 1)
p = p1 - s1 + s2 + s
print(f"{e1/e2=}")  # e1/e2=1.28
print(f"{p=}, {s=}")  # p=8, s=6

I was confused and gave up at this point -- how could 1.28 have p=8, s=6?

galipremsagar (Contributor, Author) replied:

Thanks @bdice, I think @codereport would have a better understanding on this than me. But I'm merging these changes for now and we can have a follow-up PR if changes need to be done.

pass
else:
return min_decimal_type
if decimal_type.MAX_PRECISION >= lhs_rhs_max_precision:
A contributor commented:

It looks like this if is checking the same thing as the _validate method in the decimal dtype constructor. Is this unnecessarily duplicated? I'd fall back on the try and remove the if if possible.

bdice (Contributor) commented Feb 16, 2022:

I might be wrong here -- I see you're constructing the returned type with precision=precision instead of precision=max_precision. Would it be better to try and construct a type with max_precision and return a type with precision if that succeeds? (Or is that a bug -- should it be returning a type with max_precision?)

galipremsagar (Contributor, Author) replied Feb 16, 2022:

> I might be wrong here -- I see you're constructing the returned type with precision=precision instead of precision=max_precision. Would it be better to try and construct a type with max_precision and return a type with precision if that succeeds? (Or is that a bug -- should it be returning a type with max_precision?)

It's not a bug; the dtype is expected to have precision, not max_precision.

> It looks like this if is checking the same thing as the _validate method in the decimal dtype constructor. Is this unnecessarily duplicated? I'd fall back on the try and remove the if if possible.

This duplication is necessary because we want to pick a dtype that is not narrower than lhs_dtype or rhs_dtype, i.e., to avoid type narrowing.
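
(Editorial sketch, not the actual cuDF code: one reading of the selection logic being discussed; the function name and candidate ordering are assumptions.)

import cudf

def _pick_result_decimal_dtype(precision, scale, lhs_dtype, rhs_dtype):
    # Walk candidate dtypes from narrowest to widest, but never return one
    # narrower than either operand (that is the narrowing being avoided).
    lhs_rhs_max_precision = max(lhs_dtype.precision, rhs_dtype.precision)
    for decimal_type in (
        cudf.Decimal32Dtype,
        cudf.Decimal64Dtype,
        cudf.Decimal128Dtype,
    ):
        if decimal_type.MAX_PRECISION < lhs_rhs_max_precision:
            continue  # would be narrower than an operand; skip it
        try:
            # The dtype constructor's _validate raises ValueError when the
            # requested precision/scale do not fit this dtype.
            return decimal_type(precision=precision, scale=scale)
        except ValueError:
            continue  # try the next, wider dtype
    raise OverflowError("result precision exceeds Decimal128")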

Comment on lines 400 to 406
if decimal_type.MAX_PRECISION >= max_precision:
    try:
        return decimal_type(precision=precision, scale=scale)
    except ValueError:
        # Call to _validate fails, which means we need
        # to try the next dtype
        pass
A contributor commented:

For a problem larger than this, I would suggest something like bisect to determine the type corresponding to a certain precision, but I think this is fine.
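
(Editorial sketch of the bisect idea mentioned above; the names and the candidate list are hypothetical, not from the PR.)

from bisect import bisect_left
import cudf

_CANDIDATES = (cudf.Decimal32Dtype, cudf.Decimal64Dtype, cudf.Decimal128Dtype)
_MAX_PRECISIONS = [t.MAX_PRECISION for t in _CANDIDATES]  # e.g. [9, 18, 38]

def narrowest_dtype_class_for(precision):
    # bisect_left finds the first candidate whose MAX_PRECISION >= precision.
    idx = bisect_left(_MAX_PRECISIONS, precision)
    if idx == len(_CANDIDATES):
        raise OverflowError("precision exceeds Decimal128")
    return _CANDIDATES[idx]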

@galipremsagar requested a review from bdice on February 17, 2022 23:07
bdice (Contributor) left a review:
I have only a couple of minor suggestions. I shared a longer comment about libcudf's decimal conventions, but I'm not sure there's anything actionable there based on what I know.

@galipremsagar added the 5 - Ready to Merge label (Testing and reviews complete, ready to merge) on Feb 23, 2022
galipremsagar (Contributor, Author) commented:

@gpucibot merge

@rapids-bot (bot) merged commit 496f452 into rapidsai:branch-22.04 on Feb 23, 2022
Labels
  • 5 - Ready to Merge (Testing and reviews complete, ready to merge)
  • breaking (Breaking change)
  • improvement (Improvement / enhancement to an existing function)
  • Python (Affects Python cuDF API)

Projects: None yet

Development: Successfully merging this pull request may close these issues:
  • [BUG] Upcast (but never downcast) the result of a binary operation between decimals (Python)

3 participants