
[FEA] Mixed precision Decimal math support in cudf Python #7680

Open
randerzander opened this issue Mar 23, 2021 · 8 comments · Fixed by #7732 · May be fixed by #7859
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@randerzander
Contributor

randerzander commented Mar 23, 2021

Using recent cudf nightly conda package (0.19.0a+250.g8632ca0da3):

Int & Decimal Addition:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['dec_val'] + 1

Result:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-d4f1761193b6> in <module>
----> 1 df['dec_val'] + 1

/conda/lib/python3.8/site-packages/cudf/core/series.py in __add__(self, other)
   1600 
   1601     def __add__(self, other):
-> 1602         return self._binaryop(other, "add")
   1603 
   1604     def radd(self, other, fill_value=None, axis=0):

/conda/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/conda/lib/python3.8/site-packages/cudf/core/series.py in _binaryop(self, other, fn, fill_value, reflect, can_reindex)
   1515         else:
   1516             lhs, rhs = self, other
-> 1517         rhs = self._normalize_binop_value(rhs)
   1518 
   1519         if fn == "truediv":

/conda/lib/python3.8/site-packages/cudf/core/series.py in _normalize_binop_value(self, other)
   2307             return cudf.Scalar(other, dtype=self.dtype)
   2308         else:
-> 2309             return self._column.normalize_binop_value(other)
   2310 
   2311     def eq(self, other, fill_value=None, axis=0):

AttributeError: 'DecimalColumn' object has no attribute 'normalize_binop_value'

Workaround:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['ones'] = 1.00
df['dec_val'] + df['ones'].astype(Decimal64Dtype(7,0))
0    1.01
1    1.02
2    1.03
dtype: decimal

Decimal & Float Multiplication:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['val'] * df['dec_val']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-4680a31be74b> in <module>
----> 1 df['val'] * df['dec_val']

/conda/lib/python3.8/site-packages/cudf/core/series.py in __mul__(self, other)
   1799 
   1800     def __mul__(self, other):
-> 1801         return self._binaryop(other, "mul")
   1802 
   1803     def rmul(self, other, fill_value=None, axis=0):

/conda/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/conda/lib/python3.8/site-packages/cudf/core/series.py in _binaryop(self, other, fn, fill_value, reflect, can_reindex)
   1542                     rhs = rhs.fillna(fill_value)
   1543 
-> 1544         outcol = lhs._column.binary_operator(fn, rhs, reflect=reflect)
   1545         result = lhs._copy_construct(data=outcol, name=result_name)
   1546         return result

/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py in binary_operator(self, binop, rhs, reflect)
    108             ):
    109                 msg = "{!r} operator not supported between {} and {}"
--> 110                 raise TypeError(msg.format(binop, type(self), type(rhs)))
    111             out_dtype = np.result_type(self.dtype, rhs.dtype)
    112             if binop in ["mod", "floordiv"]:

TypeError: 'mul' operator not supported between <class 'cudf.core.column.numerical.NumericalColumn'> and <class 'cudf.core.column.decimal.DecimalColumn'>

Workaround:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['dec_val'] * df['val'].astype(Decimal64Dtype(7, 2))
0    0.0001
1    0.0004
2    0.0009
dtype: decimal
@randerzander randerzander added bug Something isn't working Python Affects Python cuDF API. labels Mar 23, 2021
@randerzander randerzander changed the title [BUG] Can't do Decimal math in cudf Python [FEA] Mixed precision Decimal math support in cudf Python Mar 23, 2021
@randerzander randerzander added feature request New feature or request and removed bug Something isn't working labels Mar 23, 2021
@brandon-b-miller
Contributor

Reopening this, as part of the request isn't implemented yet: Decimal vs. int binary ops.

@brandon-b-miller
Contributor

@randerzander PR #7859 should close the second part of this. However, I think we decided we can't do decimal<->float, since we'd need to do implicit rounding on the user's behalf.

That said, since integers are exact numbers, I went ahead and added decimal<->int ops.
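To illustrate the distinction drawn above, here is a minimal sketch using Python's standard `decimal` module (not cuDF): binary floats carry representation error, so converting them to decimal silently changes the value unless some rounding policy is applied, whereas integer conversion is lossless.

```python
from decimal import Decimal

# float 0.1 carries binary rounding error, so converting it to decimal
# without an explicit rounding step changes the value the user wrote
assert Decimal(0.1) != Decimal("0.1")

# integers are exact, so int -> decimal loses nothing
assert Decimal(10**18) == Decimal("1000000000000000000")
```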

@randerzander
Contributor Author

@brandon-b-miller I understand the concern about users not realizing rounding happens in an implicit cast, but it would be nice to allow configurable implicit cast behavior.

For what it's worth, Spark automatically converts (somewhat surprisingly) to a Double:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.createDataFrame(
    [
        (1, 1.0), # create your data here, be consistent in the types.
        (2, 2.0),
    ],
    ['id', 'doubleCol'] # add your columns label here
)

df = df.withColumn('floatCol', df['doubleCol'].cast('float'))
df = df.withColumn('decCol', df['doubleCol'].cast('decimal(7,2)'))

df.withColumn('mixedResult', df['floatCol'] + df['decCol']).schema
StructType(List(StructField(id,LongType,true),StructField(doubleCol,DoubleType,true),StructField(floatCol,FloatType,true),StructField(decCol,DecimalType(7,2),true),StructField(mixedResult,DoubleType,true)))

@brandon-b-miller
Contributor

@randerzander thank you for that example. As long as there's some authoritative source for what the rules should be, I'd be comfortable adopting that standard and allowing the behavior. Let me dig into Spark a bit and figure out where it derives its casting rules, then follow up here.

@brandon-b-miller
Contributor

brandon-b-miller commented Apr 16, 2021

So ideally we'd like to follow Spark's rules here, because they seem fairly robust and because we try to avoid inventing our own casting rules. Here are those rules (github link):

 * In addition, when mixing non-decimal types with decimals, we use the following rules:
 * - BYTE gets turned into DECIMAL(3, 0)
 * - SHORT gets turned into DECIMAL(5, 0)
 * - INT gets turned into DECIMAL(10, 0)
 * - LONG gets turned into DECIMAL(20, 0)
 * - FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE
 * - Literals INT and LONG get turned into DECIMAL with the precision strictly needed by the value

This basically means that when performing a binary op between a decimal column and an integer column, the integer column is cast to decimal with precision p, where p is the number of digits in the maximum representable value of the integer column's dtype.
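The digit-count rule can be sketched as below; `required_precision` is a hypothetical helper, not a cuDF or Spark API. Note that Spark's table uses DECIMAL(20, 0) for LONG, one more than the 19 digits of the signed int64 maximum.

```python
import numpy as np

def required_precision(int_dtype):
    # hypothetical helper: digits in the dtype's maximum representable value
    return len(str(np.iinfo(int_dtype).max))

assert required_precision(np.int8) == 3    # BYTE  -> DECIMAL(3, 0)
assert required_precision(np.int16) == 5   # SHORT -> DECIMAL(5, 0)
assert required_precision(np.int32) == 10  # INT   -> DECIMAL(10, 0)
assert required_precision(np.int64) == 19  # int64 max has 19 digits
```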

The problem is that we only support precision up to 18, while the maximum int64 value, 9223372036854775807, contains 19 digits. Technically that number itself is representable as a decimal in libcudf, but a Decimal64Dtype(19, 0) isn't valid in cuDF Python, because it implies that any 19-digit number can be represented.

This leaves us at a bit of an impasse: if we try to cast an int64 column to decimal the way the Spark rules specify, we run into our own constraint. Spark fundamentally doesn't have this problem because it supports 38 digits of precision. The problem gets worse when we consider the precision rules for decimal arithmetic ops. Most ops produce a result with higher precision than the inputs, so even if we elected to use precision 18 for int64, since it's technically safe, there isn't much we could do with the result, because its precision wouldn't be representable in cuDF.

There seem to be 3 options:

  1. Use spark's rules and disable ops involving one decimal and one int64 or uint64 column
  2. Don't use spark's rules, invent our own, possibly scan the data and cast to the minimum required precision
  3. Wait for 128 bit types

None of these seem like great options to me, but I am open to opinions.
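Option 2 could look roughly like the sketch below: scan the integer column and cast to the minimum precision its actual values need. `min_precision` is an illustrative helper, not an existing cuDF function, and a real implementation would do the reduction on the GPU rather than in Python.

```python
import numpy as np

def min_precision(values):
    # hypothetical helper: digits needed for the largest magnitude present
    max_abs = max(abs(int(v)) for v in values)
    return max(len(str(max_abs)), 1)

col = np.array([12, 3456, -78], dtype=np.int64)
assert min_precision(col) == 4  # 3456 needs 4 digits
```

The trade-off is an extra pass over the data before every mixed-type op, and a result dtype that depends on the data rather than on the input dtypes alone.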

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Contributor

vyasr commented Jul 13, 2022

@shwina @isVoid this seems like another potential candidate for #11193
