
[FEA] Mixed precision Decimal math support in cudf Python #7680

Open
randerzander opened this issue Mar 23, 2021 · 8 comments · Fixed by #7732 · May be fixed by #7859
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@randerzander
Contributor

randerzander commented Mar 23, 2021

Using recent cudf nightly conda package (0.19.0a+250.g8632ca0da3):

Int & Decimal Addition:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['dec_val'] + 1

Result:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-d4f1761193b6> in <module>
----> 1 df['dec_val'] + 1

/conda/lib/python3.8/site-packages/cudf/core/series.py in __add__(self, other)
   1600 
   1601     def __add__(self, other):
-> 1602         return self._binaryop(other, "add")
   1603 
   1604     def radd(self, other, fill_value=None, axis=0):

/conda/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/conda/lib/python3.8/site-packages/cudf/core/series.py in _binaryop(self, other, fn, fill_value, reflect, can_reindex)
   1515         else:
   1516             lhs, rhs = self, other
-> 1517         rhs = self._normalize_binop_value(rhs)
   1518 
   1519         if fn == "truediv":

/conda/lib/python3.8/site-packages/cudf/core/series.py in _normalize_binop_value(self, other)
   2307             return cudf.Scalar(other, dtype=self.dtype)
   2308         else:
-> 2309             return self._column.normalize_binop_value(other)
   2310 
   2311     def eq(self, other, fill_value=None, axis=0):

AttributeError: 'DecimalColumn' object has no attribute 'normalize_binop_value'

Workaround:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['ones'] = 1.00
df['dec_val'] + df['ones'].astype(Decimal64Dtype(7,0))
0    1.01
1    1.02
2    1.03
dtype: decimal

Decimal & Float Multiplication:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['val'] * df['dec_val']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-4680a31be74b> in <module>
----> 1 df['val'] * df['dec_val']

/conda/lib/python3.8/site-packages/cudf/core/series.py in __mul__(self, other)
   1799 
   1800     def __mul__(self, other):
-> 1801         return self._binaryop(other, "mul")
   1802 
   1803     def rmul(self, other, fill_value=None, axis=0):

/conda/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/conda/lib/python3.8/site-packages/cudf/core/series.py in _binaryop(self, other, fn, fill_value, reflect, can_reindex)
   1542                     rhs = rhs.fillna(fill_value)
   1543 
-> 1544         outcol = lhs._column.binary_operator(fn, rhs, reflect=reflect)
   1545         result = lhs._copy_construct(data=outcol, name=result_name)
   1546         return result

/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py in binary_operator(self, binop, rhs, reflect)
    108             ):
    109                 msg = "{!r} operator not supported between {} and {}"
--> 110                 raise TypeError(msg.format(binop, type(self), type(rhs)))
    111             out_dtype = np.result_type(self.dtype, rhs.dtype)
    112             if binop in ["mod", "floordiv"]:

TypeError: 'mul' operator not supported between <class 'cudf.core.column.numerical.NumericalColumn'> and <class 'cudf.core.column.decimal.DecimalColumn'>

Workaround:

import cudf
from cudf.core.dtypes import Decimal64Dtype

df = cudf.DataFrame({'val': [0.01, 0.02, 0.03]})

df['dec_val'] = df['val'].astype(Decimal64Dtype(7,2))
df['dec_val'] * df['val'].astype(Decimal64Dtype(7, 2))
0    0.0001
1    0.0004
2    0.0009
dtype: decimal
@randerzander randerzander added bug Something isn't working Python Affects Python cuDF API. labels Mar 23, 2021
@randerzander randerzander changed the title [BUG] Can't do Decimal math in cudf Python [FEA] Mixed precision Decimal math support in cudf Python Mar 23, 2021
@randerzander randerzander added feature request New feature or request and removed bug Something isn't working labels Mar 23, 2021
@brandon-b-miller
Contributor

Reopening this, as part of the request isn't implemented yet: Decimal vs. int binary ops.

@brandon-b-miller
Contributor

@randerzander PR #7859 should close the second part of this. However, I think we decided we can't do decimal<->float, since we'd need to do implicit rounding on the user's behalf.

That said, since integers are exact numbers, I went ahead and added decimal<->int ops.
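To illustrate the distinction drawn above, here is a minimal sketch using Python's standard `decimal` module (not cuDF): binary floats carry representation error, so converting them to decimal silently changes the value unless some rounding policy is applied, whereas integer conversion is lossless.

```python
from decimal import Decimal

# float 0.1 carries binary rounding error, so converting it to decimal
# without an explicit rounding step changes the value the user wrote
assert Decimal(0.1) != Decimal("0.1")

# integers are exact, so int -> decimal loses nothing
assert Decimal(10**18) == Decimal("1000000000000000000")
```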

@randerzander
Contributor Author

@brandon-b-miller I understand the concern about users not realizing rounding happens in an implicit cast, but it would be nice to allow configurable implicit cast behavior.

For what it's worth, Spark automatically converts (somewhat surprisingly) to a Double:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.createDataFrame(
    [
        (1, 1.0), # create your data here, be consistent in the types.
        (2, 2.0),
    ],
    ['id', 'doubleCol'] # add your columns label here
)

df = df.withColumn('floatCol', df['doubleCol'].cast('float'))
df = df.withColumn('decCol', df['doubleCol'].cast('decimal(7,2)'))

df.withColumn('mixedResult', df['floatCol'] + df['decCol']).schema
StructType(List(StructField(id,LongType,true),StructField(doubleCol,DoubleType,true),StructField(floatCol,FloatType,true),StructField(decCol,DecimalType(7,2),true),StructField(mixedResult,DoubleType,true)))

@brandon-b-miller
Contributor

@randerzander thank you for that example. As long as there's some authoritative source for what the rules should be, I'd be comfortable adopting that standard and allowing the behavior. Let me dig into Spark a bit and figure out where it derives its casting rules, then follow up here.

@brandon-b-miller
Contributor

brandon-b-miller commented Apr 16, 2021

So ideally we'd like to follow Spark's rules here, because they seem fairly robust and because we try to avoid inventing our own casting rules. Here are those rules (github link):

 * In addition, when mixing non-decimal types with decimals, we use the following rules:
 * - BYTE gets turned into DECIMAL(3, 0)
 * - SHORT gets turned into DECIMAL(5, 0)
 * - INT gets turned into DECIMAL(10, 0)
 * - LONG gets turned into DECIMAL(20, 0)
 * - FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE
 * - Literals INT and LONG get turned into DECIMAL with the precision strictly needed by the value

This basically means that when performing a binary op between a decimal column and an integer column, the integer column is cast to decimal with precision p, where p is the number of digits in the maximum representable value of the integer column's dtype.
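The digit-count rule can be sketched as below; `required_precision` is a hypothetical helper, not a cuDF or Spark API. Note that Spark's table uses DECIMAL(20, 0) for LONG, one more than the 19 digits of the signed int64 maximum.

```python
import numpy as np

def required_precision(int_dtype):
    # hypothetical helper: digits in the dtype's maximum representable value
    return len(str(np.iinfo(int_dtype).max))

assert required_precision(np.int8) == 3    # BYTE  -> DECIMAL(3, 0)
assert required_precision(np.int16) == 5   # SHORT -> DECIMAL(5, 0)
assert required_precision(np.int32) == 10  # INT   -> DECIMAL(10, 0)
assert required_precision(np.int64) == 19  # int64 max has 19 digits
```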

The problem is that we only support precision up to 18, while the maximum int64 value, 9223372036854775807, contains 19 digits. Technically that number itself is representable as a decimal in libcudf, but a Decimal64Dtype(19, 0) isn't valid in cuDF Python, because it implies that any 19-digit number can be represented.

This leaves us at a bit of an impasse: if we try to cast an int64 column to decimal the way the Spark rules specify, we run into our own constraint. Spark fundamentally doesn't have this problem because it supports 38 digits of precision. The problem gets worse when we consider the precision rules for decimal arithmetic ops. Most ops produce a result with higher precision than the inputs, so even if we elected to use precision 18 for int64, since it's technically safe, there isn't much we could do with the result, because its precision wouldn't be representable in cuDF.

There seem to be 3 options:

  1. Use spark's rules and disable ops involving one decimal and one int64 or uint64 column
  2. Don't use spark's rules, invent our own, possibly scan the data and cast to the minimum required precision
  3. Wait for 128 bit types

None of these seem like great options to me, but I am open to opinions.
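Option 2 could look roughly like the sketch below: scan the integer column and cast to the minimum precision its actual values need. `min_precision` is an illustrative helper, not an existing cuDF function, and a real implementation would do the reduction on the GPU rather than in Python.

```python
import numpy as np

def min_precision(values):
    # hypothetical helper: digits needed for the largest magnitude present
    max_abs = max(abs(int(v)) for v in values)
    return max(len(str(max_abs)), 1)

col = np.array([12, 3456, -78], dtype=np.int64)
assert min_precision(col) == 4  # 3456 needs 4 digits
```

The trade-off is an extra pass over the data before every mixed-type op, and a result dtype that depends on the data rather than on the input dtypes alone.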

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Contributor

vyasr commented Jul 13, 2022

@shwina @isVoid this seems like another potential candidate for #11193
