
[QST] Output Type of DecimalType Binary Operation #10230

Closed
codereport opened this issue Feb 7, 2022 · 20 comments
Assignees: codereport
Labels: question (Further information is requested)

Comments

@codereport (Contributor)

@GregoryKimball was recently trying out the cuDF Decimal128Dtype and ran into an issue:

In [1]: import cudf
In [2]: from cudf.core.dtypes import Decimal128Dtype
In [3]: t1 = Decimal128Dtype(precision=10, scale=2)
In [4]: df = cudf.DataFrame()
In [5]: df["fixed"] = cudf.Series([1]).astype(t1)
In [6]: df["ep_d"] = cudf.Series([0.1]).astype(t1)
In [7]: df["fixed"] + df["ep_d"]
Out[7]: 
0    1.10
dtype: decimal64

The resulting type of a binary operation adding two Decimal128Dtype columns was a Decimal64Dtype column. I looked into it, and it seems we always set the output type to be the "minimum" width that works. See this line.

def _get_decimal_type(lhs_dtype, rhs_dtype, op):
    # ...
    for decimal_type in (
        cudf.Decimal32Dtype,
        cudf.Decimal64Dtype,
        cudf.Decimal128Dtype,
    ):
        try:
            min_decimal_type = decimal_type(precision=precision, scale=scale)
        except ValueError:
            # Call to _validate fails, which means we need
            # to try the next dtype 
            pass
        else:
            return min_decimal_type

    raise OverflowError("Maximum supported decimal type is Decimal128")
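
For context, here is a minimal, self-contained sketch (not the cudf implementation) of that "narrowest dtype that fits" selection, using the 9/18/38-digit precision limits of Decimal32/64/128:

MAX_PRECISION = {"decimal32": 9, "decimal64": 18, "decimal128": 38}

def pick_min_decimal_type(precision):
    # Walk the decimal widths from narrowest to widest and return the
    # first one whose precision limit can hold the requested precision.
    for name, max_precision in MAX_PRECISION.items():
        if precision <= max_precision:
            return name
    raise OverflowError("Maximum supported decimal type is Decimal128")

# Adding two (precision=10, scale=2) columns yields precision 11, which
# fits in Decimal64 -- hence the narrowing observed in the example above.
print(pick_min_decimal_type(11))  # decimal64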

Is this intentional? I would expect it to be the same as the input columns. I can set up a PR to fix this, but I just wanted to check whether there was motivation behind it.

@galipremsagar (you worked on the recent PR that introduced this #9533)
@shwina (you worked on the initial support for DecimalTypes with me)

@codereport added the question (Further information is requested) and Needs Triage (Need team to review and classify) labels Feb 7, 2022
@codereport self-assigned this Feb 7, 2022
@galipremsagar (Contributor) commented Feb 7, 2022

This was intentional: wherever we have control over changing the dtype, i.e., during binops of decimal types, we can get a new precision & scale (which can result in a new dtype, xref):

>>> (df["fixed"] + df["ep_d"]).dtype
Decimal64Dtype(precision=11, scale=2)
>>> df["fixed"].dtype
Decimal128Dtype(precision=10, scale=2)
>>> df["ep_d"].dtype
Decimal128Dtype(precision=10, scale=2)

The motivation for returning a dtype that can accommodate the resulting precision & scale was to not use extra memory here, since the precision and scale fit comfortably in a Decimal64Column instead of a Decimal128Column. But I am open to input here if we want to return a Decimal128Column.

cc: @beckernick
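
For a rough sense of the memory at stake (back-of-the-envelope arithmetic only, using the 8-byte and 16-byte per-value storage sizes of Decimal64 and Decimal128):

# Illustrative column sizes for a hypothetical 100M-row column.
n_rows = 100_000_000
print(f"decimal64 : {n_rows * 8 / 1e9:.1f} GB")   # 0.8 GB
print(f"decimal128: {n_rows * 16 / 1e9:.1f} GB")  # 1.6 GB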

@beckernick (Member)

In general, reducing memory is a worthwhile goal as out of memory errors are one of the most common failure modes. Beyond potentially surprising some users familiar with Decimal operations in engines like Spark, Hive, etc., do you think this poses any correctness concern @codereport @galipremsagar @GregoryKimball ?

@codereport (Contributor, Author)

From my POV, it doesn't pose any correctness concerns. I think it is just a bit surprising. If you have two columns A and B that are both Decimal128Dtype, you add them and get column C. You now might have to cast column C back to 128 if it came out as 32 or 64, and I would just expect it to still be 128. I guess an even odder example: if you have column A and you multiply it by the scalar 1, you might end up getting back a column with a "different type".

There is nothing wrong with the above; it could just be a bit confusing. As long as we know this is what we want, I am fine with it.

@beckernick (Member) commented Feb 7, 2022

Minimizing surprise is valuable, so I'd be interested to hear @shwina's thoughts on the surprise vs. reduced memory trade-off here for PyData users.

@galipremsagar (Contributor) commented Feb 7, 2022

On the Spark side, there is just one DecimalType with precision ranging from 0-128. Should we do the same and keep the Decimal32/64/128Column semantics as an internal implementation detail? Doing this would be a potential breaking change and would probably deviate from pyarrow's soon-to-come decimal32/64 arrays.

Edit: Keeping pyarrow in mind, this might not be a favorable approach.

@bdice (Contributor) commented Feb 8, 2022

Is the opposite true as well? Do sums of 64 bit types overflow into 128 bit types? Regardless of the answer, I’d favor preserving dtype, but if overflow does not cast to a wider type then we should definitely avoid casting to a narrower type for correctness.

Secondly, how is this done? This behavior seems like it must induce an extra cast (unnecessary copying?) if it chooses to change types - I would guess that we do not know the type that will fit the results until the results have been computed. Is a temporary array allocated for the wider type and then checked to see if a narrower type is sufficient, and performing a cast-and-copy if so? That approach with the larger temporary array wouldn’t help as much with OOM concerns and seems needlessly costly, but perhaps we have a better implementation.

@galipremsagar (Contributor) commented Feb 8, 2022

Is the opposite true as well? Do sums of 64 bit types overflow into 128 bit types? Regardless of the answer, I’d favor preserving dtype, but if overflow does not cast to a wider type then we should definitely avoid casting to a narrower type for correctness.

The sum of two 64-bit types could result in a 128-bit type, but not always; it is purely based on precision & scale: https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
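
For example (hypothetical values, following the behavior described in this thread): adding two Decimal64 columns already at the maximum Decimal64 precision of 18 produces a result precision of 19, which only Decimal128 can hold:

import cudf

t = cudf.Decimal64Dtype(precision=18, scale=2)
a = cudf.Series([1.23]).astype(t)
b = cudf.Series([4.56]).astype(t)

# Result precision is max(p1 - s1, p2 - s2) + max(s1, s2) + 1 = 19 > 18,
# so the output should widen to Decimal128.
print((a + b).dtype)  # expected: Decimal128Dtype(precision=19, scale=2)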

Secondly, how is this done? This behavior seems like it must induce an extra cast (unnecessary copying?) if it chooses to change types - I would guess that we do not know the type that will fit the results until the results have been computed. Is a temporary array allocated for the wider type and then checked to see if a narrower type is sufficient, and performing a cast-and-copy if so? That approach with the larger temporary array wouldn’t help as much with OOM concerns and seems needlessly costly, but perhaps we have a better implementation.

I can't speak to what is happening inside the libcudf layer, but from the Cython side we are required to pass the expected output type beforehand here: https://github.com/rapidsai/cudf/blob/branch-22.04/python/cudf/cudf/_lib/binaryop.pyx#L159

@revans2 (Contributor) commented Feb 8, 2022

Spark plans the input and output types of all operations ahead of time when preparing the query plan. It has rules about how each operator affects the output type. Generally these rules assume the worst-case situation and do not take the data itself into account. For example, for the add operator the output precision is

max(p1 - s1, p2 - s2) + max(s1, s2) + 1

where p1 and p2 are the lhs and rhs precisions, and s1 and s2 are the lhs and rhs scales of the input operands (note that scale in Spark is the negative of the cudf scale).

But if I were adding 1.0 + 1.0 where the precision is 10 and the scale is 1, the output data type would have a precision of 11, no matter what the actual data in those columns holds. So yes, in Spark we end up with a lot of casting to wider types and processing data in types that can cover the worst possible situation.
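
As a quick worked example of that rule (illustrative arithmetic, not Spark code):

# Spark-style add rule: precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
p1 = p2 = 10  # input precisions
s1 = s2 = 1   # input scales (Spark convention, i.e. the negated cudf scale)
result_precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
result_scale = max(s1, s2)
print(result_precision, result_scale)  # 11 1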

Overflow checking tends to happen after the operation is done, and is really only an issue when the normal rule would produce a precision that is too large for a DECIMAL_128 to hold (precision of 38). Spark has somewhat arbitrary and configurable rules about what the output precision/scale should be if the rules would push the precision above 38.

I can get into all of the ugly details of this if you want, but it is not pretty. If cuDF could provide APIs, for at least some operations, that let us pick the smallest data type needed for the output of a decimal operation without overflowing, based on the data, it would be interesting and something we would look into using. However, it would be a significant amount of work for us to adopt because of things like joins, where we need both sides to have the same type.

@vyasr (Contributor) commented Feb 8, 2022

There's some discussion of how similar rules could be defined in the Python layer here.

@revans2 (Contributor) commented Feb 8, 2022

I am happy to help with that if anyone has questions about what Spark is doing; I have been up to my eyeballs in it for quite a while. Some of the odd parts of Spark are with divide, where for whatever reason they assume you want a scale of at least 6, and with the SUM aggregation, where they arbitrarily decide that the output precision should be 10 more than the input precision.
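
For reference, a sketch of those two rules as commonly documented for Hive/Spark (illustrative only, not cudf code; the cap/adjustment applied when precision would exceed 38 is omitted):

def spark_divide_type(p1, s1, p2, s2):
    # Division assumes you want a scale of at least 6.
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale

def spark_sum_type(p, s):
    # The SUM aggregation bumps the output precision by 10.
    return p + 10, s

print(spark_divide_type(10, 2, 10, 2))  # (23, 13)
print(spark_sum_type(10, 2))            # (20, 2)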

@vyasr (Contributor) commented Feb 8, 2022

I think that would be a discussion to have with @brandon-b-miller who spearheaded a lot of that work, and perhaps @randerzander regarding what our Python users would expect.

@brandon-b-miller (Contributor)

I need some time to remember exactly what all the caveats were here. But for what it's worth, historically we've been open to adopting Spark's rules for these decimal edge cases directly, since pandas does not support them and we'd rather not invent a new set of rules.

@shwina (Contributor) commented Feb 8, 2022

FWIW, my 2c here is that the Decimal128 type should be preserved. Similarly, summing two Decimal32s should give you back a Decimal32. There's no way to tell what might happen downstream of the binop that might need the 128 bits of precision.

@vyasr (Contributor) commented Feb 8, 2022

My 2c: you should never implicitly narrow the data type since there's no way to know if that's safe without some level of data introspection. Anything else is surprising, and it could even lead to users making unsafe assumptions later. Conversely, I think widening should follow rules similar to what @revans2 mentions above since that may be necessary for safety. Both of these conventions seem important enough to eat the memory costs.

@revans2 (Contributor) commented Feb 9, 2022

From our experience it is not just a memory cost; there are very real computation costs involved with 128 bits vs 64 bits. For multiply and divide (at least on an A6000), the 128-bit operations are compute bound, whereas the 64-bit binary operations are bound by memory bandwidth.

When doing aggregates we recently put together a feature/hack that lets us do SUM aggregations as four 64-bit operations instead of a single 128-bit operation, because the 128-bit version required sorting (no atomics on 128 bits). This gave us an order-of-magnitude speedup in some TPC-DS queries. Spark also has to do some overflow checks, and we were doing that with a divide operator that in some cases took nearly as long as the sum aggregation itself (without the sort), so the speedup we were seeing also involved removing that expensive operation.

In fact this gave me an idea and I filed another "Bobby had a crazy idea on how to speed up Decimal" issue NVIDIA/spark-rapids#4729

@codereport (Contributor, Author)

It doesn't really seem like there is a consensus at this point. I will add, though, that in reviewing the 22.02 release blog, it has a decimal128 example that shows addition narrowing to a decimal32. As a C++ dev, that just seems very odd: it is the equivalent of adding two std::vector<__int128_t> and getting back a std::vector<int32_t>. But maybe that is just my bias coming from the statically typed world of C++.

[screenshot: decimal128 addition example from the 22.02 release blog]

@shwina (Contributor) commented Feb 14, 2022

At least on the Python side, I think we should never change the dtype when doing binops on two inputs with identical dtypes. If this means overflow, then so be it. That is what pandas does for integer types, for example:

In [13]: s = pd.Series([np.iinfo("int32").max], dtype="int32")

In [14]: s
Out[14]:
0    2147483647
dtype: int32

In [15]: s + s
Out[15]:
0   -2
dtype: int32

@galipremsagar - does that sound OK to you? I'll raise an issue if it does.

@galipremsagar (Contributor) commented Feb 14, 2022

I will add, though, that in reviewing the 22.02 release blog, it has a decimal128 example that shows addition narrowing to a decimal32.

I agree that it looks odd to see the dtype being narrowed to decimal32.

At least on the Python side, I think we should never change the dtype when doing binops on two inputs with identical dtypes.

Sounds okay to me, but I think always doing so would defeat the purpose of having decimal types and would force users to upcast their real-world decimal32/decimal64 data to decimal128 just to get a decimal128 result. Also, there are precision & scale calculation rules for decimal types that don't exist for int-like types. Instead, we could do the following (a rough sketch follows the list):

Let's assume lhs & rhs are two entities with a binop being performed between them.

  1. If both lhs & rhs have the same type:
    i. Never narrow the result dtype if the resulting precision & scale of the binop can fit in that same type.
    ii. Upcast the result dtype if the resulting precision & scale of the binop can't fit in that type; this adheres to the precision & scale calculation rules.

  2. If lhs & rhs have different types:
    i. The result dtype will be the wider of the lhs and rhs dtypes, provided it can fit the resulting precision & scale.
    ii. If neither the lhs nor the rhs dtype can accommodate the precision & scale, upcast based on the regular rules.

I feel this should strike a balance by avoiding the odd narrowing behavior while still adhering to the resulting precision & scale calculations in binops.
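
A rough sketch of those rules (assuming 9/18/38-digit precision limits for Decimal32/64/128; names and structure here are illustrative, not an actual implementation):

WIDTHS = ["decimal32", "decimal64", "decimal128"]
MAX_PRECISION = {"decimal32": 9, "decimal64": 18, "decimal128": 38}

def result_width(lhs_width, rhs_width, result_precision):
    # Rules 1.i / 2.i: never go below the wider of the two input widths.
    start = max(WIDTHS.index(lhs_width), WIDTHS.index(rhs_width))
    for name in WIDTHS[start:]:
        # Rules 1.ii / 2.ii: upcast only when the result precision does not fit.
        if result_precision <= MAX_PRECISION[name]:
            return name
    raise OverflowError("result precision exceeds Decimal128")

# Two Decimal128 inputs stay Decimal128 even though precision 11 would fit in Decimal32:
print(result_width("decimal128", "decimal128", 11))  # decimal128
# Decimal32 + Decimal64 with result precision 19 upcasts to Decimal128:
print(result_width("decimal32", "decimal64", 19))    # decimal128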

@shwina (Contributor) commented Feb 14, 2022

Thanks @galipremsagar, I think that's fair. I opened #10282 as a bug to track this on the Python side.

@codereport (Contributor, Author)

Thanks @shwina @galipremsagar and everyone that weighed in. I will close this now that there is another issue open.

@bdice removed the Needs Triage (Need team to review and classify) label Mar 4, 2024