
[QST] Output Type of DecimalType Binary Operation #10230

Closed
codereport opened this issue Feb 7, 2022 · 20 comments
Assignees: codereport
Labels: question (Further information is requested)

Comments

@codereport (Contributor)

@GregoryKimball was recently trying out the cuDF Decimal128Dtype and ran into an issue:

In [1]: import cudf
In [2]: from cudf.core.dtypes import Decimal128Dtype
In [3]: t1 = Decimal128Dtype(precision=10, scale=2)
In [4]: df = cudf.DataFrame()
In [5]: df["fixed"] = cudf.Series([1]).astype(t1)
In [6]: df["ep_d"] = cudf.Series([0.1]).astype(t1)
In [7]: df["fixed"] + df["ep_d"]
Out[7]: 
0    1.10
dtype: decimal64

The resulting type of a binary operation adding two Decimal128Dtype columns was a Decimal64Dtype column. I looked into it, and it seems we always set the output type to be the "minimum" width that works. See this line.

def _get_decimal_type(lhs_dtype, rhs_dtype, op):
    # ...
    for decimal_type in (
        cudf.Decimal32Dtype,
        cudf.Decimal64Dtype,
        cudf.Decimal128Dtype,
    ):
        try:
            min_decimal_type = decimal_type(precision=precision, scale=scale)
        except ValueError:
            # Call to _validate fails, which means we need
            # to try the next dtype 
            pass
        else:
            return min_decimal_type

    raise OverflowError("Maximum supported decimal type is Decimal128")
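
For context, here is a minimal, self-contained sketch (not the cudf implementation) of that "narrowest dtype that fits" selection, using the 9/18/38-digit precision limits of Decimal32/64/128:

MAX_PRECISION = {"decimal32": 9, "decimal64": 18, "decimal128": 38}

def pick_min_decimal_type(precision):
    # Walk the decimal widths from narrowest to widest and return the
    # first one whose precision limit can hold the requested precision.
    for name, max_precision in MAX_PRECISION.items():
        if precision <= max_precision:
            return name
    raise OverflowError("Maximum supported decimal type is Decimal128")

# Adding two (precision=10, scale=2) columns yields precision 11, which
# fits in Decimal64 -- hence the narrowing observed in the example above.
print(pick_min_decimal_type(11))  # decimal64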

Is this intentional? I would expect it to be the same as the input columns. I can set up a PR to fix this, but I just wanted to check whether there was motivation behind it.

@galipremsagar (you worked on the recent PR that introduced this #9533)
@shwina (you worked on the initial support for DecimalTypes with me)

@codereport added the question (Further information is requested) and Needs Triage (Need team to review and classify) labels Feb 7, 2022
@codereport self-assigned this Feb 7, 2022
@galipremsagar (Contributor) commented Feb 7, 2022

This was intentional: wherever we have control over changing the dtype, i.e., during binops of decimal types, we can get a new precision & scale (which can result in a new dtype, xref):

>>> (df["fixed"] + df["ep_d"]).dtype
Decimal64Dtype(precision=11, scale=2)
>>> df["fixed"].dtype
Decimal128Dtype(precision=10, scale=2)
>>> df["ep_d"].dtype
Decimal128Dtype(precision=10, scale=2)

The motivation for returning a dtype that can accommodate the resulting precision & scale was to not use extra memory here, since the precision and scale fit comfortably in a Decimal64Column instead of a Decimal128Column. But I am open to input here if we want to return a Decimal128Column.

cc: @beckernick
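
For a rough sense of the memory at stake (back-of-the-envelope arithmetic only, using the 8-byte and 16-byte per-value storage sizes of Decimal64 and Decimal128):

# Illustrative column sizes for a hypothetical 100M-row column.
n_rows = 100_000_000
print(f"decimal64 : {n_rows * 8 / 1e9:.1f} GB")   # 0.8 GB
print(f"decimal128: {n_rows * 16 / 1e9:.1f} GB")  # 1.6 GB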

@beckernick (Member)

In general, reducing memory is a worthwhile goal as out of memory errors are one of the most common failure modes. Beyond potentially surprising some users familiar with Decimal operations in engines like Spark, Hive, etc., do you think this poses any correctness concern @codereport @galipremsagar @GregoryKimball ?

@codereport (Contributor, Author)

From my POV, it doesn't pose any correctness concerns. I think it is just a bit surprising. If you have two columns A and B that are both Decimal128Dtype, you add them and get column C. You now might have to cast column C back to 128 if it came out as 32 or 64, and I would just expect it to still be 128. I guess an even odder example: if you have column A and you multiply it by the scalar 1, you might end up getting back a column with a "different type".

There is nothing wrong with the above; it could just be a bit confusing. As long as we know this is what we want, I am fine with it.

@beckernick (Member) commented Feb 7, 2022

Minimizing surprise is valuable, so I'd be interested to hear @shwina's thoughts on the surprise vs. reduced memory trade-off here for PyData users.

@galipremsagar (Contributor) commented Feb 7, 2022

On the Spark side, there is just one DecimalType with precision ranging from 0-128. Should we do the same and keep the Decimal32/64/128Column semantics as an internal implementation detail? Doing this would be a potential breaking change and would probably deviate from pyarrow's soon-to-come decimal32/64 arrays.

Edit: Keeping pyarrow in mind, this might not be a favorable approach.

@bdice (Contributor) commented Feb 8, 2022

Is the opposite true as well? Do sums of 64 bit types overflow into 128 bit types? Regardless of the answer, I’d favor preserving dtype, but if overflow does not cast to a wider type then we should definitely avoid casting to a narrower type for correctness.

Secondly, how is this done? This behavior seems like it must induce an extra cast (unnecessary copying?) if it chooses to change types - I would guess that we do not know the type that will fit the results until the results have been computed. Is a temporary array allocated for the wider type and then checked to see if a narrower type is sufficient, and performing a cast-and-copy if so? That approach with the larger temporary array wouldn’t help as much with OOM concerns and seems needlessly costly, but perhaps we have a better implementation.

@galipremsagar (Contributor) commented Feb 8, 2022

Is the opposite true as well? Do sums of 64 bit types overflow into 128 bit types? Regardless of the answer, I’d favor preserving dtype, but if overflow does not cast to a wider type then we should definitely avoid casting to a narrower type for correctness.

The sum of two 64-bit types could result in a 128-bit type, but not always; it is purely based on precision & scale: https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
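
For example (hypothetical values, following the behavior described in this thread): adding two Decimal64 columns already at the maximum Decimal64 precision of 18 produces a result precision of 19, which only Decimal128 can hold:

import cudf

t = cudf.Decimal64Dtype(precision=18, scale=2)
a = cudf.Series([1.23]).astype(t)
b = cudf.Series([4.56]).astype(t)

# Result precision is max(p1 - s1, p2 - s2) + max(s1, s2) + 1 = 19 > 18,
# so the output should widen to Decimal128.
print((a + b).dtype)  # expected: Decimal128Dtype(precision=19, scale=2)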

Secondly, how is this done? This behavior seems like it must induce an extra cast (unnecessary copying?) if it chooses to change types - I would guess that we do not know the type that will fit the results until the results have been computed. Is a temporary array allocated for the wider type and then checked to see if a narrower type is sufficient, and performing a cast-and-copy if so? That approach with the larger temporary array wouldn’t help as much with OOM concerns and seems needlessly costly, but perhaps we have a better implementation.

I can't speak to what is happening inside the libcudf layer, but from the Cython side we are required to pass the expected output type beforehand here: https://github.com/rapidsai/cudf/blob/branch-22.04/python/cudf/cudf/_lib/binaryop.pyx#L159

@revans2 (Contributor) commented Feb 8, 2022

Spark plans the input and output types of all operations ahead of time when preparing the query plan. It has rules about how each operator affects the output type. Generally these rules assume the worst-case situation and do not take the data itself into account. For example, for the add operator the output precision is

max(p1 - s1, p2 - s2) + max(s1, s2) + 1

where p1 and p2 are the lhs and rhs precisions, and s1 and s2 are the lhs and rhs scales of the input operands (note that scale in Spark is the negative of the cudf scale).

But if I were adding 1.0 + 1.0 where the precision is 10 and the scale is 1, the output data type would have a precision of 11, no matter what the actual data in those columns holds. So yes, in Spark we end up with a lot of casting to wider types and processing data in types that can cover the worst possible situation.
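
As a quick worked example of that rule (illustrative arithmetic, not Spark code):

# Spark-style add rule: precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
p1 = p2 = 10  # input precisions
s1 = s2 = 1   # input scales (Spark convention, i.e. the negated cudf scale)
result_precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
result_scale = max(s1, s2)
print(result_precision, result_scale)  # 11 1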

Overflow checking tends to happen after the operation is done, and is really only an issue when the normal rule would produce a precision that is too large for a DECIMAL_128 to hold (precision of 38). Spark has somewhat arbitrary and configurable rules about what the output precision/scale should be if the rules would push the precision above 38.

I can get into all of the ugly details of this if you want, but it is not pretty. If cuDF could provide APIs, for at least some operations, that let us pick the smallest data type needed for the output of a decimal operation without overflowing, based on the data, it would be interesting and something we would look into using. However, it would be a significant amount of work for us to adopt because of things like joins, where we need both sides to have the same type.

@vyasr (Contributor) commented Feb 8, 2022

There's some discussion of how similar rules could be defined in the Python layer here.

@revans2 (Contributor) commented Feb 8, 2022

I am happy to help with that if anyone has questions about what Spark is doing; I have been up to my eyeballs in it for quite a while. Some of the odd parts of Spark are with divide, where for whatever reason they assume you want a scale of at least 6, and with the SUM aggregation, where they arbitrarily decide that the output precision should be 10 more than the input precision.
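
For reference, a sketch of those two rules as commonly documented for Hive/Spark (illustrative only, not cudf code; the cap/adjustment applied when precision would exceed 38 is omitted):

def spark_divide_type(p1, s1, p2, s2):
    # Division assumes you want a scale of at least 6.
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale

def spark_sum_type(p, s):
    # The SUM aggregation bumps the output precision by 10.
    return p + 10, s

print(spark_divide_type(10, 2, 10, 2))  # (23, 13)
print(spark_sum_type(10, 2))            # (20, 2)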

@vyasr (Contributor) commented Feb 8, 2022

I think that would be a discussion to have with @brandon-b-miller who spearheaded a lot of that work, and perhaps @randerzander regarding what our Python users would expect.

@brandon-b-miller (Contributor)

I need some time to remember exactly what all the caveats were here. But for what it's worth, historically we've been open to adopting Spark's rules for these decimal edge cases directly, since pandas does not support them and we'd rather not invent a new set of rules.

@shwina (Contributor) commented Feb 8, 2022

FWIW, my 2c here is that the Decimal128 type should be preserved. Similarly, summing two Decimal32s should give you back a Decimal32. There's no way to tell what might happen downstream of the binop that might need the 128 bits of precision.

@vyasr (Contributor) commented Feb 8, 2022

My 2c: you should never implicitly narrow the data type since there's no way to know if that's safe without some level of data introspection. Anything else is surprising, and it could even lead to users making unsafe assumptions later. Conversely, I think widening should follow rules similar to what @revans2 mentions above since that may be necessary for safety. Both of these conventions seem important enough to eat the memory costs.

@revans2 (Contributor) commented Feb 9, 2022

From our experience it is not just a memory cost; there are very real computation costs involved with 128 bits vs 64 bits. For multiply and divide (at least on an A6000), the 128-bit operations are compute bound, whereas the 64-bit binary operations are bound by memory bandwidth.

When doing aggregates we recently put together a feature/hack that lets us do SUM aggregations as four 64-bit operations instead of a single 128-bit operation, because the 128-bit version required sorting (no atomics on 128 bits). This gave us an order-of-magnitude speedup in some TPC-DS queries. Spark also has to do some overflow checks, and we were doing that with a divide operator that in some cases took nearly as long as the sum aggregation itself (without the sort), so the speedup we were seeing also involved removing that expensive operation.

In fact this gave me an idea and I filed another "Bobby had a crazy idea on how to speed up Decimal" issue NVIDIA/spark-rapids#4729

@codereport (Contributor, Author)

It doesn't really seem like there is a consensus at this point. I will add, though, that in reviewing the 22.02 release blog, it has a decimal128 example that shows addition narrowing to a decimal32. As a C++ dev, that just seems very odd: it is the equivalent of adding two std::vector<__int128_t> and getting back a std::vector<int32_t>. But maybe that is just my bias coming from the statically typed world of C++.

[screenshot: decimal128 addition example from the 22.02 release blog]

@shwina (Contributor) commented Feb 14, 2022

At least on the Python side, I think we should never change the dtype when doing binops on two inputs with identical dtypes. If this means overflow, then so be it. That is what pandas does for integer types, for example:

In [13]: s = pd.Series([np.iinfo("int32").max], dtype="int32")

In [14]: s
Out[14]:
0    2147483647
dtype: int32

In [15]: s + s
Out[15]:
0   -2
dtype: int32

@galipremsagar - does that sound OK to you? I'll raise an issue if it does.

@galipremsagar (Contributor) commented Feb 14, 2022

I will add, though, that in reviewing the 22.02 release blog, it has a decimal128 example that shows addition narrowing to a decimal32.

I agree that it looks odd to see the dtype being narrowed to decimal32.

At least on the Python side, I think we should never change the dtype when doing binops on two inputs with identical dtypes.

Sounds okay to me, but I think always doing so would defeat the purpose of having decimal types and would force users to upcast their real-world decimal32/decimal64 data to decimal128 just to get a decimal128 result. Also, there are precision & scale calculation rules for decimal types that don't exist for int-like types. Instead, we could do the following (a rough sketch follows the list):

Let's assume lhs & rhs are two entities with a binop being performed between them.

  1. If both lhs & rhs have the same type:
    i. Never narrow the result dtype if the resulting precision & scale of the binop can fit in that same type.
    ii. Upcast the result dtype if the resulting precision & scale of the binop can't fit in that type; this adheres to the precision & scale calculation rules.

  2. If lhs & rhs have different types:
    i. The result dtype will be the wider of the lhs and rhs dtypes, provided it can fit the resulting precision & scale.
    ii. If neither the lhs nor the rhs dtype can accommodate the precision & scale, upcast based on the regular rules.

I feel this should strike a balance by avoiding the odd narrowing behavior while still adhering to the resulting precision & scale calculations in binops.
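
A rough sketch of those rules (assuming 9/18/38-digit precision limits for Decimal32/64/128; names and structure here are illustrative, not an actual implementation):

WIDTHS = ["decimal32", "decimal64", "decimal128"]
MAX_PRECISION = {"decimal32": 9, "decimal64": 18, "decimal128": 38}

def result_width(lhs_width, rhs_width, result_precision):
    # Rules 1.i / 2.i: never go below the wider of the two input widths.
    start = max(WIDTHS.index(lhs_width), WIDTHS.index(rhs_width))
    for name in WIDTHS[start:]:
        # Rules 1.ii / 2.ii: upcast only when the result precision does not fit.
        if result_precision <= MAX_PRECISION[name]:
            return name
    raise OverflowError("result precision exceeds Decimal128")

# Two Decimal128 inputs stay Decimal128 even though precision 11 would fit in Decimal32:
print(result_width("decimal128", "decimal128", 11))  # decimal128
# Decimal32 + Decimal64 with result precision 19 upcasts to Decimal128:
print(result_width("decimal32", "decimal64", 19))    # decimal128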

@shwina (Contributor) commented Feb 14, 2022

Thanks @galipremsagar, I think that's fair. I opened #10282 as a bug to track this on the Python side.

@codereport (Contributor, Author)

Thanks @shwina @galipremsagar and everyone that weighed in. I will close this now that there is another issue open.

@bdice removed the Needs Triage (Need team to review and classify) label Mar 4, 2024