[QST] Output Type of DecimalType Binary Operation #10230
This was intentional: wherever we had control over changing the dtype, i.e., during binops of decimal types, we return the narrowest decimal dtype that can hold the result:

```python
>>> (df["fixed"] + df["ep_d"]).dtype
Decimal64Dtype(precision=11, scale=2)
>>> df["fixed"].dtype
Decimal128Dtype(precision=10, scale=2)
>>> df["ep_d"].dtype
Decimal128Dtype(precision=10, scale=2)
```

The motivation to return a narrower dtype was to reduce memory usage.

cc: @beckernick
In general, reducing memory is a worthwhile goal, as out-of-memory errors are one of the most common failure modes. Beyond potentially surprising some users familiar with Decimal operations in engines like Spark, Hive, etc., do you think this poses any correctness concerns @codereport @galipremsagar @GregoryKimball?
From my POV, it doesn't pose any correctness concerns. I think it is just a bit surprising. If you have two columns A and B that are both Decimal128Dtype, you add them and get column C. You now might have to cast column C to 128 if it is 32 or 64, and I would just expect it to still be 128. I guess an even odder example is that if you take column A and multiply it by a scalar 1, you might end up getting back a column with a "different type". There is nothing wrong about the above, it just could be a bit confusing is all. As long as we know this is what we want, I am fine with it.
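As a user-level illustration of that cast-back (a hypothetical sketch reusing the `df` columns from the first comment, both Decimal128 with precision 10 and scale 2; it assumes decimal-to-decimal `astype`, which cudf supports):

```python
import cudf

# Reusing the hypothetical columns from the first comment.
result = df["fixed"] + df["ep_d"]   # may come back as Decimal64Dtype(11, 2)

# Cast back up if downstream code expects a Decimal128 column.
if not isinstance(result.dtype, cudf.Decimal128Dtype):
    result = result.astype(
        cudf.Decimal128Dtype(result.dtype.precision, result.dtype.scale)
    )
```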
Minimizing surprise is valuable, so I'd be interested to hear @shwina's thoughts on the surprise vs. reduced memory tradeoff here for PyData users.
On the Spark side, there is just one DecimalType … Edit: Keeping …
Is the opposite true as well? Do sums of 64-bit types overflow into 128-bit types? Regardless of the answer, I'd favor preserving dtype, but if overflow does not cast to a wider type then we should definitely avoid casting to a narrower type for correctness. Secondly, how is this done? This behavior seems like it must induce an extra cast (unnecessary copying?) if it chooses to change types. I would guess that we do not know the type that will fit the results until the results have been computed. Is a temporary array allocated for the wider type and then checked to see if a narrower type is sufficient, performing a cast-and-copy if so? That approach with the larger temporary array wouldn't help as much with OOM concerns and seems needlessly costly, but perhaps we have a better implementation.
The sum of two 64-bit types could result in a 128-bit type, but not always; it is purely based on the resulting precision.
I can't speak for what is happening inside the libcudf layer, but on the Cython side we are required to pass the expected output type beforehand, here: https://github.com/rapidsai/cudf/blob/branch-22.04/python/cudf/cudf/_lib/binaryop.pyx#L159
Spark plans the input and output types of all operations ahead of time when preparing the query plan. They have rules about how each operator will impact the output type. Generally these rules look at the worst-case situation and do not take into account the data itself. For example, for the add operator the output precision is max(p1 - s1, p2 - s2) + max(s1, s2) + 1, where p1 and p2 are the lhs and rhs precisions and s1 and s2 are the lhs and rhs scales of the input operands. (Note that scale in Spark is the negative of the cudf scale.) But if I was adding 1.0 + 1.0 where the precision was 10 and the scale was 1, the output data type would have a precision of 11. It does not matter what the actual data in those columns holds. So yes, in Spark we end up with a lot of casting to wider types and processing data in types that can cover the worst possible situation. Overflow checking tends to happen after the operation is done, and really only is an issue when the normal rule would produce a precision that is too large for a DECIMAL_128 to hold (precision of 38). Spark has somewhat arbitrary and configurable rules about what the output precision/scale should be if the rules would cause the precision to go above 38. I can get into all of the ugly details of this if you want, but it is not pretty. If cuDF could provide APIs, for at least some operations, that would let us pick the smallest data type needed for the output of a decimal operation without overflowing, based on the data, it would be interesting, and something that we would look into using. However, it would be a significant amount of work for us to use it because of things like joins, where we need both sides to match the same type.
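As a concrete illustration of the rule above, here is a minimal sketch (not Spark's or cuDF's actual code). The only cudf API it assumes is the Python-level Decimal32Dtype/Decimal64Dtype/Decimal128Dtype classes and their MAX_PRECISION attributes:

```python
import cudf


def spark_add_output_type(p1, s1, p2, s2):
    """Worst-case output precision/scale for decimal addition, per the rule above.

    Uses the Python-level (positive) scale convention, i.e. digits after the
    decimal point; the sign flip mentioned above is at the libcudf layer.
    """
    scale = max(s1, s2)
    precision = max(p1 - s1, p2 - s2) + scale + 1
    return precision, scale


def narrowest_cudf_dtype(precision, scale):
    """Pick the narrowest cudf decimal dtype that can hold the given precision."""
    for dtype in (cudf.Decimal32Dtype, cudf.Decimal64Dtype, cudf.Decimal128Dtype):
        if precision <= dtype.MAX_PRECISION:
            return dtype(precision, scale)
    raise OverflowError("precision exceeds 38; Spark applies special, configurable rules here")


# The 1.0 + 1.0 example from above: precision 10, scale 1 on both sides.
p, s = spark_add_output_type(10, 1, 10, 1)
print(p, s)                         # 11 1
print(narrowest_cudf_dtype(p, s))   # Decimal64Dtype(precision=11, scale=1)
```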
There's some discussion of how similar rules could be defined in the Python layer here.
I am happy to help with that if anyone has questions about what Spark is doing. I have been up to my eyeballs in it for quite a while. Some of the odd parts of Spark are with divide, where for whatever reason they assume you want a scale of at least 6. And then with the SUM aggregation, where they arbitrarily decide that the output precision should be 10 more than the input precision.
I think that would be a discussion to have with @brandon-b-miller, who spearheaded a lot of that work, and perhaps @randerzander regarding what our Python users would expect.
I need some time to remember exactly what all the caveats were here. But for what it's worth, historically we've been open to adopting Spark's rules for these decimal edge cases directly, since pandas does not support them and we'd rather not invent a new set of rules.
FWIW, my 2c here is that the Decimal128 type should be preserved. Similarly, summing two Decimal32s should give you back a Decimal32. There's no way to tell what might happen downstream of the binop that might need the 128 bits of precision.
My 2c: you should never implicitly narrow the data type, since there's no way to know if that's safe without some level of data introspection. Anything else is surprising, and it could even lead to users making unsafe assumptions later. Conversely, I think widening should follow rules similar to what @revans2 mentions above, since that may be necessary for safety. Both of these conventions seem important enough to eat the memory costs.
From our experience it is not just a memory cost. There are very real computation costs involved with 128 bits vs. 64 bits. For multiply and divide (at least on an A6000), 128-bit operations are compute bound, whereas the 64-bit binary operations are bound by memory bandwidth. When doing aggregates we recently put together a feature/hack that let us do SUM aggregations as four 64-bit operations instead of a single 128-bit operation. This is because the 128-bit version required sorting (no atomics on 128 bits). This gave us an order of magnitude speedup in some TPC-DS queries. Spark also has to do some overflow checks, and we were doing that using a divide operator, which in some cases took nearly as long as the sum aggregation itself (without the sort). So the speedup we were seeing also involved removing that expensive operation. In fact this gave me an idea and I filed another "Bobby had a crazy idea on how to speed up Decimal" issue: NVIDIA/spark-rapids#4729
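To illustrate the chunked-sum idea mentioned above, here is a sketch of the concept in plain Python; it is not the spark-rapids implementation, which operates on decimal columns on the GPU:

```python
def chunked_sum_128(values):
    """Sum non-negative 128-bit integers using four chunk accumulators.

    Each value is split into four 32-bit chunks and each chunk is accumulated
    independently; in a fixed-width implementation the accumulators stay within
    64 bits for up to ~2**32 rows. The chunks are recombined with carry
    propagation at the end.
    """
    acc = [0, 0, 0, 0]  # least-significant chunk first
    for v in values:
        for i in range(4):
            acc[i] += (v >> (32 * i)) & 0xFFFFFFFF

    total, carry = 0, 0
    for i in range(4):
        chunk = acc[i] + carry
        total |= (chunk & 0xFFFFFFFF) << (32 * i)
        carry = chunk >> 32
    # A non-zero final carry means the true sum overflowed 128 bits.
    return total, carry != 0


# Example: two values near the top of the 64-bit range still sum correctly.
print(chunked_sum_128([2**64 - 1, 2**64 - 1]))  # (36893488147419103230, False)
```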
It doesn't really seem like there is a consensus at this point. I will add, though, that in reviewing the 22.02 release blog, it has a …
At least on the Python side, I think we should never change the dtype when doing binops on two inputs with identical dtypes. If this means overflow, then so be it. That is what Pandas does for integer types, for example:

```python
In [13]: s = pd.Series([np.iinfo("int32").max], dtype="int32")

In [14]: s
Out[14]:
0    2147483647
dtype: int32

In [15]: s + s
Out[15]:
0   -2
dtype: int32
```

@galipremsagar - does that sound OK to you? I'll raise an issue if it does.
I agree that it looks odd to see the dtype change.
Sounds okay by me, but I think always doing so will defeat the purpose of having decimal types and will force users to upcast their real-world decimal32/decimal64 dtype data into decimal128. Let's assume …

I feel this should strike a balance by both avoiding the odd narrowing behavior and also adhering to the resultant …
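One possible reading of that balance, as a hypothetical sketch (not a design agreed on in this thread): follow the worst-case precision rule from the earlier sketch, but never narrow below the widest input. It assumes the same cudf dtype classes and attributes as before.

```python
import cudf

_WIDTH_ORDER = (cudf.Decimal32Dtype, cudf.Decimal64Dtype, cudf.Decimal128Dtype)


def proposed_add_dtype(lhs_dtype, rhs_dtype):
    """Output dtype that follows the worst-case precision rule but never narrows
    below the widest input dtype."""
    scale = max(lhs_dtype.scale, rhs_dtype.scale)
    precision = max(lhs_dtype.precision - lhs_dtype.scale,
                    rhs_dtype.precision - rhs_dtype.scale) + scale + 1

    # Floor the storage width at the widest of the two inputs.
    floor = max(_WIDTH_ORDER.index(type(lhs_dtype)),
                _WIDTH_ORDER.index(type(rhs_dtype)))
    for cls in _WIDTH_ORDER[floor:]:
        if precision <= cls.MAX_PRECISION:
            return cls(precision, scale)
    # Clamp at 38 for simplicity; real systems apply special rules here.
    return cudf.Decimal128Dtype(cudf.Decimal128Dtype.MAX_PRECISION, scale)


# Adding two Decimal128(10, 2) columns now stays Decimal128 (precision 11, scale 2),
# instead of narrowing to Decimal64 as in the example at the top of the thread.
print(proposed_add_dtype(cudf.Decimal128Dtype(10, 2), cudf.Decimal128Dtype(10, 2)))
```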
Thanks @galipremsagar, I think that's fair. I opened #10282 as a bug to track this on the Python side.
Thanks @shwina, @galipremsagar, and everyone who weighed in. I will close this now that there is another issue open.
@GregoryKimball was recently trying out the cuDF Decimal128Dtype and ran into an issue: the resulting type of a binary operation adding two Decimal128Dtype columns was a Decimal64Dtype column. I looked into it, and it seems we always set the output type to be the "minimum" width that works. See this line. Is this intentional? I would expect it to be the same as the input columns. I can set up a PR to fix it, but I just wanted to check if there was motivation behind this.

@galipremsagar (you worked on the recent PR that introduced this: #9533)
@shwina (you worked on the initial support for DecimalTypes with me)