Maybe buggy nullable integer floor division by 0 #30188
This is a case where the Series behavior is different from the ndarray behavior. Unless there's a compelling reason to match ndarray, I'd prefer to match Series (and DataFrame and Int64Index) |
@jbrockmendel Wouldn't matching it to Series be less intuitive for the users? They'd have a numpy array of integers and a pandas array of integers, and those would behave differently. In my opinion, that is less expected than IntegerArray and Series behaving differently (these are very different classes). |
The numpy behaviour is actually different for integers:
Not fully sure what the rationale of that difference is, though |
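For readers following along, the integer/float difference under discussion can be reproduced with a minimal sketch like the one below. This is not the snippet originally posted in the thread; the array contents are an assumption chosen for illustration. The integer result shown in the comment is what NumPy returns, while the float result depends on the NumPy version, so it is printed rather than stated.

```python
import numpy as np

# Illustrative inputs (an assumption, not the values from the original comment).
ints = np.array([1, 0, -1])
floats = np.array([1.0, 0.0, -1.0])

# Silence the divide-by-zero / invalid-value RuntimeWarnings NumPy emits here.
with np.errstate(divide="ignore", invalid="ignore"):
    int_result = ints // 0      # stays integer dtype, so it cannot hold NaN or Inf
    float_result = floats // 0  # float dtype; exact values depend on the NumPy version

print(int_result.dtype, int_result)      # integer dtype, values [0 0 0]
print(float_result.dtype, float_result)  # float64; 0.0 // 0 is NaN in all versions
```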
@jorisvandenbossche, thanks. The behavior of IntegerArray turns out to be the same as np.array:
To me, this looks like a numpy bug. |
The reasoning might be: floor division of an integer array returns a new integer array. So for that reason, the result cannot hold NaN, and 0 was taken instead. |
Hmm so do we have a decision on the expected value? It's not clear to me what's best here. |
It's also different yet again if the zero you're dividing by is float instead of integer. A lot of effort went into making the Series implementation of truediv/floordiv by zero not have these surprises, and the tests (tests.arithmetic.test_numeric) are fairly thorough. Could we do something like wrap the ops.array_ops function and then re-mask? Maybe even add IntegerArray to the parametrization in test_numeric. |
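A rough sketch of what "do the op, then re-mask" could look like, using plain NumPy arrays for the data and the mask. The function name `masked_floordiv` and its signature are hypothetical and this is not pandas' internal implementation; it only assumes the Series-style result for division by zero (signed Inf for non-zero numerators, NaN for 0 // 0) and carries the incoming mask through unchanged.

```python
import numpy as np

def masked_floordiv(values, mask, divisor):
    """Hypothetical helper: Series-like floordiv on (values, mask) pairs."""
    values = np.asarray(values, dtype="int64")
    mask = np.asarray(mask, dtype=bool)
    divisor = np.broadcast_to(np.asarray(divisor), values.shape)

    zero_div = divisor == 0
    # Divide by 1 where the divisor is 0 so the integer op itself is well defined.
    safe = np.where(zero_div, 1, divisor)
    result = (values // safe).astype("float64")

    # Fill in the Series-style results for division by zero.
    result[zero_div & (values > 0)] = np.inf
    result[zero_div & (values < 0)] = -np.inf
    result[zero_div & (values == 0)] = np.nan

    # "Re-mask": positions that were NA on input stay NA on output.
    return result, mask.copy()

result, out_mask = masked_floordiv([1, 0, -1, 5], [False, False, False, True], 0)
print(result, out_mask)
```

The last element is masked on input, so whatever value sits in `result` at that position is ignored by a consumer that honours `out_mask`.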
also seems like the options.treat_inf_as_na would be relevant |
Yes, but then you are again doing a float floor division, not an integer one.
And if the result is int, then you are limited in what the value can be.
What kind of surprises do you mean exactly? For true division I think Series and numpy array behave the same? (except for the warning that we hide)
I never use this, but I think this is about interpretation of inf as NA in functions that detect NA (isna, fillna, etc), not about creation of inf in operations (which is what we are discussing here I think) |
I think you're right about how we use that option. What I'm thinking of is:
Suppose we agree that IntegerArray should behave like Series. Then the IntegerArray op gets back |
So what is the rationale of the current Series behaviour for floor division by 0? When returning a float result, why not follow numpy's behaviour for floats of returning all NaN ? |
My recollection of the reasoning is that the semantic meaning of division/floordiv isn't impacted by the dtypes of numerator or denominator, so the results shouldn't be either. |
That explains why we have the same behaviour for ints and floats in Series, but not necessarily why we chose a different behaviour for floats compared to numpy?
|
I think it was about saying that floordiv should be "close" to truediv. I like to think of it as limiting behavior, but not sure if that was part of the discussion at the time. |
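The "limiting behavior" reading can be illustrated in plain Python, without NumPy or pandas: as a positive divisor shrinks toward zero, the floor quotient grows without bound, with the sign following the numerator. The divisors below are powers of two (an arbitrary choice made for this sketch) so that the float arithmetic is exact.

```python
# Divisors shrinking toward zero from above: 2**-1, 2**-10, 2**-50.
divisors = [2.0 ** -k for k in (1, 10, 50)]

# Floor quotients for a positive and a negative numerator.
quotients = {n: [n // d for d in divisors] for n in (1.0, -1.0)}

for numerator, values in quotients.items():
    print(numerator, values)
# 1.0  -> [2.0, 1024.0, 1125899906842624.0]
# -1.0 -> [-2.0, -1024.0, -1125899906842624.0]
```

Under this view, +Inf and -Inf are the natural values for `1.0 // 0` and `-1.0 // 0`, while `0 // 0` has no limit and stays NaN.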
Here is some relevant discussion about returning NaN vs Inf from float floordiv: numpy/numpy#7709. It seems most people involved in that issue agree Inf would be more logical, but nobody ever got around to actually making the change at the time. |
@jorisvandenbossche @jbrockmendel After reading everything, I think we should mimic Series. It shouldn't matter whether a user does |
related #22793 |
Looks like numpy/numpy#16161 changed their floordiv behavior, will have to see how much of this problem that solves. |
Numpy indeed changed (fixed) the floordiv behaviour for float. So instead of (with older numpy):
using latest numpy 1.20, we now get:
For integer (the actual topic of this issue), nothing changed though, and we still have:
I think for integer floor div, if we want to keep type stability of floor division always returning integer dtype (regardless of the exact value you're dividing by), the main options for 1 // 0 are: return 0 (like above, following numpy) or raise an error (something we otherwise don't do for other "invalid" operations in pandas). Given that we have missing values here (which numpy doesn't have), we could also return NA for those cases where float floordiv would give NaN or Inf. That would at least give some indication that something went wrong. But on the other hand, that would also create an inconsistency with our nullable floating dtype, which currently does not result in NA (since it can use NaN/Inf). |
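To make the "return NA" option concrete, here is a minimal sketch on a (values, mask) pair that keeps the integer dtype and marks every division by zero as missing. The name `int_floordiv_na` is hypothetical, and this describes one of the options being weighed, not current pandas behaviour.

```python
import numpy as np

def int_floordiv_na(values, mask, divisor):
    """Hypothetical helper: integer floordiv where x // 0 becomes NA."""
    values = np.asarray(values, dtype="int64")
    mask = np.asarray(mask, dtype=bool)
    divisor = np.broadcast_to(np.asarray(divisor, dtype="int64"), values.shape)

    zero_div = divisor == 0
    # Substitute 1 for zero divisors; those positions are masked out below.
    safe = np.where(zero_div, 1, divisor)
    result = values // safe  # stays int64, so the op is type stable

    # Existing NAs stay NA, and every division by zero becomes NA as well.
    return result, mask | zero_div

data, na = int_floordiv_na([4, 0, -3], [False, False, False], [2, 0, 0])
print(data.dtype, data, na)
```

The trade-off mentioned above still applies: a nullable float dtype would give NaN/Inf for the same inputs, so the two nullable dtypes would disagree.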
Trying to address this, the correct behavior depends on a resolution to #32265 |
Would it be terrible to leave the behavior undefined for the integer case? I think there's some precedent for that from C/C++ and it would avoid forcing data introspection / find and replace in some situations. |
I think internal consistency with non-masked behavior is pretty important, yes. |
IIUC there's the issue of which pandas objects should behave the same as each other, and secondly if any of them should follow what numpy does for integer floor division by zero (warning, return I can't speak so much to the former, but for the latter, I merely mean to offer that perhaps numpy should not be followed in this edge case. Indeed it would seem that this came up once before (numpy/numpy#5150) and in general when facing this issue myself I tend to think along the same lines as the OP from the numpy board: preferring not to make the choice. |
From @jschendel in #30183 (comment)
Those should probably all be NA, to match the ndarray behavior.