Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maybe buggy nullable integer floor division by 0 #30188

Open
TomAugspurger opened this issue Dec 10, 2019 · 23 comments
Open

Maybe buggy nullable integer floor division by 0 #30188

TomAugspurger opened this issue Dec 10, 2019 · 23 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 10, 2019

From @jschendel in #30183 (comment)

In [6]: a = pd.array([0, 1, -1, None], dtype="Int64")

In [7]: a // 0
Out[7]: 
<IntegerArray>
[0, 0, 0, NaN]
Length: 4, dtype: Int64

In [8]: a // -0
Out[8]: 
<IntegerArray>
[0, 0, 0, NaN]
Length: 4, dtype: Int64

In [9]: a // 0.0
Out[9]: array([nan, nan, nan, nan])

In [10]: a // -0.0
Out[10]: array([nan, nan, nan, nan])

Those should probably all be NA, to match the ndarray behavior.

@TomAugspurger TomAugspurger added the Numeric Operations Arithmetic, Comparison, and Logical operations label Dec 10, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Dec 10, 2019
@jbrockmendel
Copy link
Member

to match the ndarray behavior

This is a case where the Series behavior is different from the ndarray behavior. Unless there's a compelling reason to match ndarray, I'd prefer to match Series (and DataFrame and Int64Index)

@AlexKirko
Copy link
Member

@jbrockmendel Wouldn't matching it to Series be less intuitive for the users? They'd have a numpy array of integers and a pandas array of integers, and those would behave differently. In my opinion, that is less expected than IntegerArray and Series behaving differently (these are very different classes).
On the other hand, matching to Series promotes consistent processing of array-like objects in pandas, but I don't think that's as important as intuitive behavior.

@jorisvandenbossche
Copy link
Member

Those should probably all be NA, to match the ndarray behavior.

The numpy behaviour is actually different for integers:

In [16]: np.array([0, 0, 1]) // 0  
/home/joris/miniconda3/envs/dev/bin/ipython:1: RuntimeWarning: divide by zero encountered in floor_divide
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[16]: array([0, 0, 0])

In [17]: np.array([0., 0., 1.]) // 0  
/home/joris/miniconda3/envs/dev/bin/ipython:1: RuntimeWarning: invalid value encountered in floor_divide
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[17]: array([nan, nan, nan])

Not fully sure what the rationale of that difference is, though

@AlexKirko
Copy link
Member

AlexKirko commented Dec 11, 2019

@jorisvandenbossche, thanks. The behavior of IntegerArray turns out to be the same as np.array:

np.array([0,1,-1]) // 0
C:\Users\amata_n2pl6zt\.conda\envs\pandas-dev\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in floor_divide
array([0, 0, 0], dtype=int32)

pd.array([0,1,-1],dtype='Int32') // 0
<IntegerArray>
[0, 0, 0]
Length: 3, dtype: Int32

To me, this looks like a numpy bug.

@jorisvandenbossche
Copy link
Member

The reasoning might be: floor division of an integer array returns a new integer array. So for that reason, the result cannot hold NaN, and 0 was taken instead.

@simonjayhawkins simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 11, 2019
@TomAugspurger
Copy link
Contributor Author

Hmm so do we have a decision on the expected value? It's not clear to me what's best here.

@jbrockmendel
Copy link
Member

The numpy behaviour is actually different for integers:

It's also different yet again if the zero you're dividing by is float instead of integer.

A lot of effort went in to making the Series implementation of truediv/floordiv by zero not have these surprises, and the tests (tests.arithmetic.test_numeric) are fairly thorough.

Could we do something like wrap the ops.array_ops function and then re-mask? Maybe even add IntegerArray to the parametrization in test_numeric

@jbrockmendel
Copy link
Member

also seems like the options.treat_inf_as_na would be relevant

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 11, 2019

It's also different yet again if the zero you're dividing by is float instead of integer.

Yes, but then you are again doing a float floor division, not an integer one.
I suppose that numpy is going for type stability here, and decided on those return types:

int // int -> int
int // float -> float
float // float -> float

And if the result is int, then you are limited in what the value can be.

A lot of effort went in to making the Series implementation of truediv/floordiv by zero not have these surprises

What kind of surprises do you mean exactly? For true division I think Series and numpy array behave the same? (except for the warning that we hide)

also seems like the options.treat_inf_as_na would be relevant

I never use this, but I think this is about interpretation of inf as NA in functions that detect NA (isna, fillna, etc), not about creation of inf in operations (which is what we are discussing here I think)
Which surprises do you mean exactly? I don't think we treat true division differently as in numpy?

@jbrockmendel
Copy link
Member

I never use this, but I think this is about interpretation of inf as NA in functions that detect NA (isna, fillna, etc), not about creation of inf in operations

I think you're right about how we use that option. What im thinking of is:

arr = pd.array([-1, 0, 1])
ser = pd.Series([-1, 0, 1])

>>> ser // 0
0   -inf
1    NaN
2    inf
dtype: float64

Suppose we agree that IntegerArray should behave like Series. Then the IntegerArray op gets back np.array([-inf, NaN, inf]) and has to decide how to case the result. If we decide (possibly via an option) to treat inf as NA, then we could return an all-NA IntegerArray result instead of a float array.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 11, 2019

Suppose we agree that IntegerArray should behave like Series.

So what is the rationale of the current Series behaviour for floor division by 0? When returning a float result, why not follow numpy's behaviour for floats of returning all NaN ?

@jbrockmendel
Copy link
Member

So what is the rationale of the current Series behaviour for floor division by 0?

My recollection of the reasoning is that the semantic meaning of division/floordiv isn't impacted by the dtypes of numerator or denominator, so the results shouldn't be either.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 11, 2019

That explains why we have the same behaviour for ints and floats in Series, but not necessarily why we chose a different behaviour for floats compared to numpy?

In [54]: pd.Series([-1., 0., 1.]) // 0. 
Out[54]: 
0   -inf
1    NaN
2    inf
dtype: float64

In [55]: np.array([-1., 0., 1.]) // 0. 
/home/joris/miniconda3/envs/dev/bin/ipython:1: RuntimeWarning: invalid value encountered in floor_divide
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[55]: array([nan, nan, nan])

@jbrockmendel
Copy link
Member

but not necessarily why we chose a different behaviour for floats compared to numpy?

I think it was about saying that floordiv should be "close" to truediv. I like to think of it as limiting behavior, but not sure if that was part of the discussion at the time.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 11, 2019

Here is some relevant discussion about returning NaN vs Inf from float floordiv: numpy/numpy#7709. So it seems most people involved in that issue agree Inf would be more logical, but nobody got ever to actually making the change at the time.

@AlexKirko
Copy link
Member

@jorisvandenbossche @jbrockmendel After reading everything, I think we should mimic Series. It shouldn't matter whether a user does 1 // 0, 1 / 0, 1. // 0, or 1. / 0, because mathematically it's all the same thing. If this introduces inconsistencies with numpy, that's not good, but it's better than inconsistencies with math and logic.

@jbrockmendel
Copy link
Member

related #22793

@jbrockmendel
Copy link
Member

Looks like numpy/numpy#16161 changed their floordiv behavior, will have to see how much of this problem that solves.

@jorisvandenbossche jorisvandenbossche changed the title Maybe buggy IntegerArray floordivision by 0 Maybe buggy nullable integer floor division by 0 Feb 17, 2021
@jorisvandenbossche
Copy link
Member

Numpy indeed changed (fixed) the floordiv behaviour for float. So instead of (with older numpy):

In [1]: a = pd.array([0, 1, -1, None], dtype="Float64")

In [2]: a // 0.0
Out[2]: 
<FloatingArray>
[nan, nan, nan, <NA>]
Length: 4, dtype: Float64

using latest numpy 1.20, we now get:

In [2]: a // 0.0
Out[2]: 
<FloatingArray>
[nan, inf, -inf, <NA>]
Length: 4, dtype: Float64

For integer (the actual topic of this issue), nothing changed though, and we still have:

In [3]: a = pd.array([0, 1, -1, None], dtype="Int64")

In [4]: a // 0
Out[4]: 
<IntegerArray>
[0, 0, 0, <NA>]
Length: 4, dtype: Int64

I think for integer floor div, if we want to keep type stability of floor division always returning integer dtype (regardless of the exact value you're dividing with), the main options for 1 // 0 are: return 0 (like above, following numpy) or raise an error (something we otherwise don't do for other "invalid" operations in pandas).

Giving we have missing values here (which numpy doesn't have), we could also return NA for those cases where float floordiv would give NaN or Inf. That would at least give some indication that something went wrong. But on the other hand, that would also create an inconsistency with our nullable floating dtype, which does currently not result in NA (since it can use NaN/Inf).

cc @brandon-b-miller

@jbrockmendel
Copy link
Member

Trying to address this, the correct behavior depends on a resolution to #32265

@brandon-b-miller
Copy link

Would it be terrible to leave the behavior undefined for the integer case? I think there's some precedent for that from c/c++ and it would avoid forcing data introspection / find and replace in some situations.

@jbrockmendel
Copy link
Member

Would it be terrible to leave the behavior undefined for the integer case?

I think internal consistency with non-masked behavior is pretty important, yes.

@brandon-b-miller
Copy link

IIUC there's the issue of which pandas objects should behave the same as each other, and secondly if any of them should follow what numpy does for integer floor division by zero (warning, return 0).

I can't speak so much to the former, but for the latter, I merely mean to offer that perhaps numpy should not be followed in this edge case. Indeed it would seem that this came up once before (numpy/numpy#5150) and in general when facing this issue myself I tend to think along the same lines as the OP from the numpy board: preferring not to make the choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

No branches or pull requests

7 participants