-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Check for overflow in TimedeltaIndex addition. #14237
Conversation
There is a similar bug for adding The slightly tricky part for this patch is how we check for overflow. For example, if we look at this function here, we replace certain elements with |
# Since Timestamp objects can never have a negative value, | ||
# we just need to check that the elements in result are also | ||
# nonnegative when their corresponding elements in i8 are. | ||
for x, y in zip(i8, result): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is performant - shouldn't this really be done via a checked_add
loop in cython (something like the approach here)? Then you could use the same elsewhere.
Might also need to be an option to turn off the checking? (e.g. do we also check on plain int64
s)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-b1 : That probably is possible, but that first sentence in your link seems to be somewhat discouraging. Also, what is the rational for turning off the checking? Or rather, why would overflow be a good thing in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To that point overflow isn't guaranteed produce a negative number here - it is undefined behavior in C, so impl defined. But it is possible to check for it, simpler version here
re: option - it's never a good thing, just meant that you may not always want to pay the perf penalty for it, along the lines of bounds checking, and (presumably) why numpy doesn't do it on regular ints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-b1 : Well at the moment, numpy
disregard for overflow is what is causing us headaches here. 😠 Also, the solutions you have provided are on an element by element basis, so where is the performance boost since we have to check an entire array of results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to check before the addition, so what I meant is replacing the addition above with somethin like
result = algos._checked_add(i8, other.value)
where _checked_add
is a cython func that does the checking/addition inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, of course. The Cython part escaped me briefly there for some reason.
Let me first placate the Appveyor before I attempt the optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a fresh look, and it seems like my question regarding masking is also relevant to this patch as well. Should we be complaining about overflow in elements that are going to be masked off at the end anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, addition with NaT
should never overflow, since the result will always be NaT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough. BTW, is it fair to assume that all integers are int64_t
and just check for int64
overflow?
2078ca0
to
d295470
Compare
Current coverage is 85.26% (diff: 100%)@@ master #14237 diff @@
==========================================
Files 140 140
Lines 50579 50584 +5
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43121 43128 +7
+ Misses 7458 7456 -2
Partials 0 0
|
d295470
to
37e8185
Compare
Yes, On Mon, Sep 19, 2016 at 9:43 AM, gfyoung [email protected] wrote:
|
6cf3e96
to
ef69eb0
Compare
@chris-b1 : Added the |
@gfyoung It seems to me that in your test you are only checking Timedelta + Timestamp, but those already error on overflow now. Shouldn't you check for the TimedeltaIndex case ? EDIT: Ah, I think you forgot the Further, can you show a perf comparison? |
------ | ||
OverflowError if a + b exceeds the maximum int64 value. | ||
""" | ||
if np.iinfo(np.int64).max - b < a: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the numpy c-api version of this so it's a compile time constant NPY_MAX_INT64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-b1 : How do I import this constant?
@@ -1348,6 +1348,56 @@ cdef inline float64_t _median_linear(float64_t* a, int n): | |||
return result | |||
|
|||
|
|||
cdef int64_t _checked_add_scalars(int64_t a, int64_t b) except -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except -1
won't work here since -1 is a valid result. Might need to a use an out parameter (int* valid
) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh. Good point. However, @jorisvandenbossche 's vectorization should make this moot.
result = [] | ||
|
||
for i in arr: | ||
result.append(_checked_add_scalars(i, b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appending to a python list like this will be slow, allocate a result array and place values into that (look how other functions in this file do it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jorisvandenbossche 's vectorization suggestion should resolve that.
I don't think cython is needed here
|
result = [] | ||
|
||
for i in arr: | ||
result.append(_checked_add_scalars(i, b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably this would be a nogil
loop too (with an escape hatch for the exception)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jorisvandenbossche 's vectorization suggestion should resolve that.
@jreback IIUC |
This is safe here
|
@jorisvandenbossche : Perf comparison of what exactly? |
Before/after of such a TimedeltaIndex addition (as the checking will give a certain overhead) |
I think the check that you do in the cython code ( |
@jorisvandenbossche : Writing it in Cython should provide a perf boost (we know the data types). However, your point about vectorization is well taken and would take care of some of the issues addressed by @chris-b1 . |
ef69eb0
to
cac0f88
Compare
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: arr = np.arange(1000000)
In [4]: td = pd.to_timedelta(arr)
In [5]: ts = pd.Timestamp('2000')
In [6]: %timeit -n 1000 ts + td
|
Cython doesn't really make sense with the current implementation, because you can do this with vectorized NumPy functions already. In the current implementation it's entirely pointless because you're just calling Python methods/functions. If you need the loop, then yes, you should go for Cython. |
@shoyer : IIUC, I think your understanding of "current implementation" is a little dated. I made changes in the meantime. Also, implementing in Cython as I said makes sense because we know the data types with which we will be operating. |
@gfyoung Nope, every line in your current Cython code is calling back to Python in order to use NumPy: def _checked_add_arr_scalar(ndarray[int64_t] arr, int64_t b):
if (np.iinfo(np.int64).max - b < arr).any(): # vectorized arithmetic, numpy.ndarray.any
raise OverflowError("Python int too large to " # python error
"convert to C long")
return arr + b # more vectorized arithmetic You could do most of this from NumPy's C-API by calling functions like |
You would add a
http://stackoverflow.com/questions/199333/how-to-detect-integer-overflow-in-c-c |
@chris-b1 : Consider the case where |
|
And as you're looking at this, my opinion is still that the error message should be changed to accurately describe what's happening. Error messages are more documentation than API, so no need to use the same message as other overflows. |
@chris-b1 : Let's table the error message for now. I'm trying to focus on patching this bug first. Your answer above about broadcasting doesn't quite answer my question. I'm trying to point out that your check for |
I'm not following you on the element-by-element (this is element by element), but I did have a comparison backwards, should be:
Note this won't catch underflows, that check would be:
|
@chris-b1 : My point is the following: is |
How about you try it? with broadcasting it will work with any combo of arrays/scalars. |
Let's put the check in a clearer form: if ((b > 0) & ((np.iinfo(np.int64).max - b) < a)).any() or
((b < 0) & ((np.iinfo(np.int64).min - b) > a)).any():
raise OverflowError(...) It's that what you are saying? |
@gfyoung Can you give a reproducible example that does not work? |
>>> import pandas.core.nanops as nanops
>>> import numpy as np
>>>
>>> a = np.array([1])
>>> b = np.array([np.iinfo(np.int64).min)])
>>> nanops._checked_add_with_arr(a, b) # Should not fail.
...
OverflowError: ... This is one of several ways to break the current implementation. |
The suggestion of @chris-b1 of above works fine on this example:
|
@jorisvandenbossche : True, but you have to also check for overflow in the opposite direction as well i.e. what happens if you add |
Yes, but you asked about broadcasting / loop element by element. I am just saying that the example code of @chris-b1 works fine for scalars/arrays, not that it is sufficient as a check for all cases. It is indeed true that making the check comprehensive for all cases will include several checks. |
@jorisvandenbossche : Fair enough. I will put up a PR to separately patch this first. |
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and TimeDelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Title is self-explanatory. Closes #14068.