BUG: Check for overflow in TimedeltaIndex addition. #14237

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Member

@gfyoung gfyoung commented Sep 16, 2016

Title is self-explanatory. Closes #14068.

@gfyoung
Member Author

gfyoung commented Sep 16, 2016

There is a similar bug for adding TimedeltaIndex and similar time delta objects with each other. I will add this as a separate commit but wanted to make sure my patch for the original issue passes.

The slightly tricky part for this patch is how we check for overflow. For example, if we look at this function here, we replace certain elements with tslib.iNaT. Should we check for overflow in the locations where they are supposed to be tslib.iNaT? Or ignore them?

# Since Timestamp objects can never have a negative value,
# we just need to check that the elements in result are also
# nonnegative when their corresponding elements in i8 are.
for x, y in zip(i8, result):
Contributor

I don't think this is performant - shouldn't this really be done via a checked_add loop in cython (something like the approach here)? Then you could use the same elsewhere.

Might also need to be an option to turn off the checking? (e.g. do we also check on plain int64s)?

Member Author

@chris-b1 : That probably is possible, but that first sentence in your link seems to be somewhat discouraging. Also, what is the rationale for turning off the checking? Or rather, why would overflow be a good thing in this context?

Contributor

To that point, overflow isn't guaranteed to produce a negative number here: signed overflow is undefined behavior in C, so the result depends on the compiler. But it is possible to check for it, simpler version here

re: option - it's never a good thing, just meant that you may not always want to pay the perf penalty for it, along the lines of bounds checking, and (presumably) why numpy doesn't do it on regular ints.
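To make the point above concrete, here is a minimal sketch (not code from this PR) contrasting a plain NumPy int64 addition, which raises nothing on overflow, with a check computed before the addition. The names `INT64_MAX`, `step`, `wrapped` and `would_overflow` are illustrative assumptions.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max

arr = np.array([0, INT64_MAX - 1], dtype=np.int64)
step = np.int64(10)

# Plain array addition: NumPy raises no error here. On typical platforms
# the second element silently wraps around to a negative number.
wrapped = arr + step

# Pre-check that never performs the overflowing addition. Because step > 0,
# INT64_MAX - step is itself guaranteed to stay within the int64 range.
would_overflow = arr > INT64_MAX - step

print(wrapped)
print(would_overflow)
```

The pre-check costs one extra subtraction and one comparison over the array, which is the overhead being weighed in this thread.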

Member Author

@chris-b1 : Well, at the moment, numpy's disregard for overflow is what is causing us headaches here. 😠 Also, the solutions you have provided are on an element-by-element basis, so where is the performance boost since we have to check an entire array of results?

Contributor

You have to check before the addition, so what I meant is replacing the addition above with something like
result = algos._checked_add(i8, other.value)

where _checked_add is a cython func that does the checking/addition inline
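For readers unfamiliar with the idea, the per-element check that such a helper would perform can be sketched in plain Python. This is an illustration only, not the Cython function proposed here: `checked_add_scalar`, `INT64_MAX` and `INT64_MIN` are hypothetical names, and Python ints never overflow, so the bounds are compared explicitly.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked_add_scalar(a: int, b: int) -> int:
    """Return a + b, raising OverflowError if the sum leaves the int64 range.

    The comparisons are arranged so that neither INT64_MAX - b nor
    INT64_MIN - b can itself leave the int64 range, which is what makes
    the same test safe to write in C or Cython.
    """
    if b > 0 and a > INT64_MAX - b:
        raise OverflowError("int64 addition would overflow")
    if b < 0 and a < INT64_MIN - b:
        raise OverflowError("int64 addition would underflow")
    return a + b
```

The key design choice is that the test happens before the addition, so the result never depends on how the platform treats signed overflow.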

Member Author

@gfyoung gfyoung Sep 16, 2016

Right, of course. The Cython part escaped me briefly there for some reason.

Let me first placate the Appveyor before I attempt the optimization.

Member Author

Took a fresh look, and it seems my question regarding masking is also relevant to this patch. Should we be complaining about overflow in elements that are going to be masked off at the end anyway?

Contributor

No, addition with NaT should never overflow, since the result will always be NaT.
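One way to honor that rule is to exclude NaT slots from the overflow check and restore them after the addition. The sketch below is an assumption-laden illustration, not the code merged in pandas: `add_skipping_nat` and `INAT` are hypothetical names, and `INAT` is taken to be the int64 minimum, the sentinel value visible in the array output later in this thread.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
INT64_MIN = np.iinfo(np.int64).min
INAT = INT64_MIN  # assumed NaT sentinel for int64-backed datetime data


def add_skipping_nat(i8, delta):
    """Add the integer delta to i8; NaT slots stay NaT and are never checked."""
    i8 = np.asarray(i8, dtype=np.int64)
    delta = int(delta)
    nat_mask = i8 == INAT

    # Flag elements whose sum would leave the int64 range.
    if delta > 0:
        bad = i8 > INT64_MAX - delta
    elif delta < 0:
        bad = i8 < INT64_MIN - delta
    else:
        bad = np.zeros(i8.shape, dtype=bool)

    # Only non-NaT elements are allowed to trigger the error.
    if (bad & ~nat_mask).any():
        raise OverflowError("overflow in int64 addition")

    # Replace NaT by 0 before adding so the addition itself cannot wrap,
    # then put the sentinel back.
    safe = np.where(nat_mask, np.int64(0), i8)
    result = safe + np.int64(delta)
    result[nat_mask] = INAT
    return result
```

For example, `add_skipping_nat([0, INAT], 10)` keeps the second element as NaT instead of raising.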

Member Author

Ah, fair enough. BTW, is it fair to assume that all integers are int64_t and just check for int64 overflow?

@codecov-io

codecov-io commented Sep 16, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14237 into master will increase coverage by <.01%

@@             master     #14237   diff @@
==========================================
  Files           140        140          
  Lines         50579      50584     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43121      43128     +7   
+ Misses         7458       7456     -2   
  Partials          0          0          

Powered by Codecov. Last update 7dedbed...77effde

@chris-b1
Contributor

Yes, Period, Timedelta and Timestamp are always backed by an int64_t value.

On Mon, Sep 19, 2016 at 9:43 AM, gfyoung [email protected] wrote:

@gfyoung commented on this pull request.

In pandas/tseries/tdi.py #14237:

@@ -344,6 +344,17 @@ def _add_datelike(self, other):
             other = Timestamp(other)
             i8 = self.asi8
             result = i8 + other.value
+            # gh-14068: there is the possibility of addition overflow,
+            # which occurs when we add two very large positive numbers,
+            # resulting in a negative number.
+            #
+            # Since Timestamp objects can never have a negative value,
+            # we just need to check that the elements in result are also
+            # nonnegative when their corresponding elements in i8 are.
+            for x, y in zip(i8, result):

Ah, fair enough. BTW, is it fair to assume that all integers are int64_t
and just check for int64 overflow?



@gfyoung gfyoung force-pushed the add-overflow branch 2 times, most recently from 6cf3e96 to ef69eb0 Compare September 20, 2016 04:57
@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@chris-b1 : Added the algos implementation for _checked_add, and Travis + Appveyor are happy.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 20, 2016

@gfyoung It seems to me that in your test you are only checking Timedelta + Timestamp, but those already error on overflow now. Shouldn't you check for the TimedeltaIndex case? EDIT: Ah, I think you forgot the [...] in the third and fourth case?

Further, can you show a perf comparison?

------
OverflowError if a + b exceeds the maximum int64 value.
"""
if np.iinfo(np.int64).max - b < a:
Contributor

Use the numpy c-api version of this so it's a compile time constant NPY_MAX_INT64

Member Author

@chris-b1 : How do I import this constant?

@@ -1348,6 +1348,56 @@ cdef inline float64_t _median_linear(float64_t* a, int n):
return result


cdef int64_t _checked_add_scalars(int64_t a, int64_t b) except -1:
Contributor

except -1 won't work here since -1 is a valid result. Might need to use an out parameter (int* valid) instead.

Member Author

Argh. Good point. However, @jorisvandenbossche 's vectorization should make this moot.

result = []

for i in arr:
result.append(_checked_add_scalars(i, b))
Contributor

Appending to a python list like this will be slow, allocate a result array and place values into that (look how other functions in this file do it).
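The allocation pattern being suggested might look roughly like the following, written in plain Python for readability rather than Cython. It is a sketch under assumptions: `checked_add_loop` is a hypothetical name, and the real function would use typed loop variables instead of converting each element to a Python int.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
INT64_MIN = np.iinfo(np.int64).min


def checked_add_loop(arr, b):
    """Add the integer b to every element of an int64 array, checking range first."""
    b = int(b)
    n = len(arr)
    # Allocate the output once instead of growing a Python list.
    result = np.empty(n, dtype=np.int64)
    for i in range(n):
        a = int(arr[i])  # Python int, so the comparisons below cannot wrap
        if (b > 0 and a > INT64_MAX - b) or (b < 0 and a < INT64_MIN - b):
            raise OverflowError("overflow in int64 addition")
        result[i] = a + b
    return result
```

Writing into a preallocated array avoids per-element list growth and the final list-to-array conversion.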

Member Author

I think @jorisvandenbossche 's vectorization suggestion should resolve that.

@jreback
Contributor

jreback commented Sep 20, 2016

I don't think cython is needed here

In [11]: arr = pd.to_datetime([0,np.iinfo(np.int64).max-1,pd.lib.iNaT])

In [12]: arr
Out[12]: DatetimeIndex(['1970-01-01 00:00:00', '2262-04-11 23:47:16.854775806', 'NaT'], dtype='datetime64[ns]', freq=None)

In [13]: arr = pd.to_datetime([0,np.iinfo(np.int64).max-1,pd.lib.iNaT]).asi8

In [14]: arr
Out[14]: array([                   0,  9223372036854775806, -9223372036854775808])

In [15]: arr+10
Out[15]: array([                  10, -9223372036854775800, -9223372036854775798])

In [16]: ((arr+10) < 0) & (arr != pd.lib.iNaT)
Out[16]: array([False,  True, False], dtype=bool)

result = []

for i in arr:
result.append(_checked_add_scalars(i, b))
Contributor

Preferably this would be a nogil loop too (with an escape hatch for the exception)

Member Author

I think @jorisvandenbossche 's vectorization suggestion should resolve that.

@jreback jreback added Bug Timedelta Timedelta data type labels Sep 20, 2016
@chris-b1
Contributor

@jreback IIUC [15] is technically undefined behavior (signed int overflow in C) - although maybe all the compilers python is built on wrap like that?

@jreback
Contributor

jreback commented Sep 20, 2016

@chris-b1 I am not sure.

cc @charris
@shoyer

any idea if this is always true?

@jreback
Contributor

jreback commented Sep 20, 2016

This is safe here

In [25]: ((np.iinfo(np.int64).max-arr+10) > 0) & (arr != pd.lib.iNaT)
Out[25]: array([False,  True, False], dtype=bool)

@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@jorisvandenbossche : Perf comparison of what exactly?

@jorisvandenbossche
Member

Perf comparison of what exactly?

Before/after of such a TimedeltaIndex addition (as the checking will give a certain overhead)

@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@jreback : BTW, I think @chris-b1 suggested Cython for perf reasons, not for correctness (that was assumed) IIUC.

@jorisvandenbossche
Member

I think the check that you do in the cython code (np.iinfo(np.int64).max - b < a) can also be done vectorized in plain numpy? (instead of the looping you had initially) In that case cython is maybe indeed not needed.

@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@jorisvandenbossche : Writing it in Cython should provide a perf boost (we know the data types). However, your point about vectorization is well taken and would take care of some of the issues addressed by @chris-b1 .

@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@jorisvandenbossche :

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: arr = np.arange(1000000)
In [4]: td = pd.to_timedelta(arr)
In [5]: ts = pd.Timestamp('2000')
In [6]: %timeit -n 1000 ts + td

master: 1000 loops, best of 3: 1.92 ms per loop
PR: 1000 loops, best of 3: 2.09 ms per loop

@shoyer
Member

shoyer commented Sep 20, 2016

Cython doesn't really make sense with the current implementation, because you can do this with vectorized NumPy functions already. In the current implementation it's entirely pointless because you're just calling Python methods/functions. If you need the loop, then yes, you should go for Cython.

@gfyoung
Member Author

gfyoung commented Sep 20, 2016

@shoyer : IIUC, I think your understanding of "current implementation" is a little dated. I made changes in the meantime. Also, implementing in Cython as I said makes sense because we know the data types with which we will be operating.

@shoyer
Member

shoyer commented Sep 20, 2016

@gfyoung Nope, every line in your current Cython code is calling back to Python in order to use NumPy:

def _checked_add_arr_scalar(ndarray[int64_t] arr, int64_t b):
    if (np.iinfo(np.int64).max - b < arr).any():  # vectorized arithmetic, numpy.ndarray.any
        raise OverflowError("Python int too large to "  # python error
                            "convert to C long")
    return arr + b  # more vectorized arithmetic

You could do most of this from NumPy's C-API by calling functions like PyArray_Any but that would be pretty pointless.

@chris-b1
Contributor

You would add a > 0 guard, like this;

(b > 0) & ((np.iinfo(np.int64).max - b) > a)

http://stackoverflow.com/questions/199333/how-to-detect-integer-overflow-in-c-c

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

@chris-b1 : Consider the case where b = np.array([1, -1]). Unless you are suggesting doing this element by element? Then that would make more sense.

@chris-b1
Contributor

The > operator will broadcast, so yes, it is element by element

@chris-b1
Contributor

chris-b1 commented Sep 28, 2016

And as you're looking at this, my opinion is still that the error message should be changed to accurately describe what's happening. Error messages are more documentation than API, so no need to use the same message as other overflows.

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

@chris-b1 : Let's table the error message for now. I'm trying to focus on patching this bug first. Your answer above about broadcasting doesn't quite answer my question. I'm trying to point out that your check for > 0 is insufficient unless we decide to explicitly loop through element by element. Is that what you're implying?

@chris-b1
Contributor

chris-b1 commented Sep 28, 2016

I'm not following you on the element-by-element (this is element by element), but I did have a comparison backwards, should be:

(b > 0) & ((np.iinfo(np.int64).max - b) < a)

Note this won't catch underflows, that check would be:

(b < 0) & ((np.iinfo(np.int64).min - b) > a)

@gfyoung
Copy link
Member Author

gfyoung commented Sep 28, 2016

@chris-b1 : My point is the following: are a and b in your code scalars ONLY? Because I'm almost certain that first check will fail if b is an array.

@chris-b1
Contributor

How about you try it? with broadcasting it will work with any combo of arrays/scalars.
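A quick way to try it is sketched below. The mask follows the corrected upper-bound check from a few comments up, with one addition that is not from this thread: non-positive `b` is clamped to zero before the subtraction so that `INT64_MAX - b` cannot itself wrap. `positive_overflow_mask` is a hypothetical helper name.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max


def positive_overflow_mask(a, b):
    """Elementwise mask: True where a + b would exceed the int64 maximum."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    positive = b > 0
    # Clamp b to 0 where it is not positive; the subtraction then stays in range.
    limit = INT64_MAX - np.where(positive, b, np.int64(0))
    return positive & (limit < a)


# array + array, including the mixed-sign b from the question above
print(positive_overflow_mask([INT64_MAX, 5], [1, -1]))
# array + scalar
print(positive_overflow_mask([INT64_MAX, 0], 1))
# scalar + array
print(positive_overflow_mask(INT64_MAX, [0, 1]))
```

In each call NumPy broadcasting pairs the operands element by element, so no explicit loop is needed.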

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

Let's put the check in a clearer form:

if (((b > 0) & ((np.iinfo(np.int64).max - b) < a)).any() or
        ((b < 0) & ((np.iinfo(np.int64).min - b) > a)).any()):
    raise OverflowError(...)

Is that what you are saying?

@jorisvandenbossche
Member

@gfyoung Can you give a reproducible example that does not work?

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

@jorisvandenbossche :

>>> import pandas.core.nanops as nanops
>>> import numpy as np
>>>
>>> a = np.array([1])
>>> b = np.array([np.iinfo(np.int64).min])
>>> nanops._checked_add_with_arr(a, b)  # Should not fail.
...
OverflowError: ...

This is one of several ways to break the current implementation.
Also, take a look at the examples I gave earlier.

@jorisvandenbossche
Member

The suggestion of @chris-b1 of above works fine on this example:

In [19]: (b > 0) & ((np.iinfo(np.int64).max - b) < a)
Out[19]: array([False], dtype=bool)

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

@jorisvandenbossche : True, but you have to also check for overflow in the opposite direction as well i.e. what happens if you add np.iinfo(np.int64).min to itself? The current check that you provided by itself will not work.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 28, 2016

Yes, but you asked about broadcasting / loop element by element. I am just saying that the example code of @chris-b1 works fine for scalars/arrays, not that it is sufficient as a check for all cases. It is indeed true that making the check comprehensive for all cases will include several checks.
And of course that will impact the performance ...
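Putting both directions together, a comprehensive vectorized check could look something like this. It is a sketch only, not the implementation that was eventually merged: `checked_add` is a hypothetical name, and the clamping via `np.where` is an assumption added here so that the bound computations themselves never leave the int64 range.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
INT64_MIN = np.iinfo(np.int64).min


def checked_add(a, b):
    """Return a + b for int64 inputs, raising OverflowError in either direction."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    pos = b > 0
    neg = b < 0
    zero = np.int64(0)
    # Largest safe a where b > 0, and smallest safe a where b < 0.
    upper = INT64_MAX - np.where(pos, b, zero)
    lower = INT64_MIN - np.where(neg, b, zero)
    if (pos & (a > upper)).any() or (neg & (a < lower)).any():
        raise OverflowError("overflow in int64 addition")
    return a + b


# The example reported above as "should not fail" passes:
print(checked_add([1], [INT64_MIN]))
```

Adding the int64 minimum to itself is rejected by the lower-bound branch, at the cost of several extra passes over the data, which is the performance trade-off noted above.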

@gfyoung
Member Author

gfyoung commented Sep 28, 2016

@jorisvandenbossche : Fair enough. I will put up a PR to separately patch this first.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 7, 2016
Expands checked-add array addition introduced in
pandas-devgh-14237 to include all other addition cases (i.e.
TimedeltaIndex and TimeDelta). Follow-up to pandas-devgh-14453.
jorisvandenbossche pushed a commit that referenced this pull request Dec 17, 2016
Expands checked-add array addition introduced in
gh-14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to gh-14453.

In addition, move checked add function to core/algorithms.