BUG: Check for overflow in TimedeltaIndex addition. #14237

gfyoung · 2016-09-16T19:05:55Z

Title is self-explanatory. Closes #14068.

gfyoung · 2016-09-16T19:13:16Z

There is a similar bug for adding TimedeltaIndex and similar time delta objects with each other. I will add this as a separate commit but wanted to make sure my patch for the original issue passes.

The slightly tricky part for this patch is how we check for overflow. For example, if we look at this function here, we replace certain elements with tslib.iNaT. Should we check for overflow in the locations where they are supposed to be tslib.iNaT? Or ignore them?

chris-b1 · 2016-09-16T20:35:23Z

pandas/tseries/tdi.py

+            # Since Timestamp objects can never have a negative value,
+            # we just need to check that the elements in result are also
+            # nonnegative when their corresponding elements in i8 are.
+            for x, y in zip(i8, result):


I don't think this is performant - shouldn't this really be done via a checked_add loop in cython (something like the approach here)? Then you could use the same elsewhere.

Might also need to be an option to turn off the checking? (e.g. do we also check on plain int64s)?

@chris-b1 : That probably is possible, but that first sentence in your link seems to be somewhat discouraging. Also, what is the rationale for turning off the checking? Or rather, why would overflow be a good thing in this context?

To that point, overflow isn't guaranteed to produce a negative number here: signed overflow is undefined behavior in C, so the result depends on the implementation. But it is possible to check for it; simpler version here

re: option - it's never a good thing, just meant that you may not always want to pay the perf penalty for it, along the lines of bounds checking, and (presumably) why numpy doesn't do it on regular ints.

@chris-b1 : Well, at the moment, numpy's disregard for overflow is what is causing us headaches here. 😠 Also, the solutions you have provided work on an element-by-element basis, so where is the performance boost, since we have to check an entire array of results?

You have to check before the addition, so what I meant is replacing the addition above with something like
result = algos._checked_add(i8, other.value)

where _checked_add is a cython func that does the checking/addition inline
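For reference, a minimal sketch of the check-before-add idea for a single pair of values, in plain Python rather than Cython. The name `checked_add_int64` and the two bound constants are assumptions made for illustration; this is not the helper the PR adds. Python integers never overflow, so the int64 bounds are spelled out explicitly.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked_add_int64(a, b):
    """Add two values that each fit in int64, refusing to overflow.

    Hypothetical helper for illustration. The range test runs before the
    addition, which is what a C or Cython version would have to do.
    """
    if b > 0 and a > INT64_MAX - b:
        raise OverflowError("int64 addition would overflow")
    if b < 0 and a < INT64_MIN - b:
        raise OverflowError("int64 addition would underflow")
    return a + b
```

Rearranging the comparison so that only `INT64_MAX - b` or `INT64_MIN - b` is computed keeps every intermediate value inside the int64 range.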

Right, of course. The Cython part escaped me briefly there for some reason.

Let me first placate the Appveyor before I attempt the optimization.

Took a fresh look, and it seems my question regarding masking is relevant to this patch as well. Should we be complaining about overflow in elements that are going to be masked off at the end anyway?

No, addition with NaT should never overflow, since the result will always be NaT.
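A rough sketch of what "ignore the NaT slots" could look like with plain NumPy. The function name is hypothetical, `NAT` is assumed to be the int64 minimum (the value pandas uses for iNaT), and `other` is assumed to be a non-negative Python int; none of this is taken from the patch itself.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
NAT = np.iinfo(np.int64).min  # assumption: same sentinel value as pandas' iNaT


def overflows_ignoring_nat(i8, other):
    """Return True if i8 + other would exceed int64 max at a non-NaT slot.

    Illustrative only; assumes `other` is a non-negative Python int.
    """
    i8 = np.asarray(i8, dtype=np.int64)
    valid = i8 != NAT                        # slots that survive the masking
    would_overflow = i8 > INT64_MAX - other  # tested without doing the addition
    return bool((would_overflow & valid).any())
```

Slots holding the sentinel are excluded from the check, since their result is replaced by NaT regardless of the sum.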

Ah, fair enough. BTW, is it fair to assume that all integers are int64_t and just check for int64 overflow?

codecov-io · 2016-09-16T22:15:06Z

Current coverage is 85.26% (diff: 100%)

Merging #14237 into master will increase coverage by <.01%

@@             master     #14237   diff @@
==========================================
  Files           140        140          
  Lines         50579      50584     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43121      43128     +7   
+ Misses         7458       7456     -2   
  Partials          0          0

Powered by Codecov. Last update 7dedbed...77effde

chris-b1 · 2016-09-19T14:47:06Z

Yes, Period, Timedelta and Timestamp are always backed by an int64_t value.

On Mon, Sep 19, 2016 at 9:43 AM, gfyoung [email protected] wrote:

@gfyoung commented on this pull request.

In pandas/tseries/tdi.py #14237:
@@ -344,6 +344,17 @@ def _add_datelike(self, other):
other = Timestamp(other)
i8 = self.asi8
result = i8 + other.value
       # gh-14068: there is the possibility of addition overflow,
       # which occurs when we add two very large positive numbers,
       # resulting in a negative number.
       #
       # Since Timestamp objects can never have a negative value,
       # we just need to check that the elements in result are also
       # nonnegative when their corresponding elements in i8 are.
       for x, y in zip(i8, result):
Ah, fair enough. BTW, is it fair to assume that all integers are int64_t
and just check for int64 overflow?


gfyoung · 2016-09-20T06:10:35Z

@chris-b1 : Added the algos implementation for _checked_add, and Travis + Appveyor are happy.

jorisvandenbossche · 2016-09-20T07:32:35Z

@gfyoung It seems to me that in your test you are only checking Timedelta + Timestamp, but those already error on overflow now. Shouldn't you check for the TimedeltaIndex case? EDIT: Ah, I think you forgot the [...] in the third and fourth case?

Further, can you show a perf comparison?

chris-b1 · 2016-09-20T10:08:57Z

pandas/algos.pyx

+    ------
+    OverflowError if a + b exceeds the maximum int64 value.
+    """
+    if np.iinfo(np.int64).max - b < a:


Use the NumPy C-API constant for this, NPY_MAX_INT64, so that it's a compile-time constant.

@chris-b1 : How do I import this constant?

chris-b1 · 2016-09-20T10:11:13Z

pandas/algos.pyx

@@ -1348,6 +1348,56 @@ cdef inline float64_t _median_linear(float64_t* a, int n):
    return result


+cdef int64_t _checked_add_scalars(int64_t a, int64_t b) except -1:


except -1 won't work here since -1 is a valid result. Might need to use an out parameter (int* valid) instead.

Argh. Good point. However, @jorisvandenbossche 's vectorization should make this moot.
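To illustrate the sentinel problem in Python terms: a sketch where validity is reported separately from the sum, standing in for the `int* valid` out parameter. The function name is hypothetical, and returning a tuple is only an analogy for what the Cython signature would do.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked_add_with_flag(a, b):
    """Return (result, valid); `valid` plays the role of the out parameter."""
    total = a + b  # Python ints do not overflow, so the range is tested afterwards
    if total > INT64_MAX or total < INT64_MIN:
        return 0, False
    return total, True


# -1 is a legitimate sum, so it cannot double as the error signal.
result, valid = checked_add_with_flag(1, -2)
```

Here `result` is -1 and `valid` is True, which a `-1`-as-error convention could not express.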

chris-b1 · 2016-09-20T10:12:46Z

pandas/algos.pyx

+    result = []
+
+    for i in arr:
+        result.append(_checked_add_scalars(i, b))


Appending to a python list like this will be slow, allocate a result array and place values into that (look how other functions in this file do it).

I think @jorisvandenbossche 's vectorization suggestion should resolve that.
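For comparison, a sketch of the preallocation pattern in plain Python and NumPy. In Cython the loop body would work on typed int64 values; here the function name is hypothetical and `b` is assumed to be a non-negative Python int.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max


def checked_add_prealloc(arr, b):
    """Add non-negative int b to each element, writing into a preallocated array."""
    arr = np.asarray(arr, dtype=np.int64)
    result = np.empty(len(arr), dtype=np.int64)  # allocated once, up front
    for i in range(len(arr)):
        value = int(arr[i])
        if value > INT64_MAX - b:  # test before adding
            raise OverflowError("int64 addition would overflow")
        result[i] = value + b
    return result
```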

jreback · 2016-09-20T10:14:02Z

I don't think cython is needed here

In [11]: arr = pd.to_datetime([0,np.iinfo(np.int64).max-1,pd.lib.iNaT])

In [12]: arr
Out[12]: DatetimeIndex(['1970-01-01 00:00:00', '2262-04-11 23:47:16.854775806', 'NaT'], dtype='datetime64[ns]', freq=None)

In [13]: arr = pd.to_datetime([0,np.iinfo(np.int64).max-1,pd.lib.iNaT]).asi8

In [14]: arr
Out[14]: array([                   0,  9223372036854775806, -9223372036854775808])

In [15]: arr+10
Out[15]: array([                  10, -9223372036854775800, -9223372036854775798])

In [16]: ((arr+10) < 0) & (arr != pd.lib.iNaT)
Out[16]: array([False,  True, False], dtype=bool)

chris-b1 · 2016-09-20T10:14:04Z

pandas/algos.pyx

+    result = []
+
+    for i in arr:
+        result.append(_checked_add_scalars(i, b))


Preferably this would be a nogil loop too (with an escape hatch for the exception)

I think @jorisvandenbossche 's vectorization suggestion should resolve that.

chris-b1 · 2016-09-20T10:18:49Z

@jreback IIUC [15] is technically undefined behavior (signed int overflow in C) - although maybe all the compilers python is built on wrap like that?

jreback · 2016-09-20T10:24:39Z

@chris-b1 I am not sure.

cc @charris
@shoyer

any idea if this is always true?

jreback · 2016-09-20T10:27:28Z

This is safe here

In [25]: ((np.iinfo(np.int64).max-arr+10) > 0) & (arr != pd.lib.iNaT)
Out[25]: array([False,  True, False], dtype=bool)

gfyoung · 2016-09-20T14:01:50Z

@jorisvandenbossche : Perf comparison of what exactly?

jorisvandenbossche · 2016-09-20T14:14:43Z

Perf comparison of what exactly?

Before/after of such a TimedeltaIndex addition (as the checking will give a certain overhead)

gfyoung · 2016-09-20T14:15:22Z

@jreback : BTW, I think @chris-b1 suggested Cython for perf reasons, not for correctness (that was assumed) IIUC.

jorisvandenbossche · 2016-09-20T14:16:34Z

I think the check that you do in the cython code (np.iinfo(np.int64).max - b < a) can also be done vectorized in plain numpy? (instead of the looping you had initially) In that case cython is maybe indeed not needed.
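A sketch of that vectorized form, assuming `b` is a non-negative scalar. The function name is made up for illustration, and the check covers only the upper bound.

```python
import numpy as np


def add_would_overflow(arr, b):
    """Element-wise: would arr + b exceed the int64 maximum? Assumes b >= 0."""
    arr = np.asarray(arr, dtype=np.int64)
    # Same comparison as the scalar version, applied to the whole array at once.
    return np.iinfo(np.int64).max - b < arr
```

Calling `.any()` on the returned mask gives a single yes/no answer for the whole array without a Python-level loop.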

gfyoung · 2016-09-20T14:28:31Z

@jorisvandenbossche : Writing it in Cython should provide a perf boost (we know the data types). However, your point about vectorization is well taken and would take care of some of the issues addressed by @chris-b1 .

gfyoung · 2016-09-20T14:56:42Z

@jorisvandenbossche :

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: arr = np.arange(1000000)
In [4]: td = pd.to_timedelta(arr)
In [5]: ts = pd.Timestamp('2000')
In [6]: %timeit -n 1000 ts + td

master: 1000 loops, best of 3: 1.92 ms per loop
PR: 1000 loops, best of 3: 2.09 ms per loop
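A rough way to estimate the overhead of the check itself outside of pandas, using only NumPy and `timeit`. This is a sketch under assumptions: `plain_add` and `checked_add` are hypothetical stand-ins, not the code paths measured above, and `b` is the nanosecond value of 2000-01-01 chosen to mirror `Timestamp('2000')`.

```python
import timeit

import numpy as np

arr = np.arange(1_000_000, dtype=np.int64)
b = 946_684_800_000_000_000  # nanoseconds since the epoch for 2000-01-01


def plain_add(a, b):
    return a + b


def checked_add(a, b):
    # Upper-bound check only; assumes b >= 0.
    if (np.iinfo(np.int64).max - b < a).any():
        raise OverflowError("int64 addition would overflow")
    return a + b


runs = 20
plain = min(timeit.repeat(lambda: plain_add(arr, b), number=runs, repeat=3))
checked = min(timeit.repeat(lambda: checked_add(arr, b), number=runs, repeat=3))
print(f"plain: {plain / runs * 1e3:.3f} ms, checked: {checked / runs * 1e3:.3f} ms")
```

The absolute numbers depend on the machine; only the ratio between the two is informative.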

shoyer · 2016-09-20T16:16:44Z

Cython doesn't really make sense with the current implementation, because you can do this with vectorized NumPy functions already. In the current implementation it's entirely pointless because you're just calling Python methods/functions. If you need the loop, then yes, you should go for Cython.

gfyoung · 2016-09-20T16:19:41Z

@shoyer : IIUC, I think your understanding of "current implementation" is a little dated. I made changes in the meantime. Also, implementing in Cython as I said makes sense because we know the data types with which we will be operating.

shoyer · 2016-09-20T16:28:18Z

@gfyoung Nope, every line in your current Cython code is calling back to Python in order to use NumPy:

def _checked_add_arr_scalar(ndarray[int64_t] arr, int64_t b):
    if (np.iinfo(np.int64).max - b < arr).any():  # vectorized arithmetic, numpy.ndarray.any
        raise OverflowError("Python int too large to "  # python error
                            "convert to C long")
    return arr + b  # more vectorized arithmetic

You could do most of this from NumPy's C-API by calling functions like PyArray_Any but that would be pretty pointless.

chris-b1 · 2016-09-28T21:26:55Z

You would add a "> 0" guard, like this:

(b > 0) & ((np.iinfo(np.int64).max - b) > a)

http://stackoverflow.com/questions/199333/how-to-detect-integer-overflow-in-c-c

gfyoung · 2016-09-28T21:28:42Z

@chris-b1 : Consider the case where b = np.array([1, -1]). Unless you are suggesting doing this element by element? Then that would make more sense.

chris-b1 · 2016-09-28T21:30:53Z

> will broadcast, so yes, it is element by element

chris-b1 · 2016-09-28T21:34:04Z

And as you're looking at this, my opinion is still that the error message should be changed to accurately describe what's happening. Error messages are more documentation than API, so no need to use the same message as other overflows.

gfyoung · 2016-09-28T21:37:42Z

@chris-b1 : Let's table the error message for now. I'm trying to focus on patching this bug first. Your answer above about broadcasting doesn't quite answer my question. I'm trying to point out that your check for > 0 is insufficient unless we decide to explicitly loop through element by element. Is that what you're implying?

chris-b1 · 2016-09-28T21:48:23Z

I'm not following you on the element-by-element (this is element by element), but I did have a comparison backwards, should be:

(b > 0) & ((np.iinfo(np.int64).max - b) < a)

Note this won't catch underflows, that check would be:

(b < 0) & ((np.iinfo(np.int64).min - b) > a)

gfyoung · 2016-09-28T21:51:27Z

@chris-b1 : My point is the following: is a and b in your code scalars ONLY? Because I'm almost certain that first check will fail if b is an array.

chris-b1 · 2016-09-28T21:56:09Z

How about you try it? with broadcasting it will work with any combo of arrays/scalars.
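A small demonstration of the broadcasting behaviour, using the `b = np.array([1, -1])` case raised earlier. The variable names are illustrative. Note that for the negative entry, `INT64_MAX - b` itself wraps around inside NumPy, but the `b > 0` guard discards that lane.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max

a = np.array([INT64_MAX, INT64_MAX], dtype=np.int64)
b = np.array([1, -1], dtype=np.int64)

# Array against array: each pair (a[i], b[i]) is tested on its own.
both_arrays = (b > 0) & ((INT64_MAX - b) < a)

# Array against scalar: the scalar is broadcast across every element of a.
scalar_b = (np.int64(1) > 0) & ((INT64_MAX - np.int64(1)) < a)
```

`both_arrays` flags only the first pair, since adding -1 to the maximum cannot overflow upwards, while `scalar_b` flags both elements.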

gfyoung · 2016-09-28T21:58:19Z

Let's put the check in a clearer form:

if (((b > 0) & ((np.iinfo(np.int64).max - b) < a)).any() or
        ((b < 0) & ((np.iinfo(np.int64).min - b) > a)).any()):
    raise OverflowError(...)

Is that what you are saying?
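Put together as a self-contained function, the two-sided check might look like the sketch below. The name `checked_add_arrays` is hypothetical, the error message is a placeholder, and NaT masking is deliberately left out.

```python
import numpy as np


def checked_add_arrays(a, b):
    """Return a + b for int64 arrays, raising instead of wrapping around.

    Illustrative sketch only; it does not mask NaT slots.
    """
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    info = np.iinfo(np.int64)
    # Each guard keeps only the lanes where its bound can actually be crossed.
    overflow = (b > 0) & ((info.max - b) < a)
    underflow = (b < 0) & ((info.min - b) > a)
    if overflow.any() or underflow.any():
        raise OverflowError("int64 addition would overflow")
    return a + b
```

With `a = [1]` and `b = [np.iinfo(np.int64).min]` the sum is representable and no error is raised, whereas adding the minimum to itself trips the underflow branch.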

jorisvandenbossche · 2016-09-28T22:00:38Z

@gfyoung Can you give a reproducible example that does not work?

gfyoung · 2016-09-28T22:05:22Z

@jorisvandenbossche :

>>> import pandas.core.nanops as nanops
>>> import numpy as np
>>>
>>> a = np.array([1])
>>> b = np.array([np.iinfo(np.int64).min])
>>> nanops._checked_add_with_arr(a, b)  # Should not fail.
...
OverflowError: ...

This is one of several ways to break the current implementation.
Also, take a look at the examples I gave earlier.

jorisvandenbossche · 2016-09-28T22:11:03Z

The suggestion of @chris-b1 of above works fine on this example:

In [19]: (b > 0) & ((np.iinfo(np.int64).max - b) < a)
Out[19]: array([False], dtype=bool)

gfyoung · 2016-09-28T22:12:53Z

@jorisvandenbossche : True, but you also have to check for overflow in the opposite direction, i.e. what happens if you add np.iinfo(np.int64).min to itself? The check that you provided will not work by itself.

jorisvandenbossche · 2016-09-28T22:16:04Z

Yes, but you asked about broadcasting / loop element by element. I am just saying that the example code of @chris-b1 works fine for scalars/arrays, not that it is sufficient as a check for all cases. It is indeed true that making the check comprehensive for all cases will include several checks.
And of course that will impact the performance ...

gfyoung · 2016-09-28T22:17:31Z

@jorisvandenbossche : Fair enough. I will put up a PR to separately patch this first.

Expands checked-add array addition introduced in gh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to gh-14453. In addition, move checked add function to core/algorithms.

gfyoung force-pushed the add-overflow branch from 2078ca0 to d295470 Compare September 16, 2016 22:15

gfyoung force-pushed the add-overflow branch from d295470 to 37e8185 Compare September 18, 2016 17:53

gfyoung force-pushed the add-overflow branch 2 times, most recently from 6cf3e96 to ef69eb0 Compare September 20, 2016 04:57

jreback added the Bug and Timedelta labels Sep 20, 2016

gfyoung force-pushed the add-overflow branch from ef69eb0 to cac0f88 Compare September 20, 2016 14:40

gfyoung mentioned this pull request Sep 29, 2016

BUG: Patch Checked Add Method #14324

Closed

gfyoung mentioned this pull request Dec 7, 2016

BUG: Prevent addition overflow with TimedeltaIndex #14816

Merged
