
BUG: Patch Checked Add Method #14324

Closed
wants to merge 12 commits into from

Conversation

gfyoung
Member

@gfyoung gfyoung commented Sep 29, 2016

Title is self-explanatory. Follow-up to #14237.

Merged in PR #14453

@codecov-io

codecov-io commented Sep 29, 2016

Current coverage is 85.25% (diff: 90.47%)

Merging #14324 into master will decrease coverage by 0.01%

@@             master     #14324   diff @@
==========================================
  Files           140        140          
  Lines         50634      50645    +11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43173      43175     +2   
- Misses         7461       7470     +9   
  Partials          0          0          

Powered by Codecov. Last update d98e982...85ca49d

@jorisvandenbossche jorisvandenbossche added Bug Timedelta Timedelta data type labels Sep 29, 2016
@jorisvandenbossche
Member

Although this is probably more correct, I don't think the performance degradation is allowable:

# master

In [1]: arr = np.arange(1000000)
   ...: td = pd.to_timedelta(arr)
   ...: ts = pd.Timestamp('2000')
   ...: 

In [2]: %timeit td + ts
100 loops, best of 3: 2.51 ms per loop
# PR

In [2]: %timeit td + ts
10 loops, best of 3: 41.1 ms per loop

The problem is in np.iinfo(np.int64).min - b > arr when b is positive (and probably in the other check when b is negative)
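For context, the overflow hazard described above can be avoided for a scalar addend by rearranging each comparison so the subtraction itself cannot overflow. A minimal standalone sketch of the idea (the helper name is made up; this is not the PR's actual code):

```python
import numpy as np

def would_overflow(arr, b):
    """Return True if arr + b would overflow int64 for any element.

    Sketch only: b is assumed to be a scalar. The comparisons are
    arranged so that imax - b (for b > 0) and imin - b (for b < 0)
    are always representable, unlike imin - b for positive b.
    """
    imax = np.iinfo(np.int64).max
    imin = np.iinfo(np.int64).min
    if b > 0:
        # arr + b > imax  <=>  arr > imax - b  (imax - b cannot underflow)
        return bool((arr > imax - b).any())
    elif b < 0:
        # arr + b < imin  <=>  arr < imin - b  (imin - b cannot overflow)
        return bool((arr < imin - b).any())
    return False
```

For example, would_overflow(np.arange(10, dtype=np.int64), 1) is False, while a value near np.iinfo(np.int64).max plus a positive b trips the check.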

@gfyoung
Member Author

gfyoung commented Sep 29, 2016

@jorisvandenbossche : I draw a hard line on correctness. This is impeding my ability to patch the other TDI overflow issues because there are tests that fail because of this bug. Perhaps it would instead be worthwhile to break the check up into two separate statements, though I'm not sure that would help much given that it's Python.

@jreback
Contributor

jreback commented Sep 30, 2016

try this. The key to this expression is lazy evaluation, as it's a compound expression. numexpr does this for free.

In [5]: arr = np.arange(1000000)
   ...: td = pd.to_timedelta(arr)
   ...: ts = pd.Timestamp('2000')
   ...: a = td.view('i8')
   ...: b = ts.value
   ...: 

In [7]: nmax = np.iinfo(np.int64).max 

In [8]: nmin = np.iinfo(np.int64).min

In [11]: ((b > 0) & (np.iinfo(np.int64).max - b < arr)).any() or ((b < 0) & (np.iinfo(np.int64).min - b > arr)).any()
Out[11]: False

In [12]: %timeit ((b > 0) & (np.iinfo(np.int64).max - b < arr)).any() or ((b < 0) & (np.iinfo(np.int64).min - b > arr)).any()
10 loops, best of 3: 57.7 ms per loop

In [14]: %timeit pd.eval('((b > 0) & (nmax - b < arr)) | ((b < 0) & (nmin - b > arr))').any()
100 loops, best of 3: 6.24 ms per loop

In [15]: pd.eval('((b > 0) & (nmax - b < arr)) | ((b < 0) & (nmin - b > arr))').any()
Out[15]: False

@jreback
Contributor

jreback commented Sep 30, 2016

numba can do even better (but we don't have this in the quiver atm)

In [17]: def f(a,b,arr,nmax,nmin):
    ...:     return ((b > 0) & (nmax - b < arr)) | ((b < 0) & (nmin - b > arr))

In [18]: f = numba.jit(f,nopython=True)

In [19]: %timeit f(a,b,arr,nmax,nmin).any()
The slowest run took 59.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.28 ms per loop

In [20]: f(a,b,arr,nmax,nmin).any()
Out[20]: False

@jorisvandenbossche
Member

Something like this is quite convoluted, but is a bit faster than the eval approach:

In [62]: %%timeit
    ...: b2 = np.broadcast_to(b, arr.shape)
    ...: mask1 = b2 > 0
    ...: mask2 = b2 < 0
    ...: if not mask1.any():
    ...:     to_raise = (np.iinfo(np.int64).min - b2 > arr).any()
    ...: elif not mask2.any():
    ...:     to_raise = (np.iinfo(np.int64).max - b2 < arr).any()
    ...: else:
    ...:     to_raise = (np.iinfo(np.int64).max - b2[mask1] < arr[mask1]).any() or (np.iinfo(np.int64).min - b2[mask2] > arr[mask2]).any()
    ...: 
100 loops, best of 3: 4.2 ms per loop
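The same branching idea, wrapped as a standalone function for clarity (a sketch with a hypothetical name, not the PR's actual code; it generalizes to an array-valued b, and the negative-overflow branch indexes with mask2 rather than mask1):

```python
import numpy as np

def checked_add_overflows(arr, b):
    """Mask-based int64 overflow check for arr + b (illustrative sketch)."""
    imax = np.iinfo(np.int64).max
    imin = np.iinfo(np.int64).min
    b2 = np.broadcast_to(b, arr.shape)  # requires numpy >= 1.10
    mask1 = b2 > 0  # positions where overflow toward +inf is possible
    mask2 = b2 < 0  # positions where overflow toward -inf is possible
    to_raise = False
    if mask1.any():
        to_raise = bool((imax - b2[mask1] < arr[mask1]).any())
    if not to_raise and mask2.any():
        to_raise = bool((imin - b2[mask2] > arr[mask2]).any())
    return to_raise
```

With a scalar b the broadcast is trivial and only one of the two branches ever runs, which is where the speedup over the unconditional compound expression comes from.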

@gfyoung gfyoung force-pushed the add-overflow-patch branch 2 times, most recently from 9d31ec7 to ad4cd85 Compare October 2, 2016 20:19
@gfyoung
Member Author

gfyoung commented Oct 4, 2016

@jorisvandenbossche : Are you proposing I use this more convoluted method then? Thoughts?

@gfyoung gfyoung force-pushed the add-overflow-patch branch 2 times, most recently from c395af3 to 760b1bb Compare October 7, 2016 21:21
@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jorisvandenbossche : Some feedback on my question above would be greatly appreciated.

@jreback
Contributor

jreback commented Oct 9, 2016

@gfyoung yes pls use one of the methods proposed above

@gfyoung gfyoung force-pushed the add-overflow-patch branch 3 times, most recently from d0563a7 to cf4da12 Compare October 9, 2016 20:25
@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jorisvandenbossche : Your solution is unfortunately not backwards compatible with older versions of numpy (e.g. 1.7.1) that don't have np.broadcast_to. Let me see if np.broadcast_arrays could work. If not, I'll have to go with @jreback's suggestion.

@jreback
Contributor

jreback commented Oct 9, 2016

so put a conditional on the numpy version (or use the eval soln)

@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jreback : I thought we don't do those kinds of checks (or try to avoid them). In any case, I'm trying another method, as I mentioned above. IMO this isn't something THAT important that we need to check numpy versions.

@jreback
Contributor

jreback commented Oct 9, 2016

i don't see a reason not to do a check on the numpy version

or use the eval soln (which is fine but will fall back to numpy if numexpr is not installed)

@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jreback : Fair enough. Also, I have no objections to your eval solution. I'm trying it right now, as np.broadcast_arrays (the alternative to np.broadcast_to) doesn't fit the bill.

@gfyoung gfyoung force-pushed the add-overflow-patch branch from cf4da12 to 08b6395 Compare October 9, 2016 21:30
@jorisvandenbossche
Member

The eval solution will be very slow if you don't have numexpr installed, so I would prefer not to use it.

@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jorisvandenbossche : Unfortunately, your solution is not backwards compatible. And if you don't like @jreback's suggestion, then what do you suggest? Compared to my original change, the eval solution would at least be an improvement.

@jreback
Contributor

jreback commented Oct 9, 2016

just check the numpy version

i am not sure why you think this is an issue

@gfyoung
Member Author

gfyoung commented Oct 9, 2016

@jreback : And then what, use the slower solution that I used before if the version is not compatible? Would it be preferable to fallback to my original solution or use the eval one you used?

@jreback
Contributor

jreback commented Oct 9, 2016

yep
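The version-gated fallback being agreed on here can be sketched as a small dispatch helper (an illustrative sketch with made-up names, not the PR's actual code; pandas-era code would use a LooseVersion-style comparison):

```python
import numpy as np

def overflow_check_impl():
    """Choose an overflow-check strategy based on the numpy version.

    Illustrative sketch: parse the first two numpy version components
    and gate on 1.10, the release that introduced np.broadcast_to.
    """
    major, minor = (int(x) for x in np.__version__.split('.')[:2])
    if (major, minor) >= (1, 10):
        # np.broadcast_to is available: use the mask-based approach
        return 'broadcast_to'
    # older numpy: fall back to the pd.eval expression, which is fast
    # with numexpr installed and degrades to plain numpy without it
    return 'eval'
```

The gate only runs once at import time in practice, so the version check itself adds no per-operation overhead.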

…projects (pandas-dev#14406)

As per [their blog post of the 27th April](https://blog.readthedocs.com/securing-subdomains/) ‘Securing subdomains’:

> Starting today, Read the Docs will start hosting projects from subdomains on the domain readthedocs.io, instead of on readthedocs.org. This change addresses some security concerns around site cookies while hosting user generated data on the same domain as our dashboard.

Test Plan: Manually visited all the links I’ve modified.
@gfyoung gfyoung force-pushed the add-overflow-patch branch from 896c9f1 to ba515fe Compare October 13, 2016 04:39
@jorisvandenbossche
Member

This looks good to me.
@gfyoung final question: can you run the timedelta benchmark (test_add_td_ts) and compare this to 0.18 as well?

@jreback jreback added this to the 0.19.1 milestone Oct 13, 2016
@jreback
Contributor

jreback commented Oct 13, 2016

lgtm

@gfyoung
Member Author

gfyoung commented Oct 13, 2016

@jorisvandenbossche : Using the same setup as above, 100 loops, best of 3: 3.08 ms per loop

@jorisvandenbossche
Member

And compared to 0.18? (on my machine I get 1.15 ms, so probably something of a 3x slowdown)

@gfyoung gfyoung force-pushed the add-overflow-patch branch from ba515fe to ee1653b Compare October 13, 2016 14:16
@gfyoung
Member Author

gfyoung commented Oct 13, 2016

@jorisvandenbossche : That timing information is for v0.18.0.

@jorisvandenbossche
Member

@gfyoung That is a bit strange, because then you have 4.06 ms for this PR and 3.08 ms for 0.18. I get 5.11 vs 1.11, so a much larger difference.

Anyway, it's a rather large performance degradation, but we decided to go down this path.

@gfyoung gfyoung force-pushed the add-overflow-patch branch from ee1653b to de31e6a Compare October 14, 2016 17:17
Josh Owen and others added 5 commits October 15, 2016 09:37
…x (GH14327) (pandas-dev#14428)

Existing logic under "if level is not None:" assumed that index was a MultiIndex.
Now we check and also handle the case where an Index is passed in with a None grouper.
This resolves GH 14327
Deprecated back in `v0.17.0` <a href="https://github.com/pydata/pandas/commit/987b7e7e586b8df1d127406c69e0a9094a1a5322">here</a>. Seems appropriate to carry through now.

Author: gfyoung <[email protected]>

Closes pandas-dev#13602 from gfyoung/remove-coerce-to-timedelta and squashes the following commits:

091af83 [gfyoung] CLN: Removed coerce param in pd.to_timedelta
closes pandas-dev#14369

Author: Brandon M. Burroughs <[email protected]>

Closes pandas-dev#14416 from brandonmburroughs/concat_with_axis_rows and squashes the following commits:

9aa260d [Brandon M. Burroughs] Changing axis handling to depend on object passed
49442be [Brandon M. Burroughs] Using dataframe _get_axis_number instance method
64702fb [Brandon M. Burroughs] Updating documentation for concat
fdd5260 [Brandon M. Burroughs] Removing duplicate expected dfs
3f08b07 [Brandon M. Burroughs] Adding concat tests for axis 0, 1, and 'index'
cf3f998 [Brandon M. Burroughs] Adding ValueError test for concat Series axis 'columns'
a6694b9 [Brandon M. Burroughs] Updating documentation
584ebd2 [Brandon M. Burroughs] BUG: Allow concat to take string axis names
@gfyoung gfyoung force-pushed the add-overflow-patch branch from de31e6a to 94076aa Compare October 15, 2016 20:00
jreback and others added 2 commits October 15, 2016 16:02
1) Add checks to ensure that add overflow does not
   occur in either the positive or negative direction.
2) Add benchmarks to ensure that operations involving
   this checked add function are not significantly impacted.
@gfyoung gfyoung force-pushed the add-overflow-patch branch from 94076aa to 85ca49d Compare October 16, 2016 19:30
@jreback
Contributor

jreback commented Oct 19, 2016

@gfyoung you may need to close this and open a new PR. It seems that people are unable to rebase and push correctly to the original pandas repo (further, appveyor was disconnected), so this might be a good test.

@gfyoung
Member Author

gfyoung commented Oct 19, 2016

@jreback : Ah, I can see that now. Will close and repost.

@gfyoung gfyoung closed this Oct 19, 2016
@gfyoung gfyoung deleted the add-overflow-patch branch October 19, 2016 16:12
@jorisvandenbossche
Member

@jreback BTW, I contacted github about it, and it should be solved now (was an error on their side). So the closing/open new PR should no longer be needed for all older PRs.

@jreback
Contributor

jreback commented Oct 19, 2016

@jorisvandenbossche good to know! thanks!

@gfyoung
Member Author

gfyoung commented Oct 19, 2016

A little too late now but oh well. 😄 New PR in #14453.

9 participants