BUG: Fix truediv numexpr error #3764

jtratner · 2013-06-05T22:40:00Z

Fixes a number of items relating to accelerated arithmetic in frame.

__truediv__ had been set up to use numexpr, but this happened by passingthe division operator as the string. numexpr's evaluate only checked 2 frames up, which meant that it picked up the division setting fromframe.py and would do floor/integer division when both inputs were integers.You'd only see that issue with a dataframe large enough to trigger numexpr evaluation (>10000 cells)

This adds test cases to test_expression.py that exhibit this failure under thePython2.7 full deps test. The testcases only test Series and DataFrame (though it looks like neitherSeries nor Panel use numexpr). It doesn't fail under Python 3 because integer division is totally gone.

Now evaluate, _evaluate_standard and _evaluate_numexpr all accept extra keyword arguments, which are passed to numexpr.evaluate.

The test case is currently a separate commit that fails. I wasn't sure whether I should have combined it with the bugfix commit or not. Happyto change it if that's more appropriate.

jreback · 2013-06-05T23:05:56Z

didn't realized ne could do truediv...good to know

fyi....easy to mod panel to use ne, series should be straightforward as well....

going to add an issue about that

jreback · 2013-06-05T23:07:00Z

see #3765

jreback · 2013-06-05T23:52:36Z

FYI, i just pushed this: c278ca6, because of the original failing test

numexpr is only installed on 1 travis build (the 3.2), BUT it is under numpy 1.6.1 which causes weirdness....hmmm

jtratner · 2013-06-05T23:57:04Z

@jreback so why did it fail on 2.7? that test only runs if numexpr is installed. Also, probably a good idea to enable numexpr for one of the 2.x builds, right?

jreback · 2013-06-06T00:04:38Z

it failed on 2.7, weird (failed for 3.2 for me).....

...just testing enabling numexpr for all builds

jtratner · 2013-06-06T00:05:37Z

@jreback check out the travis ci link: https://travis-ci.org/jtratner/pandas/builds/7822069 .

jtratner · 2013-06-06T00:06:42Z

(that link is for a different build, obviously). Just for clarity - what needs to happen to incorporate this fix? Or are you saying this still fails in 3.2?

jreback · 2013-06-06T00:07:19Z

that s your test though (numexpr IS enabled on that build)...look at the print_versions at the very bottom (you have to click it)

jreback · 2013-06-06T00:08:27Z

I can't believe this never failed before...really weird https://www.travis-ci.org/jreback/pandas/jobs/7824130
fixed now though

jtratner · 2013-06-06T00:13:21Z

@jreback fixed the test to not use div. obviously test_expressions wasn't running in python 3.

jreback · 2013-06-06T00:15:02Z

yep.....numexpr was only recently turned on....since we are now using it practially everythwere going to turn it on

jtratner · 2013-06-06T00:22:16Z

@jreback so yeah, I'm happy to help out. probably can't get to putting this into Series and Panel today, but hopefully by the end of the week. What's your position on putting in new tests with fixes? Should I have put the test together with the fix / after the fix? Just wondering so I have a sense of how to contribute well to this project.

jreback · 2013-06-06T00:30:38Z

I usually write the test first, make sure it breaks, then put the fix in
(it could be the same commit or not)...

if I get too many commits or I am 'trying' to fix something...then I just squash them at the end...

e.g. the HDFStore py3k this was like 20 commits...squashed back into 5....had to try a bunch of things!

jtratner · 2013-06-06T00:35:00Z

@jreback is there any way to confirm that numexpr was triggered? I added some print statements for my own testing, but I'm unclear whether test_expressions.py, as it stands now, would fail if the actual numexpr call wasn't working. In other words, it would stil be getting the right numerical result, but it wouldn't use the speedup from numexpr.

jreback · 2013-06-06T00:37:16Z

put in some temporary print statements around the actual call to ne.evaluate (I think I may have had them at one time, but took them out)

jreback · 2013-06-06T00:38:50Z

FYI...you might be interested in the long discussion about #3393, ne going to get even heavier use going forward

jtratner · 2013-06-06T00:55:41Z

@jreback just to be clear, what I mean is that if you set pandas.core.expressions to only use _evaluate_standard and _where_standard, none of the tests in test_expressions.py fail. Similarly, if you add a ValueError in the middle of _evaluate_numexpr that causes it to always fall back to _evaluate_standard, all the tests pass. Any ideas on how to enforce it using numexpr (or maybe that would come out with performance tests?) Just wondering if you think this is a problem.

def _evaluate_numexpr(op, op_str, a, b, raise_on_error = False, **eval_kwargs):
    result = None

    if _can_use_numexpr(op, op_str, a, b, 'evaluate'):
        try:
            a_value, b_value = a, b
            if hasattr(a_value,'values'):
                a_value = a_value.values
            if hasattr(b_value,'values'):
                b_value = b_value.values
            # This causes fallback to _evaluate_standard
            raise ValueError("unknown type object")
            result = ne.evaluate('a_value %s b_value' % op_str, 
                                 local_dict={ 'a_value' : a_value, 
                                              'b_value' : b_value }, 
                                 casting='safe', **eval_kwargs)
        except (ValueError), detail:
            if 'unknown type object' in str(detail):
                pass
        except (Exception), detail:
            if raise_on_error:
                raise TypeError(str(detail))

    if result is None:
        result = _evaluate_standard(op,op_str,a,b,raise_on_error)

    return result

jreback · 2013-06-06T01:02:36Z

@jtratner

2 different issues here

numexpr is technically optional, so we always need to be able to fall back if numexpr is not installed, also fallback if its not efficient to use it, or the dtypes are wrong
tests, there are the set_use_numexpr to explicity make it use, this is for testing (in a similar vein, the set_numexpr_threads is for perf testing (see the vb_suite)) - set_use_numexpr will explicity set/reset _evaluate to point to the right ones (which can actually STILL fallback if there is an exception in numexpr, always have to guard against that)

answer your question?

jtratner · 2013-06-06T01:13:29Z

Thanks for digging into this with me - I appreciate you taking the time to
explain!

I get what you're saying and it makes sense. However, test_expressions
doesn't even get run unless numexpr is installed, so you can assert
within that module that everything ought to be working correctly. What I'm
wondering is how you test for the specific case where numexpr is
installed but panda's way of calling it fails. _evalute_numexpr turns
everything into a TypeError which gets caught and handled by
_arith_method, so I'm not sure how you check for that to make sure pandas
is integrating correctly with numexpr.

On Wed, Jun 5, 2013 at 9:02 PM, jreback [email protected] wrote:

@jtratner https://github.com/jtratner

2 different issues here

numexpr is technically optional, so we always need to be able to
fall back if numexpr is not installed, also fallback if its not efficient
to use it, or the dtypes are wrong

tests, there are the set_use_numexpr to explicity make it use, this is
for testing (in a similar vein, the set_numexpr_threads is for perf
testing (see the vb_suite)) - set_use_numexpr will explicity set/reset
_evaluate to point to the right ones (which can actually STILL fallback
if there is an exception in numexpr, always have to guard against that)

answer your question?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/3764#issuecomment-19019172
.

jreback · 2013-06-06T01:41:22Z

numexpr rarely raises itself (mainly on a type issue, which we catch), or the catchall (which Exception), I think if some other kind of error, don't remember. We only bail on the catchall

otherwise it is computed as normal (which can itself raise an exception, but only a weird case I think)

so its not tested per se that calling from a data frame works (and uses numexpr when it is installed).

I suppose you could add some functions to expressions which would set/unset a module level flag for the last computation done (and maybe status info, would have to sort of 'record' which path actually is taken)

that way you could do something like this:

from pandas.core import expressions as e
e.set_test_mode(T)
result = big_df + 1
test_result = e.get_test_results()

and get_test_mode can return (and invalid the module level data which set_test_mode sets)

and test_result could be say:

test_result = dict(computation = ...., result = ...., path = ....)

whatever

jtratner · 2013-06-06T02:11:54Z

Sounds like it's not really worth the time/overhead to check though.

jreback · 2013-06-06T02:14:04Z

what do u mean?

jtratner · 2013-06-06T02:20:53Z

Well, as in: would it slow things down much to check for a global variable each time? do you think it's actually worth checking for?

jreback · 2013-06-06T02:24:14Z

this was just for testing

you would have a single check inside evaluate that would set the test result

you could actually set the status each time but you don't want to store the result at all because of he memory refs

easy enough to store the passed string and and the status

jtratner · 2013-06-08T21:56:21Z

I want to leave the test to check that numexpr is actually used to be included with the #3765 , since that will probably turn into a more general check that has to happen for all expressions tests. Anything else I need to do before this can be merged? (Hopefully this gets into 0.11.1, since it reflects a numerical bug, rather than performance bug).

jtratner · 2013-06-09T02:34:32Z

Update on this: everything passes now and I added all the standard arithmetic operators to the test cases. This exposed a regression where __mod__ was given a str_rep of * instead of %. Unfortunately, I kept getting floating point exceptions when I ran the tests for __mod__ accelerated by numexpr, so I just removed str_rep so it will use numpy's operator.mod instead. (I think this was added on 5/14, so doesn't need to go in RELEASE.rst)

Accelerating // caused an inconsistency too (though I don't think it was actually set up) where, when dividing by zero, numpy returns 0.0000 whereas numexpr evaluates to inf. Not sure which is correct, so it's also not included.

In summary, this pull request adds test cases for expressions.py for all major operators and tests Series (which isn't currently accelerated) and Frame with >10,000 cells. It also fixes __truediv__ to actually do truedivision when evaluated under numexpr (previously was doing integer/floor division) and fixes __mod__ by removing the incorrect str_rep being passed to numexpr.

jreback · 2013-06-09T02:49:58Z

@jtratner see here: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html (the module/div section). I believe numpy is wrong in returning 0.0000, but numexpr is correct in return np.inf; that's what I 'fixed' pandas to do (float division is correct on both numpy/ne for all cases), but for some reason numpy does weird things with ints....so just make its consistent....

nice job on this...

jtratner · 2013-06-09T02:57:22Z

Thanks!

On mod/floordiv - can you show me where you implemented that fix? is that
with the com._fill_zeros?

jreback · 2013-06-09T03:04:14Z

exactly

I can be reached on my cell 917-971-6387

On Jun 8, 2013, at 10:57 PM, Jeff Tratner [email protected] wrote:

Thanks!

On mod/floordiv - can you show me where you implemented that fix? is that
with the com._fill_zeros?

On Sat, Jun 8, 2013 at 10:50 PM, jreback [email protected] wrote:

@jtratner https://github.com/jtratner see here:
http://pandas.pydata.org/pandas-docs/dev/whatsnew.html (the module/div
section). I believe numpy is wrong in returning 0.0000, but numexpr is
correct in return np.inf; that's what I 'fixed' pandas to do (float
division is correct on both numpy/ne for all cases), but for some reason
numpy does weird things with ints....so just make its consistent....

nice job on this...

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/3764#issuecomment-19159796
.

—
Reply to this email directly or view it on GitHub.

jtratner · 2013-06-09T03:34:21Z

I think I figured out the issue...didn't include the explicit requirement to fill zeros on all...will fix soon.

jtratner · 2013-06-09T03:35:29Z

Also, any reason why __mod__ gets filled with nan?

    __mod__ = _arith_method(operator.mod, '__mod__', default_axis=None, fill_zeros=np.nan)

jreback · 2013-06-09T03:45:48Z

I was consistent with how numpy treats floats
so mod was filled with nan, div with inf
(as you can see the integers are screwed up, that's
the reason we did it)

In [10]: np.array([1.0,2.0,3.0]) % 0
Out[10]: array([ nan,  nan,  nan])

In [11]: np.array([1,2,3]) % 0
Out[11]: array([0, 0, 0])

jtratner · 2013-06-09T04:13:45Z

Okay, this should resolve everything. I have a separate raft of commits that adds truediv and floordiv to all the containers as well as fixes up fill_zeros (which is still not consistent here). If it's okay with you, I'd rather handle all of those separately, because they make a logical group (and also because the multiple changes makes it harder to cherry-pick frame fixes to here).

The key thing was to move the com._fill_zeros call outside of the try/catch because it needed to apply to everything even when numexpr didn't evaluate it to be fully consistent.

jreback · 2013-06-09T04:19:02Z

ok by me

jreback · 2013-06-09T16:15:41Z

@jtratner this pr doesn't actually change the default to truediv right? just makes ne do it if its already enabled

otherwise will need a big API warning

jtratner · 2013-06-09T18:05:58Z

No, div explicitly only does floor division (which is what it was doing before this change, since the lambda x, y: x / y was always executed in the non-from __future__ import division environment). I also didn't change the Python 3 version (where div does true division). It doesn't change the behavior from before (only fixes the __truediv__ and __mod__ methods to not be wrong).

From Python 2 --> Python 3 the behavior of div changes, but that was already the case beforehand.

jtratner · 2013-06-13T21:07:20Z

This is failing because of issue in #3814 (I think). But I do think it should be incorporated for 0.11.1, given that it fixes a bug (__truediv__ actually doing floor division with two integer dataframes using numexpr). Doesn't change any behavior (except fixing the broken behavior) nor does it change public API of anything.

Only add 'div' if not python 3 Also checks dtype in test cases for series

by passing truediv=True to evaluate for __truediv__ and truediv=False for __div__.

`%` (modulus) intermittently causes floating point exceptions when used with numexpr. `//` (floordiv) is inconsistent between unaccelerated and accelerated when dividing by zero (unaccelerated produces 0.0000, accelerated produces `inf`)

jtratner · 2013-06-18T04:07:13Z

@jreback okay, this is definitely ready to go. It doesn't change anything API-wise, except for fixing a bug with using true division with integer arrays of greater than 10,000 cells with numexpr installed. (And we know it's a bug because it means that we'd end up with different output than numpy arrays, etc.)

BUG: Fix __truediv__ numexpr error

jreback · 2013-06-18T13:01:34Z

thanks @jtratner your contribs are great!

keep em comin!

jtratner · 2013-06-18T13:13:14Z

thanks :)

On Tue, Jun 18, 2013 at 9:01 AM, jreback [email protected] wrote:

thanks @jtratner https://github.com/jtratner your contribs are great!

keep em comin!

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/3764#issuecomment-19609398
.

nipunbatra mentioned this pull request Jun 5, 2013

read_csv epoch parsing with pd.to_datetime() #3757

Closed

jtratner mentioned this pull request Jun 15, 2013

PERF: have series/panel arithmetic operators use expressions (numexpr) #3765

Closed

jtratner added 5 commits June 17, 2013 22:17

TST: Add numexpr arithmetic tests.

273946a

Only add 'div' if not python 3 Also checks dtype in test cases for series

ENH: Allow evaluate to pass kwargs to numexpr

ae8b7ab

BUG: Fix __truediv__ numexpr error

5d69a4d

by passing truediv=True to evaluate for __truediv__ and truediv=False for __div__.

TST: Make test_expressions cover more arithmetic operators

0e7781c

jreback added a commit that referenced this pull request Jun 18, 2013

Merge pull request #3764 from jtratner/fix_division_with_numexpr

da14c6e

BUG: Fix __truediv__ numexpr error

jreback merged commit da14c6e into pandas-dev:master Jun 18, 2013

jtratner deleted the fix_division_with_numexpr branch September 21, 2013 20:12

BUG: Fix __truediv__ numexpr error #3764

BUG: Fix __truediv__ numexpr error #3764

Conversation

jtratner commented Jun 5, 2013

jreback commented Jun 5, 2013

jreback commented Jun 5, 2013

jreback commented Jun 5, 2013

jtratner commented Jun 5, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 6, 2013

jreback commented Jun 6, 2013

jtratner commented Jun 8, 2013

jtratner commented Jun 9, 2013

jreback commented Jun 9, 2013

jtratner commented Jun 9, 2013

jreback commented Jun 9, 2013

jtratner commented Jun 9, 2013

jtratner commented Jun 9, 2013

jreback commented Jun 9, 2013

jtratner commented Jun 9, 2013

jreback commented Jun 9, 2013

jreback commented Jun 9, 2013

jtratner commented Jun 9, 2013

jtratner commented Jun 13, 2013

jtratner commented Jun 18, 2013

jreback commented Jun 18, 2013

jtratner commented Jun 18, 2013

BUG: Fix truediv numexpr error #3764

BUG: Fix truediv numexpr error #3764