ENH: eval function #3393

Closed
jreback opened this issue Apr 18, 2013 · 298 comments · Fixed by #4162 or #4164
Labels
Enhancement Ideas Long-Term Enhancement Discussions Performance Memory or execution speed performance
Comments

@jreback
Contributor

jreback commented Apr 18, 2013

Provide a top-level eval function, something like:

pd.eval(function_or_string,method=None, **kwargs)

to support things like:

  1. out-of-core computation (locally) (see ENH: create out-of-core processing module #3202)

  2. string evaluation which does not use python lazy evaluation
    (so pandas can process efficiently)

pd.eval('df + df2',method='numexpr') (or maybe default to numexpr)

see also:
http://stackoverflow.com/questions/16527491/python-perform-operation-in-string

  3. possible out-of-pandas evaluation
    pd.eval('df + df2',method='picloud')
    http://www.picloud.com/ (though they seem to have really old versions of pandas),
    but I think they handle it anyhow
@hayd
Contributor

hayd commented May 19, 2013

To follow on from your comment, shouldn't we be using & and |? I think this may also have the benefit of all and any just working.

Also ~ for not/invert (since would make it the same as numpy).

I haven't got my head around numexpr yet, so I may be talking complete nonsense. (I've moved Term to expressions without breaking things, and changed the repr to eval back to itself; was there a reason for it not to?)

@jreback
Contributor Author

jreback commented May 19, 2013

I agree about the operators (though I think you actually need to accept both), these are always going to be in a string expression in any event....because you need delayed evaluation
but since we actually DO want the & etc...you can just replace them (e.g. this is really a user interface issue), we are not actually going to evaluate them

e.g.

df[(df > 0) and (df < 10)]

vs

df['(df > 0) and (df < 10)']

@cpcloud
Member

cpcloud commented May 24, 2013

i'm sure everyone involved in this thread knows this but just wanted to point out that the precedence of and and & is different. if i was a first time user i would think that df > 0 and df < 10 and df > 0 & df < 10 do the same thing, so if both are going to be supported i think precedence rules should be kept as close to python as possible meaning parens are required for & but not for and.
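The precedence difference described above can be checked with the standard-library ast module. This is a small illustrative sketch, not code from the thread; the helper name parse_expr is made up for the example.

```python
import ast


def parse_expr(source):
    """Return the top-level AST node of a single Python expression."""
    return ast.parse(source, mode="eval").body


with_and = parse_expr("df > 0 and df < 10")
with_amp = parse_expr("df > 0 & df < 10")

# `and` binds more loosely than comparisons, so it joins two complete
# comparisons: (df > 0) and (df < 10).
print(type(with_and).__name__, len(with_and.values))

# `&` binds more tightly than comparisons, so the same text becomes the
# chained comparison  df > (0 & df) < 10.
print(type(with_amp).__name__, ast.dump(with_amp.comparators[0]))
```

So without parentheses the two spellings produce different trees, which is why `&` needs the parens and `and` does not.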

@jreback
Contributor Author

jreback commented May 24, 2013

@cpcloud this is in a string context, so in theory you can make them the same (this is a big confusion in operating in pandas, I think; people expect them to be the same, even though they are wrong)
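Because the expression arrives as a string, one way to "make them the same" is to rewrite the parsed tree before evaluating it. The sketch below assumes the standard-library ast module is used for parsing; the class and function names are hypothetical and nothing here is taken from the eventual pandas implementation.

```python
import ast


class BoolOpsToBitwise(ast.NodeTransformer):
    """Rewrite `and`/`or`/`not` into `&`/`|`/`~` (hypothetical helper)."""

    def visit_BoolOp(self, node):
        self.generic_visit(node)  # rewrite nested boolean ops first
        bit_op = ast.BitAnd if isinstance(node.op, ast.And) else ast.BitOr
        combined = node.values[0]
        for value in node.values[1:]:
            combined = ast.BinOp(left=combined, op=bit_op(), right=value)
        return combined

    def visit_UnaryOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Not):
            # Meant for array-like operands, where `~` is an elementwise NOT.
            return ast.UnaryOp(op=ast.Invert(), operand=node.operand)
        return node


def to_bitwise(source):
    """Parse `source`, swap the boolean keywords, and return the new tree."""
    tree = ast.parse(source, mode="eval")
    return ast.fix_missing_locations(BoolOpsToBitwise().visit(tree))


tree = to_bitwise("df > 0 and df < 10")
print(ast.unparse(tree))  # ast.unparse needs Python 3.9+
```

Since the comparisons are already separate nodes when the rewrite happens, the user's `and` keeps its loose precedence and no extra parentheses are needed in the input string.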

@cpcloud
Member

cpcloud commented May 24, 2013

@jreback sure. i was just semi-thinking-out-loud here, thought that it might warrant a discussion. this goes back to python core devs not wanting to provide the ability to overload not and and or so numpy was forced to overload bitwise operators for boolean operations (there's a youtube video of a discussion about this with GVR, there's even a patch to core python that allows you to do this). i really wish that pep went through sigh. i didn't realize there was a big confusion here, since this really has nothing to do with pandas, it's a language feature/bug. i was just thinking that adding more parsing rules to remember is annoying to users.

@jreback
Contributor Author

jreback commented May 24, 2013

it's a valid point

the purpose of eval is to facilitate multi expression parsing that we will evaluate in Numexpr
so we have to have a string input (to avoid python evaluation of the sub expressions)
or maybe there is a way to disable this (like how %timeit works in ipython)
but i think they r using an input hook and hence everything really is a string

@cpcloud
Member

cpcloud commented May 24, 2013

@jreback u can do it with the cmd module too. i think ipython used to use that or maybe they still do. i think only macros would allow you to do this without string literals. btw there is now a Python macros library. i haven't tried it out but it looks like fun. another possibility is to support numba as a method although first things first (numexpr). do u already have something going for this?

@jreback
Contributor Author

jreback commented May 24, 2013

@hayd said he was giving a stab
Andy can u post a link to your branch?

@jreback
Contributor Author

jreback commented May 24, 2013

@cpcloud numba is interesting, but the infrastructure requirement is high, and in any event, it's basically using numexpr under the hood :) (as well as ctable for storage)

@jreback
Contributor Author

jreback commented May 24, 2013

@cpcloud I reread your question

the issue is this: df[(df>0) & (df<10)] is evaluated as 3 separate sub-expressions, plus a boolean mask selection

while

df.eval('(df>0) & (df<10)') can be evaluated (after alignment) in a single numexpr pass (and then a boolean mask) to return the dataframe, so can be a massive speedup

that's the main reason for this function
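For comparison, here is how the two forms look with pd.eval as it exists in released pandas, which postdates this comment. The df.eval('(df>0) & (df<10)') spelling above is the proposal, not a released API. Whether the string form really runs in a single pass depends on the engine pandas picks (numexpr if installed), so treat this only as a sketch of the calling convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-6, 18).reshape(6, 4), columns=list("abcd"))

# Plain Python: three intermediate results (df > 0, df < 10, and their `&`).
mask_python = (df > 0) & (df < 10)

# String form: the whole expression is handed to pd.eval at once.
# local_dict is passed explicitly so the lookup does not depend on the
# calling frame.
mask_eval = pd.eval("(df > 0) & (df < 10)", local_dict={"df": df})

# Boolean-mask selection; cells where the mask is False become NaN.
selected = df[mask_eval]
print(selected)
```

As noted later in the thread, the frame has to be fairly large before the difference is measurable.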

@cpcloud
Member

cpcloud commented May 24, 2013

@jreback that is pretty cool. i haven't done much with numexpr, i assumed that pandas uses it when it can...is that a fallacious assumption? should i be explicitly using numexpr?

@jreback
Contributor Author

jreback commented May 24, 2013

it's used in pytables for query processing
and in most evaluations now as of 0.11
(you need a fairly big frame for it to matter)

see the core/expressions module

@hayd
Contributor

hayd commented May 24, 2013

I haven't done much so far, I've moved Term to expressions and added some helper functions for that class, nor have I really looked into numexpr yet.

I kind of lost my way on the road map... and may be totally confused atm.

Am I way off here?

1. move term to expression
2. create class for "termset" (not sure what name, I was thinking this would be a list (possibly of termsets) with a flag whether it was all/any).
3. work out how to process termset strings with numexpr (is this the tricky part?)
4. create method for "termset" to strings which can be processed by numexpr e.g.
5. create parser for our DSL to termset e.g. '(df>0) & (df<10)' -> [Term(df, '>', 0), Term(df, '<', 10)]

@jreback
Contributor Author

jreback commented May 24, 2013

so there are 3 goals here:

  1. parser to turn:
 'df[(df>0)&(df<10)]'

into this (call this the parsed_expr)

df[_And(Term('df','>',0),Term('df','<','10'))]

  2. take a parsed_expr, align the variables (e.g. perform stuff like what combine_frame, combine_series, combine_scalar do, i.e. the alignment/reindexing steps), call this the aligned_expr

  3. take aligned_expr and turn this into a numexpr expression (like what Term does and the expressions module does, though it's very simple); this would be an expansion of expressions to take in aligned Terms with their boolean operators (e.g. _And/_Or/_Not and parens)

on each of these:

  1. involves tokenizing/ast manip (kind of like numexpr.evaluate does) to form the Terms; I am not sure how tricky this is, so we were going to skip for now

  2. this is straightforward: take the parsed_expr and substitute variables that are aligned (keep frames as frames), don't need to expand scalars at all, mainly just reindex things that need it, create the aligned_expr

  3. this is straightforward too, just take the term expressions and generate the numexpr itself

so I think termset is really Term, plus the boolean operators, and a grouping operator (the parens)
these just allow easy expression manip (your 2)
your 3 (skip for now, that's my 1)

your 4 is my 3

I don't think you need 5
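The parsing step (turning the string into Terms joined by boolean operators) could be sketched with the standard-library ast module as below. Term and BoolExpr here are simple namedtuples invented for the example, standing in for the Term/_And/_Or classes under discussion. Only single comparisons between a name and a constant, joined by & or |, are handled.

```python
import ast
from collections import namedtuple

# Hypothetical stand-ins for the Term / _And / _Or classes in the thread.
Term = namedtuple("Term", "lhs op rhs")
BoolExpr = namedtuple("BoolExpr", "op operands")  # op is "&" or "|"

_CMP = {ast.Gt: ">", ast.Lt: "<", ast.GtE: ">=", ast.LtE: "<=",
        ast.Eq: "==", ast.NotEq: "!="}
_BOOL = {ast.BitAnd: "&", ast.BitOr: "|"}


def _operand(node):
    """Convert a leaf node (variable name or literal) to a plain value."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported operand: {ast.dump(node)}")


def _convert(node):
    """Recursively convert an AST node into Term / BoolExpr tuples."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BOOL:
        return BoolExpr(_BOOL[type(node.op)],
                        (_convert(node.left), _convert(node.right)))
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP):
        return Term(_operand(node.left), _CMP[type(node.ops[0])],
                    _operand(node.comparators[0]))
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def parse(source):
    return _convert(ast.parse(source, mode="eval").body)


print(parse("(df>0) & (df<10)"))
```

Nested expressions come for free from the recursion in _convert. An unparenthesized 'df>0 & df<10' is rejected, because Python parses it as a chained comparison rather than two Terms.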

@cpcloud
Member

cpcloud commented May 24, 2013

@jreback i know u said skip 1 but i can do that if u want (lots of nice python builtins for dealing with python source) while @hayd does 2 and 3. what would be allowed to be parsed? exprs in the python grammar? or just boolean exprs? could start with booleans for now and extend after that is working...

@jreback
Contributor Author

jreback commented May 24, 2013

the more the merrier!

let's start with the example

df.eval('(df>0)&(df<10)')

This is really about the masks as that's where all the work is done

but I think it would be nice eventually to do something on the rhs as well:

pd.eval('df[(df>0)&(df<10)] = df * 2 + 1', engine='numexpr')

so we can support getitem and setitem and pass both the lhs and rhs to the evaluator

(imagine engine = 'multi-process' or 'out-of-core')......

to the heck with blaze! (well maybe engine=blaze is ok too)

@hayd
Contributor

hayd commented May 24, 2013

I think I was worried that nested Terms wouldn't come for free with _And and _Or, but I'll put something together imminently and we can see whether it does. :)

@hayd
Contributor

hayd commented May 24, 2013

We can just tell everyone it's blaze...

@cpcloud
Member

cpcloud commented May 24, 2013

i've got it parsing nested and terms already :)

@cpcloud
Member

cpcloud commented May 24, 2013

albeit they are strings right now and only & (parsing and is different), i haven't written the _And class yet

@jreback
Contributor Author

jreback commented May 24, 2013

@cpcloud I would just use the &, |, and ~ for now (to keep consistent), can always add later

@jreback
Contributor Author

jreback commented May 24, 2013

@cpcloud

the end goal is to create a numexpr expression (the functionality is in the Selection class in io/pytables.py); so the class that holds the parsed expression (the nested _And/_Or) should parse to this (and has to do type translation and such), also this class could do the alignment I think (which is the reason for having the parsed expression, so you can basically just iterate thru all of the terms and see what needs to be aligned)

e.g.

for t in term_expression:
      t.align()

Term align (pseudo codish)

def align(self):
      self.lhs
      self.op
      self.rhs

      if self.lhs is a DataFrame:
           if self.rhs is a Series....
                     is a Frame
                     is a Scalar

maybe return a new expression that is aligned

@cpcloud
Member

cpcloud commented May 24, 2013

ah i see. so an Expr class should hold the ands and ors which consist of terms (or nested expressions). Expr could have an align method which aligns and then passes to numexpr. is that correct?

@jreback
Contributor Author

jreback commented May 24, 2013

I think you actually need 3 classes here:

  1. Term, which holds lhs operator rhs (and prob a reference back to the top-level Expr for variable inference)
  2. Termset, although maybe Expr, or maybe Terms? is better here (I mean a nested list of _and, _or, _not operators on the Terms)
  3. Top-level, maybe Expression, which holds 2) the termset, and the engine and such

e.g.

pd.eval('df[(df>0)&(df<10)]')

yields

Expression():
    original string
    df[mask] (you need to keep this where)
    termset of the boolean expression
    engine
    maybe an environment pointer (this is like a closure) but we are not fancy here :)

    methods:
        parse (create the termset)
        align (have the termset align)
        convert_to_engine_format (return the converted termset)
Termset():
     _and(Term('df','>',0),Term('df','<',10))
     methods:
          align (maybe return a new termset that is aligned)
          convert_to_engine_format (return the converted to engine format,
              this would be a string)

@cpcloud
Member

cpcloud commented May 24, 2013

lol gh doesn't like ur rst flavored monospace

@hayd
Contributor

hayd commented May 24, 2013

This was where I was up to: https://github.com/hayd/pandas/tree/term-refactor

@cpcloud
Member

cpcloud commented May 24, 2013

possible engines right now are 'numexpr' and 'pytables'?

@jreback
Contributor Author

jreback commented May 24, 2013

well....the pytables target is the same, numexpr; the only difference is that the Terms need to do different alignment, as they are scalar type conditions (e.g. index>20130523, where index is a field in the table, and the date gets translated to i8). so we do need support for that (so yes you could use engine=pytables to handle that), but in pytables you need to have what I call the queryables dict passed in anyhow for validation (whereas in the case of a boolean expression you have the df passed in, or taken from the locals())

@cpcloud
Member

cpcloud commented May 24, 2013

@jreback @hayd fyi for some reason expressions.py has dos line endings while, for example, frame.py does not. isn't git supposed to take care of this? it's pretty annoying and will cause a billion and one merge conflicts...it's just that file: i just ran dos2unix on all of pandas and that's the only thing changed. i did this after a fresh clone

@cpcloud
Member

cpcloud commented Jun 16, 2013

oh that is nice. still have the issue of the different behav tho

@cpcloud
Member

cpcloud commented Jun 16, 2013

interesting and possibly alarming bit....

In [53]: df = DataFrame(randn(10000000, 10))

In [54]: df2 = DataFrame(randn(*df.shape))

In [55]: df3 = DataFrame(randn(*df.shape))

In [56]: s = 'df + df2 * 2 + df3 ** 2 * df * df + df2 ** 40'

In [57]: res = pd.eval(s)

In [58]: res2 = df + df2 * 2 + df3 ** 2 * df * df + df2 ** 40

In [59]: norm(res - res2)
Out[59]: 2411155374342516.5

In [60]: allclose(res, res2)
Out[60]: True

i'm guessing this is because of the large power term and because the arrays are big, but i don't see why the L2 norm should be that different (order of magnitude of difference is 10 ** 15). the L2 norm difference is much smaller with a power only half as big and basically disappears below this value.

In [63]: s = 'df + df2 * 2 + df3 ** 2 * df * df + df2 ** 20'

In [64]: res = pd.eval(s)

In [65]: res2 = df + df2 * 2 + df3 ** 2 * df * df + df2 ** 20

In [66]: norm(res - res2)
Out[66]: 1.4282464805728339

@cpcloud
Member

cpcloud commented Jun 16, 2013

ideally the norm should be 0

@cpcloud
Member

cpcloud commented Jun 16, 2013

i think numexpr might be unrolling integer power ops

@cpcloud
Member

cpcloud commented Jun 16, 2013

well i don't think it's a bug in eval since this is easily replicable with straight numpy/numexpr

@cpcloud
Member

cpcloud commented Jun 16, 2013

k not "dumb" loop unrolling maybe there is some other optimization technique or this is just a straight up bug

In [55]: df = DataFrame(randn(10000000, 10))

In [56]: x = df.values

In [57]: norm(ne.evaluate('x ** 40') - ne.evaluate(' * '.join(['x'] * 40)))
Out[57]: 3966006570425040.5

@cpcloud
Member

cpcloud commented Jun 16, 2013

[image: powernorm]

@jreback
Contributor Author

jreback commented Jun 16, 2013

maybe some sort of overflow?

@cpcloud
Member

cpcloud commented Jun 16, 2013

i believe the optimization is the cause of the divergence:

In [16]: x = randn(10000000, 10)

In [17]: norm(ne.evaluate('x ** 40', optimization='none') - ne.evaluate('x ** 40'))
Out[17]: 616872973144280.12
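One plausible reading of these numbers, offered as an assumption rather than something confirmed in the thread: the two code paths round differently, and values raised to the 40th power are so large that even a last-bit disagreement is huge in absolute terms. That would fit allclose (a relative check) returning True while the L2 norm of the difference is around 10 ** 15. The standard-library sketch below shows the scale for a single value; x = 5.0 is an arbitrary choice.

```python
import math

x = 5.0
direct = x ** 40          # a single power operation

unrolled = 1.0
for _ in range(40):       # 40 separately rounded multiplications
    unrolled *= x

abs_diff = abs(direct - unrolled)
rel_diff = abs_diff / direct

# math.ulp needs Python 3.9+.
print("spacing between neighbouring floats near x**40:", math.ulp(direct))
print("absolute difference:", abs_diff)
print("relative difference:", rel_diff)
```

Near 5.0 ** 40 (about 9e27) adjacent float64 values are 2 ** 40 apart, roughly 1e12, so a relative error near machine precision still shows up as a very large absolute difference once it is summed over 10 ** 8 elements.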

@cpcloud
Member

cpcloud commented Jun 16, 2013

@jreback @jtratner cast % nodes to float64? i think raising on floordiv with a message saying "pass engine=python or eval in python if you want to use floor division", thoughts?

@jreback
Contributor Author

jreback commented Jun 16, 2013

I would just cast them

@cpcloud
Member

cpcloud commented Jun 16, 2013

cast mod u mean right? can't really cast floordiv result as that would defeat the purpose of this...
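The "raise on floordiv" half of this proposal could look roughly like the following. It is a sketch using the standard-library ast module; check_supported and its engine argument are hypothetical names, and the message simply paraphrases the one suggested above.

```python
import ast


def check_supported(source, engine="numexpr"):
    """Raise if `source` uses floor division and the engine is not 'python'."""
    tree = ast.parse(source, mode="eval")
    if engine != "python":
        for node in ast.walk(tree):
            if isinstance(node, ast.BinOp) and isinstance(node.op, ast.FloorDiv):
                raise NotImplementedError(
                    "floor division is not supported by engine="
                    f"{engine!r}; pass engine='python' or evaluate in Python"
                )
    return tree


check_supported("df % 2 + df2")              # modulo is allowed through
check_supported("df // 2", engine="python")  # fine on the python engine
```

The modulo case would instead be handled by casting the operands, which needs type information and is not shown here.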

@jreback
Contributor Author

jreback commented Jun 16, 2013

yes

@jreback
Contributor Author

jreback commented Jun 22, 2013

http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html

would be a good place for docs on eval

@jtratner
Contributor

I think it makes sense to cast to float, very simple to get back

@cpcloud
Member

cpcloud commented Jun 22, 2013

yep this required a pretty large refactoring since to cast in a general way the op needs to know about the scope of its operands

@cpcloud
Member

cpcloud commented Jun 22, 2013

must cast recursively down the parse tree

@cpcloud
Member

cpcloud commented Jun 22, 2013

so that on eval the correct cast is performed...this will work unless there's floor division on both sides, but in that case you shouldn't be using eval anyway since that will run only on the python engine

in other news...implementing an operator in numexpr is not trivial...i thought about doing it but it's kind of a beast...maybe i will anyway

@cpcloud
Member

cpcloud commented Jun 22, 2013

eval is useful for two things as stated above:

  1. there's now a parser for basic arithmetic expressions useful for manip prior to performing ops
  2. 9-10x-ish speedup for long expressions containing huge frames that don't need alignment

@cpcloud
Member

cpcloud commented Jun 25, 2013

@jreback is it intentional that df % series == series % df? E.g.,

In [20]: df = DataFrame(randn(10,10))

In [21]: df
Out[21]:
       0      1      2      3      4      5      6      7      8      9
0 -1.147 -0.175  0.867 -0.459 -0.751 -0.822 -0.927 -1.572  0.813 -0.558
1 -0.043 -1.416  1.420 -1.243 -0.656  0.726 -0.408  0.545 -2.712 -0.353
2  0.566  0.489  1.528  3.058  1.393  1.282 -0.276 -0.705 -0.183  1.386
3  0.679 -0.082  0.831 -2.167 -1.347 -0.178 -0.812 -0.465 -1.509 -0.337
4  0.031  0.975 -1.157 -0.613 -0.491 -0.478  1.763 -0.328 -0.897 -0.011
5  1.110 -0.088  0.162  0.061 -0.715  1.214  1.188  1.802 -0.841  1.435
6 -1.063 -0.447 -0.743 -0.567  1.492  0.468  2.043 -0.873 -0.803 -1.178
7  0.507 -2.446 -1.553  0.468 -0.148 -0.871 -0.207  1.386 -1.173  0.155
8 -1.267 -0.219 -0.021 -0.686  0.159 -0.868 -1.734  0.312 -1.460  0.864
9  1.800  0.751 -0.677 -2.029 -0.711 -0.748  0.555  1.060  0.493  1.842

In [22]: s = df[0]

In [23]: s
Out[23]:
0   -1.147
1   -0.043
2    0.566
3    0.679
4    0.031
5    1.110
6   -1.063
7    0.507
8   -1.267
9    1.800
Name: 0, dtype: float64

In [24]: allclose(df % s, s % df)
Out[24]: True

In [25]: allclose(df.values % s.values, s.values % df.values)
Out[25]: False

@cpcloud
Member

cpcloud commented Jun 25, 2013

basically force the frame on the lhs of modulus is what's happening
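The general pitfall can be shown without pandas. In Python, other % obj dispatches to obj.__rmod__(other), and the reflected method has to keep the operands in their original order. The Boxed class below is a hypothetical stand-in, not the pandas code path involved in this bug.

```python
class Boxed:
    """Hypothetical stand-in for a container that overloads `%`."""

    def __init__(self, value):
        self.value = value

    def __mod__(self, other):   # Boxed % other
        return Boxed(self.value % getattr(other, "value", other))

    def __rmod__(self, other):  # other % Boxed
        # The operands must stay in their original order here. Delegating to
        # __mod__ would silently compute  self % other  instead.
        return Boxed(getattr(other, "value", other) % self.value)


print((Boxed(7) % 3).value)  # 7 % 3 -> 1
print((3 % Boxed(7)).value)  # 3 % 7 -> 3
```

If __rmod__ simply reused __mod__, both lines would print 1, which is the same symptom as df % s matching s % df above.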

@cpcloud
Member

cpcloud commented Jun 25, 2013

i'll submit a pr to fix it

5 participants