Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Numba as an optional dependency for rolling.apply for pandas 1.0 #28987

Closed
mroeschke opened this issue Oct 15, 2019 · 19 comments · Fixed by #30151
Closed

Add Numba as an optional dependency for rolling.apply for pandas 1.0 #28987

mroeschke opened this issue Oct 15, 2019 · 19 comments · Fixed by #30151
Labels
Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding
Milestone

Comments

@mroeschke
Copy link
Member

As mentioned on the pandas dev call last week, I've been working with @jreback and @DiegoAlbertoTorres on a proof of concept (POC) implementing rolling.mean and rolling.apply using Numba instead of our current Cython implementation. As described in this proof of concept document, we worked on:

  1. Refactoring window bound calculation and aggregation to use Numba
  2. Developing a new API for users to implement their own window bounds calculations (df.rolling(MyWindowerClass()).mean())

The document details performance results, high level implementation details and integration plan into pandas. The fork is up to date with master and is running the same CI checks.

The proposal for pandas 1.0 is:

1) Introduce Numba as a required pandas dependency
2) Have rolling.mean dispatch to the Numba implementation

The proposal for post pandas 1.0 is:

  1. Expose the new API for users to implement their own window bounds calculations
  2. Implement all rolling aggregations (min, max, count, etc,) in Numba
  3. Implement EWM and Expanding in Numba
  4. Generalize data grouping (groupby, rolling, resample) and aggregations (mean, max, etc) using Numba jitclasses

As the issue title notes, I'm hoping to keep this discussion focused around any concerns, thoughts, suggestions regarding the POC and how it pertains to the pandas 1.0 proposal, but feel free to ask any questions regarding topics the POC doesn't cover.

cc @pandas-dev/pandas-core

@mroeschke mroeschke added Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding labels Oct 15, 2019
@rgommers
Copy link
Contributor

Hi all. This proposal has a lot of implications that I don't see addressed in the PoC document. I would suggest reviewing this thread about introducing Numba as a SciPy dependency from last year.

Some concrete questions that may be showstoppers are:

  • what is the status on ARM and other more exotic hardware (e.g. AIX, the Debian-supported architectures)?
  • what's the impact on deployment size (e.g. this may hurt embedded use cases, and push NumPy+SciPy+Pandas over the size limit for AWS Lambda).

Pydata/Sparse, which is pure Python + Numba, is thinking about incrementally moving parts back to compiled code, due to stability and currently missing features in Numba.

And I think that Scikit-learn experimented with this fairly recently and also came away with "not ready yet" (second-hand info via @amueller, please correct me if I'm wrong @NicolasHug).

@shoyer
Copy link
Member

shoyer commented Oct 15, 2019

I’m sympathetic to the goal here, but my sense is that Numba can still be a challenging project to redistribute, so I’m not sure this is a good idea. Not everyone uses conda or pip. Just look at the building from source instructions for patching LLVM, for example: https://llvmlite.readthedocs.io/en/latest/admin-guide/install.html.

A strength of pandas has been that it’s easy to install with a standard toolchain, and has a minimal set of required dependencies.

In your linked doc, you list three benefits from using Numba:

  1. Performance parity or improvement over Cython
  2. Eliminate shipping C-extensions
  3. Ease of debugging and maintenance as pure Python code

With all due respect, I question benefits (2) and (3).

Removing all C extensions would be to remove the need for a C compiler when building pandas from source, but this seems quite farfetched for pandas. Until that happens, I’m not sure why this would be productive. C extensions are a mature and stable technology.

Calling Numba wrapped code “pure Python code” from a debugging perspective also seems misguided. It’s only pure Python until you compile it with Numba, at which point you can debug Numba’s compiler and LLVM IR. I’m pretty confident that the toolchain for debugging Cython’s generated C code is much more battle tested. For example, Cython is actually capable of embedding arbitrary Python, while Numba requires dropping back into “object” mode, which as far as I understand entails skipping most of the compiler.

@toobaz
Copy link
Member

toobaz commented Oct 15, 2019

As mentioned in the last dev call, my (very limited) experience with numba suggests that it is very easy (= few lines of decorator) to use it without hard depending on it, and leaving the user with the pure Python implementation if numba is missing. While such pure Python implementation would be very slow, this would affect only rolling functions which are not necessarily used by all users (at that time, we would probably emit a warning). In any case, while I would not impose the burden of always writing a numba and a numpy implementation of rolling functions, this does not mean we have to kill existing numpy implementations (as a fallback) in 1.0.

So if we decide to go ahead introducing numba in 1.0 (and on this, other people have better judgement than me), I still would refrain, if possible, from introducing the hard dependency.

@gfyoung
Copy link
Member

gfyoung commented Oct 15, 2019

I still would refrain, if possible, from introducing the hard dependency.

If you're going to use numba, I think you should either go "all in" or not at all. The toolchain implications of using numba are not trivial, and the fallback is not ideal with pure Python code.

this would affect only rolling functions which are not necessarily used by all users

Sure, but that doesn't mean we should sacrifice performance. Ideally, there should be a non-negative benefit for all users if we made such a switch.

@NicolasHug
Copy link
Contributor

And I think that Scikit-learn experimented with this fairly recently and also came away with "not ready yet" (second-hand info via @amueller, please correct me if I'm wrong @NicolasHug).

Yes, I don't think Numba is mature enough for scikit-learn to switch from Cython.

I've used numba extensively in pygbm and more recently in a toy project (hmmkay). Here are the main arguments I have against it:

  1. error messages aren't helpful (at all)
  2. compilation time is significant for a large code-base (caching makes it less of an issue but it can be surprising for users)
  3. it's fairly unstable compared to cython. Bugs can be extremely hard to catch, and it takes time to realize that the bug comes from numba, and not from your code. I literally wasted days with pygbm on this.
  4. API is not entirely stable yet (e.g. deprecation of reflected lists for typed lists)
  5. A few features are missing, like being able to parallelize methods. It doesn't look like much, but it makes the code look like spaghetti after a while.
  6. It's magic, so it's hard to know whether what you are writing is actually optimal or not. This is also true for Cython but I find numba even more surprising.

With all this I tend to question the "ease of use" argument of numba, I don't find it to be true, for now.

Regarding allowing a soft dependency on Numba: numba code is Python code, but it's not pythonic code. For numba to generate efficient code you need to unfold for loops, etc. Yes this is pure python code, but it will be incredibly slow if numba does not JIT compile it (slower than if you had been writing pythonic code).

@WillAyd
Copy link
Member

WillAyd commented Oct 15, 2019

With one of the major concerns in this thread being the installation dependency, is there a comprehensive listing of what platforms this would impact? I'm not sure we've been explicit ourselves as to what platforms are supported; if it is a complicating factor for niche platforms I don't think that should be a showstopper

FWIW I think it's definitely worth exploring. The performance improvements on lambda expressions are great, and while we already have a lot of Cython architecture in place I would argue that we may be biased in assessing Cython as easier to debug than Numba. I don't find that we get a lot of newcomers that jump into the Cython code, nor is our Cython code always clean or consistent. I could see Numba potentially being easier to digest

@jbrockmendel
Copy link
Member

jbrockmendel commented Oct 15, 2019 via email

@shoyer
Copy link
Member

shoyer commented Oct 15, 2019

A lot of the conversation has been about numba vs cython, but we also have
a soft dependency on bottleneck that is used in core.nanops that we might
be able to remove if we were to use numba. IIRC @shoyer did a proof of
concept for switching over some of those functions?

It is straightforward to rewrite most of the algorithms in bottleneck in Numba (e.g., like I did in http://github.com/shoyer/numbagg), but also in Cython or C++. If somebody cares about removing the bottleneck dependency, there are any number of fine ways to do so.

Actually C++ is really the elephant in the room here. It is definitely the most mature technology for high performance, high level programming.

bottleneck's last release was May 2017 and there are some bugs e.g. #22385 that don't seem to have any movement

Bottleneck has a new maintainer and recently moved over in the pydata GitHub organization.

I don't think it's fair to consider #22385 a bug. Bottleneck doesn't implement pairwise summation, but that doesn't make it wrong, it just a slightly less numerically stable algorithm for computing sums.

@jorisvandenbossche
Copy link
Member

There are two goals mentioned in the top post: 1) use numba to implement the built-in rolling operations and 2) provide an API for the user to pass a numba function to be used as rolling operation.

I think the second goal is also very useful (and for the user give more added value, as the first goal is more an internal change). Could this second goal be implemented without having numba as a hard dependency?

@stuartarchibald
Copy link

Hi, Numba core developer here,

We agree that there are concerns in using Numba on a project like Pandas, and we would like to prioritise them better. It has been helpful in the past to itemise them and obtain clarity over the issues such that they can be fixed/improved (xref: numba/numba#2888 for the last round of this). I've opened a new issue to start a new discussion over on the Numba issue tracker (xref: numba/numba#4723).

Thanks!

@mroeschke
Copy link
Member Author

mroeschke commented Oct 18, 2019

Thanks for all the feedback @pandas-dev/pandas-core

I'd like to revise my pandas 1.0 proposal to the following:

  1. Introduce Numba as an optional dependency
  2. Have rolling.apply dispatch to Numba as opt-in behavior (e.g. rolling(...).apply(..., engine='Numba')) as a potential significant performance boost

The main allure of Numba in its immediate state is being able to write Python and achieve comparable performance with Cython (or C++) thus providing:

  1. Lower barrier to entry for new contributors.
  2. More debuggable workflow for the maintance team, e.g. @jit(forceobj=True) to drop back to Python.
    2a) Debugging the LLVM IR has been more of a matter of debugging performance instead of result correctness in my experience; the later which is more important.

A con of adding Numba as a optional dependency for rolling is having to maintain duplicate implementations in Cython and Numba. The burden should go down in the long term as we re-visit Numba as a core dependency and circle back to a single implementation.

While a lot of the concerns are around Numba's immaturity, it is a consideration for limiting its application in pandas 1.0 (to only rolling.apply).
Moreover, I believe adopting Numba as an optional dependency can help drive significant feedback to the project (platform support, feature completeness, etc)

@jreback jreback added this to the 1.0 milestone Oct 18, 2019
@jorisvandenbossche
Copy link
Member

A con of adding Numba as a optional dependency for rolling is having to maintain duplicate implementations in Cython and Numba.

Why is this? Would you still include implementations of all built-in rolling functions in pandas? I would think the .apply(func, engine='numba') is mainly for user-defined-functions?

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

A con of adding Numba as a optional dependency for rolling is having to maintain duplicate implementations in Cython and Numba.

Why is this? Would you still include implementations of all built-in rolling functions in pandas? I would think the .apply(func, engine='numba') is mainly for user-defined-functions?

no, you wouldn't need to (at this point), duplicate anything for building in rolling function (e.g. mean); the duplication of code is in the computation of the Fixed and Variable window indexers themselves (not a huge duplication).

@mroeschke
Copy link
Member Author

Right, I think we'll need 2 implementations for calculating the window boundaries, a Cython version (for rolling functions that are still in Cython) and a Numba version (for rolling functions that get converted to Numba)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 12, 2019

@mroeschke are you hoping to do this for 1.0? Roughly, how much effort is left on #29428, and how much additional effort to close this? Trying to get a sense for where we're at.

Edit: and should the title of this post be updated based on #28987 (comment)? The optional part is still the plan, right?

@mroeschke mroeschke changed the title Add Numba as a required dependency for rolling.mean for pandas 1.0 Add Numba as an optional dependency for rolling.mean for pandas 1.0 Nov 12, 2019
@mroeschke
Copy link
Member Author

@TomAugspurger yes still aiming for this for 1.0.

Hoping to finish #29428 by end of week. After there should be ~2/3 followup PRs (1 more cleanup for the "window indexers", 1 for Numba + rolling apply). These PR's should be a little more straightforward and hoping to finish those in December/end of year.

@rgommers
Copy link
Contributor

@TomAugspurger yes still aiming for this for 1.0.

Just an observation: there's a lot of concerns and potential issues in this issue, with essentially no response, except for "let's make it optional then". If you had discussions in a dev call or something, it would be nice to summarize, because the decision making looks a little odd here.

@TomAugspurger
Copy link
Contributor

No dev call (though this will likely come up during tomorrow's).

My (possibly wrong) understanding of the situation is that

  • We're not ready for a hard dependency on numba
  • Being able to do a .rolling(...).apply(numba_jitted_user_function) all in Numba is compelling enough to warrant some duplicate code (though I haven't followed the discussion on how much)

Though that doesn't quite square with @mroeschke comment

and a Numba version (for rolling functions that get converted to Numba)

Are we converting some functions to Numba? Or are we limiting the scope to taking a numba-path for user-defined functions that have been jitted (by the user).

@mroeschke
Copy link
Member Author

mroeschke commented Nov 12, 2019

The revised proposal #28987 (comment) was in response to the various concerns around Numba and making its use essential fully optional.

Reiterating the proposal points:

  1. Introduce Numba as an optional dependency
  2. Have rolling.apply dispatch to Numba as opt-in behavior (e.g. rolling(...).apply(user_python_func, engine='Numba')) as a potential significant performance boost

@TomAugspurger so the idea is that we would try to jit the users function (and raise if we fail) instead of the users jitting before passing to apply - though we could accomodate the latter as well.

Added the topic as a discussion point for tomorrow's dev call.

@mroeschke mroeschke changed the title Add Numba as an optional dependency for rolling.mean for pandas 1.0 Add Numba as an optional dependency for rolling.apply for pandas 1.0 Dec 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging a pull request may close this issue.