-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Numba as an optional dependency for rolling.apply for pandas 1.0 #28987
Comments
Hi all. This proposal has a lot of implications that I don't see addressed in the PoC document. I would suggest reviewing this thread about introducing Numba as a SciPy dependency from last year. Some concrete questions that may be showstoppers are:
Pydata/Sparse, which is pure Python + Numba, is thinking about incrementally moving parts back to compiled code, due to stability and currently missing features in Numba. And I think that Scikit-learn experimented with this fairly recently and also came away with "not ready yet" (second-hand info via @amueller, please correct me if I'm wrong @NicolasHug). |
I’m sympathetic to the goal here, but my sense is that Numba can still be a challenging project to redistribute, so I’m not sure this is a good idea. Not everyone uses conda or pip. Just look at the building from source instructions for patching LLVM, for example: https://llvmlite.readthedocs.io/en/latest/admin-guide/install.html. A strength of pandas has been that it’s easy to install with a standard toolchain, and has a minimal set of required dependencies. In your linked doc, you list three benefits from using Numba:
With all due respect, I question benefits (2) and (3). Removing all C extensions would be to remove the need for a C compiler when building pandas from source, but this seems quite farfetched for pandas. Until that happens, I’m not sure why this would be productive. C extensions are a mature and stable technology. Calling Numba wrapped code “pure Python code” from a debugging perspective also seems misguided. It’s only pure Python until you compile it with Numba, at which point you can debug Numba’s compiler and LLVM IR. I’m pretty confident that the toolchain for debugging Cython’s generated C code is much more battle tested. For example, Cython is actually capable of embedding arbitrary Python, while Numba requires dropping back into “object” mode, which as far as I understand entails skipping most of the compiler. |
As mentioned in the last dev call, my (very limited) experience with numba suggests that it is very easy (= few lines of decorator) to use it without hard depending on it, and leaving the user with the pure Python implementation if numba is missing. While such pure Python implementation would be very slow, this would affect only rolling functions which are not necessarily used by all users (at that time, we would probably emit a warning). In any case, while I would not impose the burden of always writing a numba and a numpy implementation of rolling functions, this does not mean we have to kill existing numpy implementations (as a fallback) in 1.0. So if we decide to go ahead introducing numba in 1.0 (and on this, other people have better judgement than me), I still would refrain, if possible, from introducing the hard dependency. |
If you're going to use
Sure, but that doesn't mean we should sacrifice performance. Ideally, there should be a non-negative benefit for all users if we made such a switch. |
Yes, I don't think Numba is mature enough for scikit-learn to switch from Cython. I've used numba extensively in pygbm and more recently in a toy project (hmmkay). Here are the main arguments I have against it:
With all this I tend to question the "ease of use" argument of numba, I don't find it to be true, for now. Regarding allowing a soft dependency on Numba: numba code is Python code, but it's not pythonic code. For numba to generate efficient code you need to unfold |
With one of the major concerns in this thread being the installation dependency, is there a comprehensive listing of what platforms this would impact? I'm not sure we've been explicit ourselves as to what platforms are supported; if it is a complicating factor for niche platforms I don't think that should be a showstopper FWIW I think it's definitely worth exploring. The performance improvements on lambda expressions are great, and while we already have a lot of Cython architecture in place I would argue that we may be biased in assessing Cython as easier to debug than Numba. I don't find that we get a lot of newcomers that jump into the Cython code, nor is our Cython code always clean or consistent. I could see Numba potentially being easier to digest |
A lot of the conversation has been about numba vs cython, but we also have
a soft dependency on bottleneck that is used in core.nanops that we might
be able to remove if we were to use numba. IIRC @shoyer did a proof of
concept for switching over some of those functions?
bottleneck's last release was May 2017 and there are some bugs e.g.
#22385 that don't seem to have
any movement. I would view removing this dependency as an upside to using
numba.
…On Tue, Oct 15, 2019 at 9:08 AM William Ayd ***@***.***> wrote:
With one of the major concerns in this thread being the installation
dependency, is there a comprehensive listing of what platforms this would
impact? I'm not sure we've been explicit ourselves as to what platforms are
supported; if it is a complicating factor for niche platforms I don't think
that should be a showstopper
FWIW I think it's definitely worth exploring. The performance improvements
on lambda expressions are great, and while we already have a lot of Cython
architecture in place I would argue that we may be biased in assessing
Cython as easier to debug than Numba. I don't find that we get a lot of
newcomers that jump into the Cython code, nor is our Cython code always
clean or consistent. I could see Numba potentially being easier to digest
—
You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
<#28987?email_source=notifications&email_token=AB5UM6FA6ASQEGLADCL4D2LQOXTJVA5CNFSM4JAXTJC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBJKW7Y#issuecomment-542288767>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6EWB7QB6PMPSCVMAG3QOXTJVANCNFSM4JAXTJCQ>
.
|
It is straightforward to rewrite most of the algorithms in bottleneck in Numba (e.g., like I did in http://github.com/shoyer/numbagg), but also in Cython or C++. If somebody cares about removing the bottleneck dependency, there are any number of fine ways to do so. Actually C++ is really the elephant in the room here. It is definitely the most mature technology for high performance, high level programming.
Bottleneck has a new maintainer and recently moved over in the pydata GitHub organization. I don't think it's fair to consider #22385 a bug. Bottleneck doesn't implement pairwise summation, but that doesn't make it wrong, it just a slightly less numerically stable algorithm for computing sums. |
There are two goals mentioned in the top post: 1) use numba to implement the built-in rolling operations and 2) provide an API for the user to pass a numba function to be used as rolling operation. I think the second goal is also very useful (and for the user give more added value, as the first goal is more an internal change). Could this second goal be implemented without having numba as a hard dependency? |
Hi, Numba core developer here, We agree that there are concerns in using Numba on a project like Pandas, and we would like to prioritise them better. It has been helpful in the past to itemise them and obtain clarity over the issues such that they can be fixed/improved (xref: numba/numba#2888 for the last round of this). I've opened a new issue to start a new discussion over on the Numba issue tracker (xref: numba/numba#4723). Thanks! |
Thanks for all the feedback @pandas-dev/pandas-core I'd like to revise my pandas 1.0 proposal to the following:
The main allure of Numba in its immediate state is being able to write Python and achieve comparable performance with Cython (or C++) thus providing:
A con of adding Numba as a optional dependency for rolling is having to maintain duplicate implementations in Cython and Numba. The burden should go down in the long term as we re-visit Numba as a core dependency and circle back to a single implementation. While a lot of the concerns are around Numba's immaturity, it is a consideration for limiting its application in pandas 1.0 (to only |
Why is this? Would you still include implementations of all built-in rolling functions in pandas? I would think the |
no, you wouldn't need to (at this point), duplicate anything for building in rolling function (e.g. |
Right, I think we'll need 2 implementations for calculating the window boundaries, a Cython version (for rolling functions that are still in Cython) and a Numba version (for rolling functions that get converted to Numba) |
@mroeschke are you hoping to do this for 1.0? Roughly, how much effort is left on #29428, and how much additional effort to close this? Trying to get a sense for where we're at. Edit: and should the title of this post be updated based on #28987 (comment)? The optional part is still the plan, right? |
@TomAugspurger yes still aiming for this for 1.0. Hoping to finish #29428 by end of week. After there should be ~2/3 followup PRs (1 more cleanup for the "window indexers", 1 for Numba + rolling apply). These PR's should be a little more straightforward and hoping to finish those in December/end of year. |
Just an observation: there's a lot of concerns and potential issues in this issue, with essentially no response, except for "let's make it optional then". If you had discussions in a dev call or something, it would be nice to summarize, because the decision making looks a little odd here. |
No dev call (though this will likely come up during tomorrow's). My (possibly wrong) understanding of the situation is that
Though that doesn't quite square with @mroeschke comment
Are we converting some functions to Numba? Or are we limiting the scope to taking a numba-path for user-defined functions that have been jitted (by the user). |
The revised proposal #28987 (comment) was in response to the various concerns around Numba and making its use essential fully optional. Reiterating the proposal points:
@TomAugspurger so the idea is that we would try to jit the users function (and raise if we fail) instead of the users jitting before passing to Added the topic as a discussion point for tomorrow's dev call. |
As mentioned on the pandas dev call last week, I've been working with @jreback and @DiegoAlbertoTorres on a proof of concept (POC) implementing
rolling.mean
androlling.apply
using Numba instead of our current Cython implementation. As described in this proof of concept document, we worked on:df.rolling(MyWindowerClass()).mean()
)The document details performance results, high level implementation details and integration plan into pandas. The fork is up to date with master and is running the same CI checks.
The proposal for pandas 1.0 is:
1) Introduce Numba as a required pandas dependency
2) Have
rolling.mean
dispatch to the Numba implementationThe proposal for post pandas 1.0 is:
groupby
,rolling
,resample
) and aggregations (mean
,max
, etc) using Numba jitclassesAs the issue title notes, I'm hoping to keep this discussion focused around any concerns, thoughts, suggestions regarding the POC and how it pertains to the pandas 1.0 proposal, but feel free to ask any questions regarding topics the POC doesn't cover.
cc @pandas-dev/pandas-core
The text was updated successfully, but these errors were encountered: