0.24.0 vs 0.23.4: scalar + DataFrame is 3000x slower #24990
@dycw is "DataFrame ops with scalars are slower" a fair summary of the issue? Anything else from your post that's important? cc @jbrockmendel. Is this related to the fix for the broadcasting? We're spending a lot of time in
@TomAugspurger Yes, I believe all ops are impacted, not just add. My test also showed a regression for Series, albeit not by orders of magnitude.
We'll see what profiling shows, but that's likely a different issue. 0.24.0 changed more operations to operate column-wise, so this slowdown scales with the number of columns, which wouldn't affect Series.
Yes, we are doing the operation column-by-column, which definitely has a perf hit in few-row/many-column cases like this. Locally I'm only seeing about a 10x difference; I can't speak to the 3000x. Side-note: instantiating the DataFrame

In the short-to-medium run, operating column-wise is the only way we could find to make Series and DataFrame behavior consistent. In the medium-to-long run, I think the way to address this perf issue is to dispatch to blocks instead of columns. This is a big part of why I want EA to be able to handle 2D arrays.
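To make the tradeoff concrete, here is a minimal sketch (illustrative only, not pandas internals; the helper names are hypothetical) contrasting per-column dispatch with a single block-wise operation:

```python
import numpy as np
import pandas as pd

# A wide, short frame like the one in the OP: 1 row x 5000 columns of NaN.
df = pd.DataFrame(np.full((1, 5000), np.nan))

def add_columnwise(df, scalar):
    # One Series op per column: fixed per-column overhead dominates
    # when there are many columns and few rows.
    return pd.concat([df[col] + scalar for col in df.columns], axis=1)

def add_blockwise(df, scalar):
    # A single vectorized op over the underlying 2D ndarray.
    return pd.DataFrame(df.to_numpy() + scalar,
                        index=df.index, columns=df.columns)

# Both produce the same result; the block-wise version does one
# numpy operation instead of 5000.
```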
Is the fact that EAs are 1-D what prohibits blockwise application of these ops? I would have thought those were orthogonal: each EA column would be done on its own, but a frame with 1,000 float columns and one EA would end up with two calls to
I could have made this clearer. The functions in core.ops that define arithmetic/comparison ops for Series are pretty easy to adapt to the 2D case (in fact, some are already dim-agnostic). The path I have in mind is:
First, assume a can opener :) (fellow economists should get that). More seriously, I've been thinking of ways we could opt pandas into 2D arrays without putting that complexity on users; nothing concrete yet, though. In particular, it's not clear to me that a 2D EA isn't just re-inventing Block. I'd be interested in seeing if we can get ops working blockwise with the current mix of ndarray-backed Blocks and EA-backed ops (though this isn't a high priority for me right now).
Maybe we should discuss this in a dedicated issue? I like this as a potential compromise on the 1D vs drop-in-for-ndarray tradeoff. It would make it feasible for me to put together a proof of concept for the block-based arithmetic.

Block would be a much simpler beast if it a) didn't have to carry around
@TomAugspurger: Had to interject, but I approve 🙂
I agree with @TomAugspurger here. This has been a big box of worms (more than a can :>) for quite some time. Blocks are rather efficient for some operations but have some costs:

So it's not a simple "just use 2D EAs". As you can see there are some obvious benefits, but there is a really long tail of hidden costs. I have basically switched my view over the years from pro-blocks (for the perf benefits) to pro-columns because of the simplicity; sure, we do have some perf costs, but IMHO this is easily outweighed by the reduced complexity. Anyhow, let's open a dedicated issue about this.
I profiled the case of

So it is using a 5000x5000 DataFrame instead of 1x5000 (still a wide DataFrame, but a bit more realistic a case, I think). On master this takes around 2 seconds, which is 20x slower than before the change (10x with the original timeit in the OP, but excluding the DataFrame creation makes it more pronounced). With a few relatively simple optimizations, I got this down from a 20x slowdown to a 5x slowdown. Those include:

Additionally, the recreation of the array spends quite some time sanitizing the array and checking the block types, which should in principle not be necessary. Of course, this was a very simplified case (I only looked at the operation with a scalar, a case without NaNs, etc.), but I think it at least shows that if we care about this performance issue for wide DataFrames, it can be improved quite substantially.
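A rough way to reproduce that measurement (a sketch under the same assumptions: a 5000x5000 float frame, a scalar op only, and construction excluded from the timing):

```python
import timeit

import numpy as np
import pandas as pd

# Build the frame once, outside the timed statement, so DataFrame
# construction does not inflate the measurement (including it is what
# softened the apparent slowdown in the OP's original timeit).
df = pd.DataFrame(np.random.randn(5000, 5000))

# Time only the scalar operation itself.
per_loop = timeit.timeit("df + 1", globals={"df": df}, number=10) / 10
print(f"df + 1: {per_loop:.3f} s per loop")
```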
@jbrockmendel 2D EAs should fix this regression, right?
yes (well, 2D EA will make it feasible to fix in a follow-up)
Think it's doable for the 1.0 release (roughly, September)?
At the current pace, no, I don't expect 2D EA to be in master by the end of September. If everyone gets on board, that could change quickly. Working on plan B to get a perf fix in more promptly.
Benchmarking #29853 against master with the examples from the OP:
Speedups of 6197x and 6410x.
Closed by #29853 |
Code Sample, a copy-pastable example if possible

First, two `conda` environments:

Now, a generalized script testing all 8 combinations of `scalar + ndframe`, where `scalar in [0, 0.0]` and `ndframe` is a `Series` or `DataFrame` with 5000 `NaN`s; the combinations reverse the operand order too.
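The script itself did not survive extraction; below is a hedged reconstruction of what it plausibly measured (the 1x5000 DataFrame shape and the loop structure are assumptions inferred from the discussion in this thread):

```python
import timeit

import numpy as np
import pandas as pd

# Assumed shapes: a Series of 5000 NaNs and a 1-row x 5000-column
# DataFrame of NaNs, as mentioned later in the thread.
series = pd.Series(np.full(5000, np.nan))
frame = pd.DataFrame(np.full((1, 5000), np.nan))

# All 8 combinations: scalar in [0, 0.0], ndframe a Series or a
# DataFrame, and both operand orders.
for scalar in (0, 0.0):
    for name, ndframe in (("Series", series), ("DataFrame", frame)):
        for label, stmt in ((f"{scalar!r} + {name}", lambda: scalar + ndframe),
                            (f"{name} + {scalar!r}", lambda: ndframe + scalar)):
            per_loop = min(timeit.repeat(stmt, number=100, repeat=3)) / 100
            print(f"{label:>18}: {per_loop * 1e6:10.1f} µs")
```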
Problem description

A table of timings:
These are collated from the following:
0.23.4
0.24.0
Expected Output
Output of `pd.show_versions()`
0.23.4
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
0.24.0
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None