PERF: group by manipulation is slower with new arrow engine #52070
Comments
This is expected. GroupBy isn't implemented for arrow yet.
@phofl thanks for your comments.
Is there an explanation of the current limitations when using arrow? There have been several blog posts talking up the benefits of having arrow in pandas, and I think it would be a good idea to lay out the current performance limitations of using arrow in pandas.
I'm working on a patch here and having trouble making a performant conversion from the ArrowArray to MaskedArray. The non-working method looks like:
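(The actual snippet didn't survive extraction; a sketch of the kind of attempt being described, assuming the input is the backing `pa.ChunkedArray` — names are illustrative, not the actual patch:)

```python
import pyarrow as pa

def to_masked(values: pa.ChunkedArray):
    arr = values.combine_chunks()
    # Array.to_numpy() defaults to zero_copy_only=True, which raises
    # ArrowInvalid whenever a copy is needed: for `data` as soon as there
    # are nulls, and for `mask` always (booleans are bit-packed in Arrow).
    data = arr.to_numpy()
    mask = arr.is_null().to_numpy()
    return data, mask
```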
But it looks like this is not the correct way to get the data and mask.
Did you try `to_numpy(zero_copy_only=False)`?
That seems to work, thanks. Branch is ready for once the EA._groupby_op PR is merged.
That doesn't give you a masked array, though (if that's what is needed). And it will make a copy if there are missing values. We normally already have this functionality in the `pyarrow_array_to_numpy_and_mask` helper.
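(A sketch of using that helper; in pandas 2.0 it lives at `pandas.core.arrays.arrow._arrow_utils` — being private, the import path can change between versions:)

```python
import numpy as np
import pyarrow as pa
from pandas.arrays import FloatingArray
from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

arr = pa.array([1.0, None, 3.0])
# Returns the raw values plus a boolean *validity* mask (True = valid);
# the masked-array convention is the inverse (True = missing), hence ~mask.
data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=np.dtype("float64"))
result = FloatingArray(data.copy(), ~mask)  # copy: data is a read-only view
```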
@jorisvandenbossche Regarding the copy: I am aware that we create a copy when we have missing values, but wouldn't arrow do the same?
We only need to copy the bitmask (to convert it into a bytemask). The actual data (the values buffer) can be viewed zero-copy.
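(Concretely, in terms of the Arrow buffers — a sketch for a primitive float64 array, where buffer 0 is the validity bitmask and buffer 1 holds the values:)

```python
import numpy as np
import pyarrow as pa

arr = pa.array([1.0, None, 3.0])
# The values buffer can be viewed zero-copy as float64.
values = np.frombuffer(arr.buffers()[1], dtype=np.float64)[: len(arr)]
# The validity buffer packs one *bit* per value, so turning it into a
# numpy boolean (byte-per-value) mask requires a small copy.
bits = np.frombuffer(arr.buffers()[0], dtype=np.uint8)
valid = np.unpackbits(bits, count=len(arr), bitorder="little").astype(bool)
```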
@jorisvandenbossche I looked into this and …
I took a look at this too. It doesn't make up the 5x, but I see some improvement if you combine the chunks first:
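(The timings were lost in extraction; a sketch of the comparison — exact numbers will vary by machine and pyarrow version:)

```python
import numpy as np
import pyarrow as pa

chunked = pa.chunked_array([pa.array(np.random.randn(100_000)) for _ in range(100)])

%timeit chunked.to_numpy()                                       # chunk by chunk
%timeit chunked.combine_chunks().to_numpy(zero_copy_only=False)  # concatenate first
```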
Also, using the example from the OP, I see an even bigger difference:
But according to a simple profile of the python (non-naive) time, that slow time is almost entirely due to our `Series(..)` constructor. That seems like a separate issue on our side that we should fix (it seems to treat the input masked array as a generic sequence or something like that). Comparing directly:
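(The actual comparison was lost; a sketch of the point being made, using public constructors — illustrative, not the profiled code:)

```python
import numpy as np
import pandas as pd

data = np.random.randn(1_000_000)
mask = np.zeros(len(data), dtype=bool)

# Building the masked extension array directly is cheap ...
%timeit pd.arrays.FloatingArray(data, mask)
# ... while routing a MaskedArray through the Series constructor goes
# down a generic inference path, which is where the time shows up.
%timeit pd.Series(np.ma.masked_array(data, mask))
```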
Doing the conversion from pandas/core/arrays/numeric.py (lines 99 to 100 at d182a34):
The specific example data we are using here also doesn't have missing values. To illustrate: when you do have missing values, converting to data+mask is faster than `to_numpy`:
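(Again the snippet was lost; a sketch of the claim using the internal helper mentioned above — internal API, may change:)

```python
import numpy as np
import pyarrow as pa
from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

data = np.random.randn(1_000_000)
arr = pa.array(data, mask=data < 0)  # roughly half the values are null

# Converting to a float64 ndarray has to fill in NaNs, so it copies everything:
%timeit arr.to_numpy(zero_copy_only=False)
# data + mask keeps the values buffer zero-copy and only converts the bitmask:
%timeit pyarrow_array_to_numpy_and_mask(arr, dtype=np.dtype("float64"))
```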
This was actually also an issue in pyarrow and not just pandas: there, converting a ChunkedArray to numpy also went chunk by chunk. But this should be fixed on the pyarrow side in the upcoming 12.0 release (where that conversion has been improved).
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code will reproduce my issue:
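(The original snippet was lost in extraction; a minimal sketch of the kind of comparison reported, assuming a float column with pyarrow-backed dtypes versus the default numpy dtypes — shapes and values are illustrative:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": np.random.randint(0, 100, 1_000_000),
        "value": np.random.randn(1_000_000),
    }
)
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

%timeit df.groupby("key")["value"].mean()        # numpy-backed dtypes
%timeit df_arrow.groupby("key")["value"].mean()  # pyarrow-backed dtypes
```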
The new engine is 2x slower than the old engine.
Installed Versions
INSTALLED VERSIONS
commit : c2a7f1a
python : 3.9.16.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936
pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.1.2
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy :
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : 2.2.0
pyqt5 : None
Prior Performance
No response