PERF: group by manipulation is slower with new arrow engine #52070

Closed
3 tasks done
wegamekinglc opened this issue Mar 19, 2023 · 13 comments · Fixed by #52469
Labels
Arrow (pyarrow functionality), Groupby, Performance (memory or execution speed)

Comments

wegamekinglc commented Mar 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The following code reproduces my issue:

import numpy as np
import pandas as pd

# generate a bunch of sample data for later use
n = 5000000

s_samples = [f"s_{i}" for i in range(1, 101)]
i_samples = [f"i_{i}" for i in range(1, 201)]
bool_samples = [True, False]

ssamples = np.random.choice(s_samples, n)
isamples = np.random.choice(i_samples, n)
d_values = np.random.randn(3, n)
b_values = np.random.choice(bool_samples, n)

df = pd.DataFrame(
    dict(s=ssamples, i=isamples, v1=d_values[0], v2=d_values[1], v3=d_values[2], f1=b_values, f2=b_values)
)

df.to_csv("sample.csv", index=False)

# read in data with different engine
df_new = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")
df_old = pd.read_csv("sample.csv")

# do the benchmark
%timeit df_new.groupby("s")["v1"].sum()
# >> 660 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_old.groupby("s")["v1"].sum()
# >> 311 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The new engine is 2x slower than the old engine.

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.9.16.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.1.2
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy :
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : 2.2.0
pyqt5 : None

Prior Performance

No response

wegamekinglc added the Needs Triage and Performance labels Mar 19, 2023
jbrockmendel added the Arrow label Mar 19, 2023
phofl (Member) commented Mar 19, 2023

This is expected; GroupBy isn't implemented for Arrow yet.

phofl added the Groupby label and removed the Needs Triage label Mar 19, 2023
wegamekinglc (Author) commented:

@phofl thanks for your comments

topper-123 (Contributor) commented:

Is there an explanation of the current limitations when using arrow? There have been several blog posts talking up the benefits of having arrow in pandas, and I think it would be a good idea to lay out the current performance limitations of using arrow in pandas.

jbrockmendel (Member) commented:

I'm working on a patch here and having trouble making a performant conversion from the ArrowArray to MaskedArray. The non-working method looks like:

def _to_masked(self):
    # method on ArrowExtensionArray; pa is pyarrow, np is numpy
    pa_dtype = self._pa_array.type
    if pa.types.is_floating(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Float{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import FloatingArray as arr_cls
    elif pa.types.is_unsigned_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"UInt{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_signed_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Int{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_boolean(pa_dtype):
        dtype = "boolean"
        np_dtype = "bool"
        from pandas.core.arrays import BooleanArray as arr_cls
    else:
        raise NotImplementedError

    data = self._pa_array.combine_chunks()
    buffs = data.buffers()
    assert len(buffs) == 2
    mask = self.isna()
    # this is the step in question: reading the raw values buffer
    arr = np.array(buffs[1], dtype=np_dtype)
    return arr_cls(arr, mask)

But it looks like this is not the correct way to get arr. @jorisvandenbossche how do I get the correct ndarray here?

phofl (Member) commented Apr 5, 2023

Did you try calling to_numpy on the ArrowExtensionArray and setting na_value to 1?

jbrockmendel (Member) commented:

That seems to work, thanks. The branch will be ready once the EA._groupby_op PR is merged.

jorisvandenbossche (Member) commented:

> Did you try calling to_numpy on the ArrowExtensionArray and setting na_value to 1?

That doesn't give you a masked array, though (if that's what is needed), and it will make a copy if there are missing values.

We normally already have this functionality in the __from_arrow__ methods on the MaskedArray classes. The easiest way to directly reuse this might be self._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get). But the underlying functionality could maybe be refactored as a more general utility to use internally (pyarrow returns a Series here, which we don't need).

phofl (Member) commented Apr 7, 2023

@jorisvandenbossche Regarding the copy: I am aware that we create a copy when we have missing values, but wouldn't arrow do the same?

jorisvandenbossche (Member) commented:

We only need to copy the bitmask (to convert it into a bytemask). The actual data (buffs[1] in Brock's snippet above) doesn't need to be copied, even if there are missing values.
Relying on pyarrow's to_numpy / __array__, on the other hand, will always give a full copy of the actual data once there are missing values.

phofl (Member) commented Apr 25, 2023

@jorisvandenbossche I looked into this, and to_pandas is 5 times slower than the to_numpy approach. Not sure what's going on under the hood, but I guess one of the problems is that we get back a Series.

lukemanley (Member) commented:

I took a look at this too and noticed pd.read_csv was returning pyarrow chunked arrays with quite a few chunks (using the example in the OP):

df_new.apply(lambda x: x.array._pa_array.num_chunks)

s     383
i     383
v1    383
v2    383
v3    383
f1    383
f2    383
dtype: int64

That doesn't account for the full 5x, but I see some improvement if you combine the chunks first:

arr = df_new["v1"].array._pa_array

%timeit pd.Float64Dtype().__from_arrow__(arr)
# 17.6 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
# 12.1 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jorisvandenbossche (Member) commented:

Also using the example from the OP, I see an even bigger difference:

In [36]: %timeit df_new["v1"].array._pa_array.to_numpy()
20.5 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [38]: %timeit df_new["v1"].array._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get)
270 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).

Comparing directly with __from_arrow__ (instead of going through to_pandas) as Luke did is therefore indeed a more useful comparison. I see:

In [40]: arr = df_new["v1"].array._pa_array

In [41]: %timeit arr.to_numpy()
20.1 ms ± 917 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %timeit pd.Float64Dtype().__from_arrow__(arr)
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [43]: %timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
24.9 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Doing the combine_chunks indeed helps, and I am not sure why we don't actually call this inside __from_arrow__ instead of looping over each chunk, converting it, and then concatenating our extension arrays (calling _concat_same_type).
Further, even with combining chunks, this is still a bit slower; looking at a basic profile of pd.Float64Dtype().__from_arrow__(arr.combine_chunks()), this seems largely due to an additional copy of the data (which doesn't happen in to_numpy()):

data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.numpy_dtype)
num_arr = array_class(data.copy(), ~mask, copy=False)

In general, for __from_arrow__ on the masked arrays, that copy might make sense (to ensure the arrays are writeable/mutable; although if you have multiple chunks we concatenate anyway, so we should actually only do this copy when there is a single chunk). But in the context of just getting the data to feed into a groupby algorithm, this copy is wasteful.


The specific example data we are using here also doesn't have missing values. To illustrate that converting to data+mask is faster than to_numpy when you do have missing values (using pyarrow_array_to_numpy_and_mask directly, to avoid the chunk combining and the unnecessary copy in __from_arrow__):

In [66]: nparr = np.random.randn(10_000_000)

In [67]: nparr[0] = np.nan

In [68]: arr = pa.array(nparr, from_pandas=True)

In [69]: %timeit arr.to_numpy(zero_copy_only=False)
39.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [70]: from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

In [71]: %timeit pyarrow_array_to_numpy_and_mask(arr, np.dtype("float64"))
15.3 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

jorisvandenbossche (Member) commented:

> But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).

This was actually also an issue in pyarrow, not just pandas: in ChunkedArray.to_pandas(), we were not actually calling __from_arrow__, but converting the array as normal (in this case to a single numpy float array, just like to_numpy) and then passing it to the Series constructor with the specified dtype, i.e. doing pd.Series(nparr, dtype=pd.Float64Dtype()). And that is rather slow on our side (we have to check for NaNs to convert them to a mask, so I'm not sure that is avoidable).

But this should be fixed on the pyarrow side in the upcoming 12.0 release (where ChunkedArray.to_pandas will now correctly go through __from_arrow__, xref apache/arrow#34559).
