PERF: group by manipulation is slower with new arrow engine #52070

Closed
3 tasks done
wegamekinglc opened this issue Mar 19, 2023 · 13 comments · Fixed by #52469
Labels
Arrow (pyarrow functionality), Groupby, Performance (memory or execution speed)

Comments

wegamekinglc commented Mar 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The following code reproduces my issue:

import numpy as np
import pandas as pd

# generate a bunch of sample data for later use
n = 5000000

s_samples = [f"s_{i}" for i in range(1, 101)]
i_samples = [f"i_{i}" for i in range(1, 201)]
bool_samples = [True, False]

ssamples = np.random.choice(s_samples, n)
isamples = np.random.choice(i_samples, n)
d_values = np.random.randn(3, n)
b_values = np.random.choice(bool_samples, n)

df = pd.DataFrame(
    dict(s=ssamples, i=isamples, v1=d_values[0], v2=d_values[1], v3=d_values[2], f1=b_values, f2=b_values)
)

df.to_csv("sample.csv", index=False)

# read in data with different engine
df_new = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")
df_old = pd.read_csv("sample.csv")

# do the benchmark
%timeit df_new.groupby("s")["v1"].sum()
# >> 660 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_old.groupby("s")["v1"].sum()
# >> 311 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The new engine is 2x slower than the old engine.

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.9.16.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.1.2
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy :
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : 2.2.0
pyqt5 : None

Prior Performance

No response

wegamekinglc added the Needs Triage and Performance labels Mar 19, 2023
jbrockmendel added the Arrow label Mar 19, 2023
phofl (Member) commented Mar 19, 2023

This is expected; GroupBy isn't implemented for Arrow yet.

phofl added the Groupby label and removed the Needs Triage label Mar 19, 2023
wegamekinglc (Author) commented:

@phofl thanks for your comments

topper-123 (Contributor) commented:

Is there an explanation of the current limitations when using arrow? There have been several blog posts talking up the benefits of having arrow in pandas, and I think it would be a good idea to lay out the current performance limitations of using arrow in pandas.

jbrockmendel (Member) commented:

I'm working on a patch here and having trouble making a performant conversion from the ArrowArray to MaskedArray. The non-working method looks like:

def _to_masked(self):
    # method on ArrowExtensionArray; pa is pyarrow, np is numpy
    pa_dtype = self._pa_array.type
    if pa.types.is_floating(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Float{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import FloatingArray as arr_cls
    elif pa.types.is_unsigned_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"UInt{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_signed_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Int{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_boolean(pa_dtype):
        dtype = "boolean"
        np_dtype = "bool"
        from pandas.core.arrays import BooleanArray as arr_cls
    else:
        raise NotImplementedError

    data = self._pa_array.combine_chunks()
    buffs = data.buffers()
    assert len(buffs) == 2
    mask = self.isna()
    # this is the step in question: reading the raw values buffer
    arr = np.array(buffs[1], dtype=np_dtype)
    return arr_cls(arr, mask)

But it looks like this is not the correct way to get arr. @jorisvandenbossche how do I get the correct ndarray here?

phofl (Member) commented Apr 5, 2023

Did you try calling to_numpy on the ArrowExtensionArray and setting na_value to 1?

jbrockmendel (Member) commented:

That seems to work, thanks. The branch will be ready once the EA._groupby_op PR is merged.

jorisvandenbossche (Member) commented:

> Did you try calling to_numpy on the ArrowExtensionArray and setting na_value to 1?

That doesn't give you a masked array, though (if that's what is needed), and it will make a copy if there are missing values.

We normally already have this functionality in the __from_arrow__ methods on the MaskedArray classes. The easiest way to directly reuse this might be self._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get). But the underlying functionality could maybe be refactored as a more general utility to use internally (pyarrow returns a Series here, which we don't need).

phofl (Member) commented Apr 7, 2023

@jorisvandenbossche Regarding the copy: I am aware that we create a copy when we have missing values, but wouldn't arrow do the same?

jorisvandenbossche (Member) commented:

We only need to copy the bitmask (to convert it into a bytemask). The actual data (buffs[1] in Brock's snippet above) doesn't need to be copied, even if there are missing values.
Relying on pyarrow's to_numpy / __array__, on the other hand, will always give a full copy of the actual data once there are missing values.

phofl (Member) commented Apr 25, 2023

@jorisvandenbossche I looked into this, and to_pandas is 5 times slower than the to_numpy approach. Not sure what's going on under the hood, but I guess one of the problems is that we get back a Series.

lukemanley (Member) commented:

I took a look at this too and noticed pd.read_csv was returning pyarrow chunked arrays with quite a few chunks (using the example in the OP):

df_new.apply(lambda x: x.array._pa_array.num_chunks)

s     383
i     383
v1    383
v2    383
v3    383
f1    383
f2    383
dtype: int64

That doesn't account for the full 5x, but I see some improvement if you combine the chunks first:

arr = df_new["v1"].array._pa_array

%timeit pd.Float64Dtype().__from_arrow__(arr)
# 17.6 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
# 12.1 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jorisvandenbossche (Member) commented:

Also using the example from the OP, I see an even bigger difference:

In [36]: %timeit df_new["v1"].array._pa_array.to_numpy()
20.5 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [38]: %timeit df_new["v1"].array._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get)
270 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).

Comparing directly with __from_arrow__ (instead of going through to_pandas) as Luke did is therefore indeed a more useful comparison. I see:

In [40]: arr = df_new["v1"].array._pa_array

In [41]: %timeit arr.to_numpy()
20.1 ms ± 917 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %timeit pd.Float64Dtype().__from_arrow__(arr)
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [43]: %timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
24.9 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Doing the combine_chunks indeed helps, and I am not sure why we don't actually call this inside __from_arrow__ instead of looping over each chunk, converting it, and then concatenating our extension arrays (calling _concat_same_type).
Further, even with combining chunks, this is still a bit slower; looking at a basic profile of pd.Float64Dtype().__from_arrow__(arr.combine_chunks()), this seems largely due to an additional copy of the data (which doesn't happen in to_numpy()):

data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.numpy_dtype)
num_arr = array_class(data.copy(), ~mask, copy=False)

In general, for __from_arrow__ on the masked arrays, that copy might make sense (to ensure the arrays are writeable/mutable; although if you have multiple chunks we concatenate anyway, so we should actually only do this copy when there is a single chunk). But in the context of just getting the data to feed into a groupby algorithm, this copy is wasteful.


The specific example data we are using here also doesn't have missing values. To illustrate that converting to data+mask is faster than to_numpy when you do have missing values (using pyarrow_array_to_numpy_and_mask directly, to avoid the chunk combining and the unnecessary copy in __from_arrow__):

In [66]: nparr = np.random.randn(10_000_000)

In [67]: nparr[0] = np.nan

In [68]: arr = pa.array(nparr, from_pandas=True)

In [69]: %timeit arr.to_numpy(zero_copy_only=False)
39.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [70]: from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

In [71]: %timeit pyarrow_array_to_numpy_and_mask(arr, np.dtype("float64"))
15.3 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

jorisvandenbossche (Member) commented:

> But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).

This was actually also an issue in pyarrow, not just pandas: in ChunkedArray.to_pandas(), we were not actually calling __from_arrow__, but converting the array as normal (in this case to a single numpy float array, just like to_numpy) and then passing it to the Series constructor with the specified dtype, i.e. doing pd.Series(nparr, dtype=pd.Float64Dtype()). And that is rather slow on our side (we have to check for NaNs to convert them to a mask, so I'm not sure that is avoidable).

But this should be fixed on the pyarrow side in the upcoming 12.0 release (where ChunkedArray.to_pandas will now correctly go through __from_arrow__, xref apache/arrow#34559).
