PERF: ArrowExtensionArray.to_numpy(dtype=object) #51227

Merged: 2 commits, Feb 9, 2023

Member

@lukemanley lukemanley commented Feb 8, 2023

Perf improvement for ArrowExtensionArray.to_numpy(dtype=object).

(dtype=object is not uncommon, see discussion in #22791)

Some examples where it has an impact:

import pandas as pd
import numpy as np

data = np.random.randn(1_000_000, 10)
df = pd.DataFrame(data, dtype="float64[pyarrow]")

%timeit df.sum(axis=1)

# 4.04 s ± 53.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 796 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

%timeit df.clip(0.0, 0.1)

# 4.82 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
# 1.51 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Feb 8, 2023
@jbrockmendel
Member

dumb question: why isn't self._data.to_numpy() good enough here?

@lukemanley
Member Author

> dumb question: why isn't self._data.to_numpy() good enough here?

I believe there is a desire to maintain the raw dtype (even if wrapped in an object array) and use pd.NA by default.

For instance, self._data.to_numpy() converts int to float when nulls are present:

In [1]: import pandas as pd

In [2]: arr = pd.array([1, None], dtype="int64[pyarrow]")

In [3]: arr._data.to_numpy()
Out[3]: array([ 1., nan])

I'm not sure it's required, but the current implementation mostly aligns with BaseMaskedArray:

In [4]: arr = pd.array([1, None], dtype="Int64")

In [5]: arr.to_numpy()
Out[5]: array([1, <NA>], dtype=object)

In [6]: arr = pd.array([1, None], dtype="int64[pyarrow]")

In [7]: arr.to_numpy()
Out[7]: array([1, <NA>], dtype=object)

@mroeschke mroeschke added this to the 2.0 milestone Feb 9, 2023
@mroeschke mroeschke merged commit 7ffc0ad into pandas-dev:main Feb 9, 2023
@mroeschke
Member

Thanks @lukemanley

@phofl
Member

phofl commented Feb 9, 2023

Not totally sure, but it looks like this is failing the CI on main. Any idea why?

@mroeschke
Member

Ah darn. Should we revert this in the meantime?

@phofl
Member

phofl commented Feb 9, 2023

Probably a good idea, if we don't know how to fix it

@mroeschke
Member

Yeah, reverting this change locally fixes the test failure

mroeschke pushed a commit that referenced this pull request Feb 9, 2023
Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object) (#51227)"

This reverts commit 7ffc0ad.
Labels
Arrow pyarrow functionality Performance Memory or execution speed performance