PERF: ArrowExtensionArray.to_numpy(dtype=object) #51227

lukemanley · 2023-02-08T02:12:22Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Perf improvement for ArrowExtensionArray.to_numpy(dtype=object).

(dtype=object is not uncommon, see discussion in #22791)

Some examples where it has an impact:

import pandas as pd
import numpy as np

data = np.random.randn(1_000_000, 10)
df = pd.DataFrame(data, dtype="float64[pyarrow]")

%timeit df.sum(axis=1)

# 4.04 s ± 53.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 796 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

%timeit df.clip(0.0, 0.1)

# 4.82 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
# 1.51 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

jbrockmendel · 2023-02-08T15:18:37Z

dumb question: why isn't self._data.to_numpy() good enough here?

lukemanley · 2023-02-08T22:41:36Z

> dumb question: why isn't self._data.to_numpy() good enough here?

I believe there is a desire to maintain the raw dtype (even if wrapped in an object array) and use pd.NA by default.

For instance, this converts int to float when nulls are present:

In [1]: import pandas as pd

In [2]: arr = pd.array([1, None], dtype="int64[pyarrow]")

In [3]: arr._data.to_numpy()
Out[3]: array([ 1., nan])

I'm not sure it's required, but the current implementation mostly aligns with BaseMaskedArray:

In [4]: arr = pd.array([1, None], dtype="Int64")

In [5]: arr.to_numpy()
Out[5]: array([1, <NA>], dtype=object)

In [6]: arr = pd.array([1, None], dtype="int64[pyarrow]")

In [7]: arr.to_numpy()
Out[7]: array([1, <NA>], dtype=object)

mroeschke · 2023-02-09T17:33:50Z

Thanks @lukemanley

phofl · 2023-02-09T18:23:23Z

Not totally sure, but it looks like this is failing the CI on main. Any idea why?

mroeschke · 2023-02-09T18:31:22Z

Ah darn. Should we revert this in the meantime?

phofl · 2023-02-09T18:32:22Z

Probably a good idea, if we don't know how to fix

This reverts commit 7ffc0ad.

mroeschke · 2023-02-09T18:37:37Z

Yeah locally reverting this change fixes the test failure

Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object) (#51227)" This reverts commit 7ffc0ad.

PERF: ArrowExtensionArray.to_numpy(dtype=object)

0f99ac3

lukemanley added the Performance (Memory or execution speed performance) and Arrow (pyarrow functionality) labels Feb 8, 2023

gh refs

f098c58

mroeschke approved these changes Feb 9, 2023

mroeschke added this to the 2.0 milestone Feb 9, 2023

mroeschke merged commit 7ffc0ad into pandas-dev:main Feb 9, 2023

phofl mentioned this pull request Feb 9, 2023

Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object)" #51273

Merged

phofl added a commit that referenced this pull request Feb 9, 2023

Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object) (#51227)"

645a244

This reverts commit 7ffc0ad.

mroeschke pushed a commit that referenced this pull request Feb 9, 2023

Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object)" (#51273)

3e8c3b0

Revert "PERF: ArrowExtensionArray.to_numpy(dtype=object) (#51227)" This reverts commit 7ffc0ad.

lukemanley mentioned this pull request Feb 12, 2023

PERF: ArrowExtensionArray.to_numpy(dtype=object) #51347

Merged

4 tasks
