[Python] Arrow to Python list (to_pylist) conversion is slow #28694

Open · asfimport opened this issue Jun 4, 2021 · 15 comments

asfimport commented Jun 4, 2021

It seems that we are 20x slower than NumPy for converting the exact same data to a Python list.

With integers:

>>> arr = np.arange(0,1000, dtype=np.int64)
>>> %timeit arr.tolist()
8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> parr = pa.array(arr)
>>> %timeit parr.to_pylist()
218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With floats:

>>> arr = np.arange(0,1000, dtype=np.float64)
>>> %timeit arr.tolist()
10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> parr = pa.array(arr)
>>> %timeit parr.to_pylist()
199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-12976. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
I see a slightly smaller difference, but it's anyway still big (10 µs vs 200 µs, instead of the originally reported 800 µs; maybe due to a release/debug build difference?).

I suppose the main reason for the slowness here is that to_pylist is currently implemented by accessing each element of the array (getitem), wrapping it in a pyarrow Scalar object, and then converting that scalar to a native Python object (as_py()). I can imagine this getitem / scalar wrapping adds quite some overhead.
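In other words, what to_pylist effectively does today is something like this pure-Python sketch (an illustration of the steps described above, not the actual implementation):

    import pyarrow as pa

    def to_pylist_slow(arr: pa.Array) -> list:
        # Each arr[i] allocates a pyarrow Scalar wrapper object;
        # as_py() then converts that scalar to a native Python object.
        return [arr[i].as_py() for i in range(len(arr))]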

Antoine Pitrou / @pitrou:
Oops, yes, I had posted my numbers from a debug build :-o My bad! I'll edit the issue.

Joris Van den Bossche / @jorisvandenbossche:
Happens to me all the time ;)

Joris Van den Bossche / @jorisvandenbossche:
I was thinking that we could in principle fold the different steps we now take into a single specialized to_pylist function in Cython, on a concrete array class, e.g. for Int64Array:

    def to_pylist_primitive(self):
        cdef:
            shared_ptr[CScalar] scalar

        res = []
        for i in range(len(self)):
            # Still creates the C++ scalar, but skips wrapping it in a
            # Python-level Scalar object and calling as_py() on it.
            scalar = GetResultValue(self.ap.GetScalar(i))
            if scalar.get().is_valid:
                res.append((<CInt64Scalar*> scalar.get()).value)
            else:
                res.append(None)
        return res

which still creates the C++ scalars, but avoids wrapping them in the Python (Cython) class, and avoids some function call overhead. Now, this only gives a 2x speed-up (and is thus still 10x slower than numpy), and a large part of the remaining time is spent in the C++ Array::GetScalar.

So if we really want to improve this, it might need some more custom code at the C++ level to get vectorized access to the raw data (at least for primitive arrays). But I'm not really sure this is worth it for to_pylist.

Antoine Pitrou / @pitrou:
Yes, the only reasonable way to speed this up is to bypass Scalar creation entirely (which is slow).

Since we spent quite some time optimizing the Python-to-Arrow case, it seems worthwhile to optimize the reverse path.

cc @wesm for opinions.

Joris Van den Bossche / @jorisvandenbossche:
Actually, the APIs are already available to avoid the Scalar creation, for example for Int64Array:

    def to_pylist(self):
        cdef:
            # Cast to the concrete array type for direct typed access.
            CInt64Array* int_arr = <CInt64Array*> self.ap
            int64_t val

        res = []
        for i in range(len(self)):
            # Typed value access: no Scalar object is created at all.
            if int_arr.IsValid(i):
                val = int_arr.Value(i)
                res.append(val)
            else:
                res.append(None)
        return res

This gives me 13 µs for the example case, which is now almost the same as numpy's tolist.

This might certainly be worth including. Are there ways in Cython to avoid having to duplicate this for each of the Int8/Int16/Int... Array classes? The problem is that the return type of NumericArray::Value(i) is type-dependent.

Antoine Pitrou / @pitrou:
No, Cython doesn't allow generating templated C++ code.
To avoid repeating oneself, it's best to do it in C++. That can also help make it faster.

Micah Kornfield / @emkornfield:
@pitrou @jorisvandenbossche I'm going to see if I can consolidate this logic in C++ (unless you were thinking of taking it up). Any preference between splitting this into smaller PRs or one large PR that migrates all types to C++?

Antoine Pitrou / @pitrou:
Migrating all types at once sounds more logical to me. I wonder what the overall approach should be, though.

Micah Kornfield / @emkornfield:
One thing we discussed on the sync call is whether a more explicit API should be provided to control the coercion of timestamp[ns] to pd.Timestamp, instead of the current behavior, which does the conversion if pandas is installed but falls back to datetime (checking that nanoseconds == 0) if it is not. @jorisvandenbossche any thoughts here?
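A minimal illustration of the behaviour described above (assuming pandas is installed; the no-pandas fallback is as described in this comment, not shown here):

    import pyarrow as pa

    # One nanosecond past the epoch: not representable in datetime.datetime,
    # which only has microsecond resolution.
    arr = pa.array([1], type=pa.timestamp("ns"))
    # With pandas installed, as_py() coerces to pd.Timestamp and keeps
    # nanosecond precision; without pandas, this nonzero-nanoseconds value
    # would raise instead of silently truncating.
    print(arr[0].as_py())  # Timestamp('1970-01-01 00:00:00.000000001')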

Joris Van den Bossche / @jorisvandenbossche:
@emkornfield this is specifically for the scalar as_py() and array to_pylist() behaviour, right? (and not the Table.to_pandas)

Personally I would be fine with a more explicit API (the idea would be to add a keyword to those functions to explicitly ask for a pandas object?). But some concerns:

  1. changing to use datetime.datetime instead of pd.Timestamp by default for ns resolution would be a backwards incompatible change. How do we see that? Just change, or deprecate first? (it seems a bit annoying to deprecate, although if we add a keyword, that can directly be used to silence the warning, and I suppose those functions are not used that much anyway)

  2. If you have nanoseconds in the timestamp value, does that mean we would raise an error by default (the one we raise now if pandas is not installed)? That doesn't feel like a super nice user experience, but I suppose it is the inevitable consequence of a more explicit API.

Micah Kornfield / @emkornfield:
Yeah, given concerns 1 and 2, I think I'll try to simply replicate the existing behavior in C++, even though it can lead to unexpected behavior.

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@hitesh-pathak:
Just started using PyArrow. Are there any updates or recommended practices on this issue? This would significantly speed up "iterating through rows" for me.
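For what it's worth, one possible workaround (a sketch based on the numbers above, not an officially recommended practice from this thread) is to go through NumPy, whose tolist() is the fast path benchmarked in the issue:

    import numpy as np
    import pyarrow as pa

    parr = pa.array(np.arange(0, 1000, dtype=np.int64))
    # to_numpy() is zero-copy for primitive arrays without nulls;
    # zero_copy_only=False also allows arrays with nulls (nulls become NaN,
    # which changes the resulting dtype to float).
    pylist = parr.to_numpy(zero_copy_only=False).tolist()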
