[Python] Arrow to Python list (to_pylist) conversion is slow #28694

Open · asfimport opened this issue Jun 4, 2021 · 15 comments

asfimport commented Jun 4, 2021

It seems that we are 20x slower than NumPy for converting the exact same data to a Python list.

With integers:

>>> arr = np.arange(0,1000, dtype=np.int64)
>>> %timeit arr.tolist()
8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> parr = pa.array(arr)
>>> %timeit parr.to_pylist()
218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With floats:

>>> arr = np.arange(0,1000, dtype=np.float64)
>>> %timeit arr.tolist()
10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> parr = pa.array(arr)
>>> %timeit parr.to_pylist()
199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-12976. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
I see a slightly smaller difference, but it's anyway still big (10 µs vs 200 µs, instead of the originally reported 800 µs; maybe due to a release/debug build difference?).

I suppose the main reason for the slowness here is that to_pylist is currently implemented by accessing each element of the array (getitem), wrapping it in a pyarrow Scalar object, and then converting that scalar to a native Python object (as_py()). I can imagine this getitem / scalar wrapping adds quite some overhead.
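In other words, what to_pylist effectively does today is something like this pure-Python sketch (an illustration of the steps described above, not the actual implementation):

    import pyarrow as pa

    def to_pylist_slow(arr: pa.Array) -> list:
        # Each arr[i] allocates a pyarrow Scalar wrapper object;
        # as_py() then converts that scalar to a native Python object.
        return [arr[i].as_py() for i in range(len(arr))]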

Antoine Pitrou / @pitrou:
Oops, yes, I had posted my numbers from a debug build :-o My bad! I'll edit the issue.

Joris Van den Bossche / @jorisvandenbossche:
Happens to me all the time ;)

Joris Van den Bossche / @jorisvandenbossche:
I was thinking that we could in principle fold the different steps we now take into a single specialized to_pylist function in Cython, on a concrete array class, e.g. for Int64Array:

    def to_pylist_primitive(self):
        cdef:
            shared_ptr[CScalar] scalar

        res = []
        for i in range(len(self)):
            # Still creates the C++ scalar, but skips wrapping it in a
            # Python-level Scalar object and calling as_py() on it.
            scalar = GetResultValue(self.ap.GetScalar(i))
            if scalar.get().is_valid:
                res.append((<CInt64Scalar*> scalar.get()).value)
            else:
                res.append(None)
        return res

which still creates the C++ scalars, but avoids wrapping them in the Python (Cython) class, and avoids some function call overhead. Now, this only gives a 2x speed-up (and is thus still 10x slower than numpy), and a large part of the remaining time is spent in the C++ Array::GetScalar.

So if we really want to improve this, it might need some more custom code at the C++ level to get vectorized access to the raw data (at least for primitive arrays). But I'm not really sure this is worth it for to_pylist.

Antoine Pitrou / @pitrou:
Yes, the only reasonable way to speed this up is to bypass Scalar creation entirely (which is slow).

Since we spent quite some time optimizing the Python-to-Arrow case, it seems worthwhile to optimize the reverse path.

cc @wesm for opinions.

Joris Van den Bossche / @jorisvandenbossche:
Actually, the APIs are already available to avoid the Scalar creation, for example for Int64Array:

    def to_pylist(self):
        cdef:
            # Cast to the concrete array type for direct typed access.
            CInt64Array* int_arr = <CInt64Array*> self.ap
            int64_t val

        res = []
        for i in range(len(self)):
            # Typed value access: no Scalar object is created at all.
            if int_arr.IsValid(i):
                val = int_arr.Value(i)
                res.append(val)
            else:
                res.append(None)
        return res

This gives me 13 µs for the example case, which is now almost the same as numpy's tolist.

This might certainly be worth including. Are there ways in Cython to avoid having to duplicate this for each of the Int8/Int16/Int... Array classes? The problem is that the return type of NumericArray::Value(i) is type-dependent.

Antoine Pitrou / @pitrou:
No, Cython doesn't allow generating templated C++ code.
To avoid repeating oneself, it's best to do it in C++. That can also help make it faster.

Micah Kornfield / @emkornfield:
@pitrou @jorisvandenbossche I'm going to see if I can consolidate this logic in C++ (unless you were thinking of taking it up). Any preference between splitting this into smaller PRs or one large PR that migrates all types to C++?

Antoine Pitrou / @pitrou:
Migrating all types at once sounds more logical to me. I wonder what the overall approach should be, though.

Micah Kornfield / @emkornfield:
One thing we discussed on the sync call is whether a more explicit API should be provided to control the coercion of timestamp[ns] to pd.Timestamp, instead of the current behavior, which does the conversion if pandas is installed but falls back to datetime (checking that nanoseconds == 0) if it is not. @jorisvandenbossche any thoughts here?
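A minimal illustration of the behaviour described above (assuming pandas is installed; the no-pandas fallback is as described in this comment, not shown here):

    import pyarrow as pa

    # One nanosecond past the epoch: not representable in datetime.datetime,
    # which only has microsecond resolution.
    arr = pa.array([1], type=pa.timestamp("ns"))
    # With pandas installed, as_py() coerces to pd.Timestamp and keeps
    # nanosecond precision; without pandas, this nonzero-nanoseconds value
    # would raise instead of silently truncating.
    print(arr[0].as_py())  # Timestamp('1970-01-01 00:00:00.000000001')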

Joris Van den Bossche / @jorisvandenbossche:
@emkornfield this is specifically for the scalar as_py() and array to_pylist() behaviour, right? (and not the Table.to_pandas)

Personally I would be fine with a more explicit API (the idea would be to add a keyword to those functions to explicitly ask for a pandas object?). But some concerns:

  1. changing to use datetime.datetime instead of pd.Timestamp by default for ns resolution would be a backwards incompatible change. How do we see that? Just change, or deprecate first? (it seems a bit annoying to deprecate, although if we add a keyword, that can directly be used to silence the warning, and I suppose those functions are not used that much anyway)

  2. If you have nanoseconds in the timestamp value, does that mean we would raise an error by default (the one we raise now if pandas is not installed)? That doesn't feel like a super nice user experience, but I suppose it is the inevitable consequence of a more explicit API.

Micah Kornfield / @emkornfield:
Yeah, given concerns 1 and 2, I think I'll try to simply replicate the existing behavior in C++, even though it can lead to unexpected behavior.

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@hitesh-pathak:
Just started using PyArrow. Are there any updates or recommended practices on this issue? This would significantly speed up "iterating through rows" for me.
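For what it's worth, one possible workaround (a sketch based on the numbers above, not an officially recommended practice from this thread) is to go through NumPy, whose tolist() is the fast path benchmarked in the issue:

    import numpy as np
    import pyarrow as pa

    parr = pa.array(np.arange(0, 1000, dtype=np.int64))
    # to_numpy() is zero-copy for primitive arrays without nulls;
    # zero_copy_only=False also allows arrays with nulls (nulls become NaN,
    # which changes the resulting dtype to float).
    pylist = parr.to_numpy(zero_copy_only=False).tolist()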
