[Python] Arrow to Python list (to_pylist) conversion is slow #28694
Comments
Joris Van den Bossche / @jorisvandenbossche: I suppose the main reason for the slowness here is that |
Antoine Pitrou / @pitrou: |
Joris Van den Bossche / @jorisvandenbossche: |
Joris Van den Bossche / @jorisvandenbossche:

    def to_pylist_primitive(self):
        cdef:
            shared_ptr[CScalar] scalar
        res = []
        for i in range(len(self)):
            # Still allocates one C++ scalar per element
            scalar = GetResultValue(self.ap.GetScalar(i))
            if scalar.get().is_valid:
                res.append((<CInt64Scalar*> scalar.get()).value)
            else:
                res.append(None)
        return res

which still creates the C scalars, but avoids wrapping them in the Python (Cython) class, and avoids some function call overhead. Now, this only gives a 2x speed-up (and is thus still 10x slower than numpy), and a large part of the remaining time is spent in the C++ So if we really want to improve this, it might need some more custom code at the C++ level to get vectorized access to raw data (at least for primitive arrays). But not really sure this is worth it for |
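To make the "vectorized access to raw data" idea concrete, here is a minimal pure-Python sketch, not PyArrow code. It assumes a primitive int64 array is available as a little-endian values buffer plus an optional validity bitmap with least-significant-bit ordering, as in the Arrow columnar format. The function name `decode_int64_with_validity` and its parameters are hypothetical and exist only for illustration.

```python
import struct


def decode_int64_with_validity(values_buf, validity_bitmap, length):
    """Hypothetical helper: decode `length` little-endian int64 values,
    yielding None wherever the validity bit is 0."""
    # One bulk unpack of the whole buffer instead of one scalar per element.
    values = struct.unpack_from("<%dq" % length, values_buf)
    if validity_bitmap is None:
        # No bitmap means no nulls (assumption of this sketch).
        return list(values)
    out = []
    for i, v in enumerate(values):
        # Bit i lives in byte i // 8 at bit position i % 8 (LSB first).
        is_valid = (validity_bitmap[i >> 3] >> (i & 7)) & 1
        out.append(v if is_valid else None)
    return out


# Four values; the bitmap marks elements 0, 1 and 3 as valid, element 2 as null.
values_buf = struct.pack("<4q", 1, 2, 3, 4)
validity = bytes([0b00001011])
result = decode_int64_with_validity(values_buf, validity, 4)
```

The point of the sketch is only the access pattern: the values are read in bulk from the buffer and the null check is a bit test, so no per-element scalar object is created.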
Antoine Pitrou / @pitrou: Since we spent quite some time optimizing the Python-to-Arrow case, it seems worthwhile to optimize the reverse path. cc @wesm for opinions. |
Joris Van den Bossche / @jorisvandenbossche:

    def to_pylist(self):
        cdef:
            # Cast once to the concrete array type for direct value access
            CInt64Array* int_arr = (<CInt64Array*> self.ap)
            int64_t val
        res = []
        for i in range(len(self)):
            if int_arr.IsValid(i):
                val = int_arr.Value(i)
                res.append(val)
            else:
                res.append(None)
        return res

This gives me 13µs for the example case, which is now almost the same as for the numpy tolist. This might certainly be worth including. Are there ways in Cython to avoid having to duplicate this in each of Int8/Int16/Int... Array? The problem is that |
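On the duplication question, one possible shape for a single type-generic routine is a lookup table keyed by element type. The sketch below is plain Python rather than Cython, so it does not answer how to do this against the C++ array classes. The table `_STRUCT_CODES`, its key names, and the function `primitive_to_pylist` are all hypothetical, and the buffers are assumed to be little-endian with an LSB-first validity bitmap.

```python
import struct

# Illustrative labels mapped to struct format codes. With the "<" prefix,
# struct uses standard sizes: b=1, h=2, i=4, q=8, f=4, d=8 bytes.
_STRUCT_CODES = {
    "int8": "b", "int16": "h", "int32": "i", "int64": "q",
    "uint8": "B", "uint16": "H", "uint32": "I", "uint64": "Q",
    "float32": "f", "float64": "d",
}


def primitive_to_pylist(type_name, values_buf, validity_bitmap, length):
    """Hypothetical generic decoder: one code path for every primitive width."""
    code = _STRUCT_CODES[type_name]
    values = struct.unpack_from("<%d%s" % (length, code), values_buf)
    if validity_bitmap is None:
        return list(values)
    return [
        v if (validity_bitmap[i >> 3] >> (i & 7)) & 1 else None
        for i, v in enumerate(values)
    ]


# int16 with a null in the middle, and float64 without a bitmap.
ints = primitive_to_pylist("int16", struct.pack("<3h", 10, 20, -3), bytes([0b101]), 3)
floats = primitive_to_pylist("float64", struct.pack("<2d", 1.5, 2.5), None, 2)
```

The design choice is that only the table grows when a new primitive type is added, while the loop and the null handling are written once.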
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield: |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield: |
Joris Van den Bossche / @jorisvandenbossche: Personally I would be fine with a more explicit API (the idea would be to add a keyword to those functions to explicitly ask for a pandas object?). But some concerns:
|
Micah Kornfield / @emkornfield: |
Todd Farmer / @toddfarmer: |
Just started using PyArrow. Are there any updates or recommended practices for this issue? A fix would significantly speed up "iterating through rows" for me. |
It seems that we are 20x slower than NumPy when converting the exact same data to a Python list.
With integers:
With floats:
Reporter: Antoine Pitrou / @pitrou
Related issues:
Note: This issue was originally created as ARROW-12976. Please see the migration documentation for further details.