ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow #51769
Conversation
Might want to add something to the docstring for pd.DataFrame()
pandas/core/frame.py (Outdated)

@@ -665,6 +666,17 @@ def __init__(
        NDFrame.__init__(self, data)
        return

        try:
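The diff body is truncated here, but a hedged sketch of this kind of constructor check shows why the benchmark below is slower when pyarrow is absent: a failed import is not cached in sys.modules, so every DataFrame(...) call re-searches the import path. The helper name and exact control flow here are assumptions for illustration only.

```python
def _maybe_convert_pyarrow(data):
    # Attempted on every DataFrame construction; when pyarrow is not
    # installed, this import fails anew each time, which costs tens of
    # microseconds per call.
    try:
        import pyarrow
    except ImportError:
        return None
    if isinstance(data, pyarrow.Table):
        return data.to_pandas()
    return None
```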
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer pointing users to to_pandas for now. This solution does not respect pandas_metadata, which seems unfortunate. Also, this adds quite significant overhead when pyarrow is not installed (arr is a NumPy array):
# pyarrow not installed (the import is attempted and fails on every call):
%timeit pd.DataFrame(arr)
86.6 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# pyarrow installed (the import succeeds and is cached):
%timeit pd.DataFrame(arr)
7.49 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
I'd prefer something like pd.DataFrame.from_arrow or similar
Interesting finding. Not sure it's necessarily a problem: the overhead is constant, it doesn't grow with the size or type of the data being loaded, and it seems to be a fixed 80 microseconds (on my machine it takes the same when pyarrow is not installed, but twice as long as yours when it is). I understand the Python interpreter looks in more places when the module is not immediately found, but that's something like 6 or 12 times more than when it's found.
In absolute terms a constant 80 microseconds doesn't seem that bad, but since it's 10 times more than a normal use case, it seems like a fair point to avoid. At least until PyArrow is a required dependency (or never, since we should probably deprecate having a constructor that loads everything in favor of a factory approach).
For the metadata, can you clarify which metadata is not being respected?
The following should preserve the index:

import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
DataFrame(table)

This returns:

   a  __index_level_0__
0  1                  1
1  2                  2
2  3                  3
to_pandas does this automatically; we have the same problem in read_parquet and friends, which is why I put up #51766.
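For reference, Table.to_pandas restores the index from the pandas metadata that pyarrow.table() stores on the schema, so the round trip works there. A small demo, assuming pyarrow is installed:

```python
import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
# to_pandas reads the stored pandas metadata and rebuilds the index
# instead of leaving it as an __index_level_0__ column.
print(table.to_pandas())
#    a
# 1  1
# 2  2
# 3  3
```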
Yep, very good point. I didn't think about the index; I see there is some more logic in the transformation: https://github.com/apache/arrow/blob/68e89ac9ee5bf2c0c68493eb9405554162fdfab8/python/pyarrow/pandas_compat.py#L797
I guess it doesn't make sense to replicate all that logic in our code, since PyArrow already has it. I guess DataFrame.from_arrow as a wrapper to Table.to_pandas (using the same parameters) is what makes more sense, no?
Yep, agreed. Users can use set_index and rename if they want to add an index or rename columns anyway, so there's no real need for the constructor functionality.
Now that I think about it, I'm not sure if it's better to use from_arrow or from_pyarrow. My bet is that another Arrow implementation in Python is likely to be developed eventually. Polars must have reimplemented PyArrow in a way, and whether that code is moved out of Polars or another project is created, I guess it'll end up happening.
Maybe I'm overthinking it, but at that point, is it more natural to have something like DataFrame.from_pyarrow and DataFrame.from_arrow2, or just one DataFrame.from_arrow that supports both? Small preference for the former, but happy to hear other opinions.
This is not in line with other construction methods, in particular the need to import. Would be +1 if we eventually make pyarrow a required dependency, but -1 now.
Thanks all for the feedback. Changed the approach here to use DataFrame.from_pyarrow and DataFrame.to_pyarrow. There are a couple of things that could still be added.
Also, there is a test failure, but it seems like it's an error in …
        1  2
        2  3
        """
        return table.to_pandas(types_mapper=ArrowDtype)
You should do an instance check here; e.g. if this is called with a non-Arrow table you would get a hard-to-understand error.
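A minimal sketch of the suggested guard (the method name and signature here are assumptions based on the diff; pd.ArrowDtype matches the types_mapper used above):

```python
import pyarrow as pa
import pandas as pd

def from_pyarrow(table: pa.Table) -> pd.DataFrame:
    # Fail early with a clear message instead of an opaque error from
    # deeper inside to_pandas.
    if not isinstance(table, pa.Table):
        raise TypeError(
            f"Expected a pyarrow.Table, got {type(table).__name__!r}"
        )
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```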
class TestPyArrow:
    @pytest.mark.parametrize(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a skip if pyarrow is not installed
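One standard way to do this with pytest (a sketch; pandas' test suite also has its own skip_if_no decorator in pandas.util._test_decorators, which may be the preferred convention here):

```python
import pytest

# Skips every test in this module when pyarrow is not installed.
pa = pytest.importorskip("pyarrow")
```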
Notes
-----
The conversion is zero-copy, and the resulting DataFrame uses
Not sure that the term "zero-copy" is well understood, so maybe elaborate?
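For reference, a small illustration of what "zero-copy" means here (assuming pyarrow is installed; the to_pandas call mirrors the snippet above): the resulting columns wrap the Table's existing Arrow buffers rather than being NumPy arrays copied from them.

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})
# No data is duplicated: each column is an ArrowExtensionArray backed
# by the same Arrow data that the Table holds.
df = table.to_pandas(types_mapper=pd.ArrowDtype)
print(df.dtypes)  # a    int64[pyarrow]
```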
Thanks all for the feedback, good points. I'll address the comments if there is consensus to move forward with this. @phofl thinks we could maybe just let people use PyArrow methods instead. I opened #51799 to discuss a new API that would allow PyArrow itself to implement the pandas methods instead.
To clarify, I am not sure about this. I just don't know if these functions add much value or are more likely to cause confusion. One benefit is that we can select a good default value for types_mapper, but I can't see much apart from that.
Yeah, would be likely to accept your PDEP; this is a bit ad hoc.
Cool. Let's close this in favor of #51799, which seems like a better approach.
A mention in the docs is missing, and I guess we can add pyarrow.Table to the type annotations of what DataFrame accepts, but this should be mostly done otherwise.