ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow #51769
Conversation
Might want to add something to the docstring for pd.DataFrame()
pandas/core/frame.py (Outdated)

@@ -665,6 +666,17 @@ def __init__(
        NDFrame.__init__(self, data)
        return

        try:
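The diff body is truncated here, but a hedged sketch of this kind of constructor check shows why the benchmark below is slower when pyarrow is absent: a failed import is not cached in sys.modules, so every DataFrame(...) call re-searches the import path. The helper name and exact control flow here are assumptions for illustration only.

```python
def _maybe_convert_pyarrow(data):
    # Attempted on every DataFrame construction; when pyarrow is not
    # installed, this import fails anew each time, which costs tens of
    # microseconds per call.
    try:
        import pyarrow
    except ImportError:
        return None
    if isinstance(data, pyarrow.Table):
        return data.to_pandas()
    return None
```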
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer pointing users to to_pandas for now. This solution does not respect pandas_metadata, which seems unfortunate. Also, this adds quite significant overhead when pyarrow is not installed (arr is a NumPy array):
# pyarrow not installed (the import is attempted and fails on every call):
%timeit pd.DataFrame(arr)
86.6 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# pyarrow installed (the import succeeds and is cached):
%timeit pd.DataFrame(arr)
7.49 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
I'd prefer something like pd.DataFrame.from_arrow or similar
Interesting finding. Not sure it's necessarily a problem: the overhead is constant, it doesn't grow with the size or type of the data being loaded, and it seems to be a fixed 80 microseconds (on my machine it takes the same when pyarrow is not installed, but twice as long as yours when it is). I understand the Python interpreter looks in more places when the module is not immediately found, but that's something like 6 or 12 times more than when it's found.
In absolute terms a constant 80 microseconds doesn't seem that bad, but since it's 10 times more than a normal use case, it seems like a fair point to avoid. At least until PyArrow is a required dependency (or never, since we should probably deprecate having a constructor that loads everything in favor of a factory approach).
For the metadata, can you clarify which metadata is not being respected?
The following should preserve the index:

import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
DataFrame(table)

This returns:

   a  __index_level_0__
0  1                  1
1  2                  2
2  3                  3
to_pandas does this automatically; we have the same problem in read_parquet and friends, which is why I put up #51766.
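For reference, Table.to_pandas restores the index from the pandas metadata that pyarrow.table() stores on the schema, so the round trip works there. A small demo, assuming pyarrow is installed:

```python
import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
# to_pandas reads the stored pandas metadata and rebuilds the index
# instead of leaving it as an __index_level_0__ column.
print(table.to_pandas())
#    a
# 1  1
# 2  2
# 3  3
```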
Yep, very good point. I didn't think about the index; I see there is some more logic in the transformation: https://github.com/apache/arrow/blob/68e89ac9ee5bf2c0c68493eb9405554162fdfab8/python/pyarrow/pandas_compat.py#L797
I guess it doesn't make sense to replicate all that logic in our code, since PyArrow already has it. I guess DataFrame.from_arrow as a wrapper to Table.to_pandas (using the same parameters) is what makes more sense, no?
Yep, agreed. Users can use set_index and rename if they want to add an index or rename columns anyway, so there's no real need for the constructor functionality.
Now that I think about it, I'm not sure if it's better to use from_arrow or from_pyarrow. My bet is that another Arrow implementation in Python is likely to be developed eventually. Polars must have reimplemented PyArrow in a way, and whether that code is moved out of Polars or another project is created, I guess it'll end up happening.
Maybe I'm overthinking it, but at that point, is it more natural to have something like DataFrame.from_pyarrow and DataFrame.from_arrow2, or just one DataFrame.from_arrow that supports both? Small preference for the former, but happy to hear other opinions.
This is not in line with other construction methods, in particular the need to import. Would be +1 if we eventually make pyarrow a required dependency, but -1 now.
Thanks all for the feedback. Changed the approach here to use DataFrame.from_pyarrow and DataFrame.to_pyarrow. There are a couple of things that could still be added.
Also, there is a test failure, but it seems like it's an error in …
        1  2
        2  3
        """
        return table.to_pandas(types_mapper=ArrowDtype)
You should do an instance check here; e.g. if this is called with a non-Arrow table you would get a hard-to-understand error.
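A minimal sketch of the suggested guard (the method name and signature here are assumptions based on the diff; pd.ArrowDtype matches the types_mapper used above):

```python
import pyarrow as pa
import pandas as pd

def from_pyarrow(table: pa.Table) -> pd.DataFrame:
    # Fail early with a clear message instead of an opaque error from
    # deeper inside to_pandas.
    if not isinstance(table, pa.Table):
        raise TypeError(
            f"Expected a pyarrow.Table, got {type(table).__name__!r}"
        )
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```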
class TestPyArrow:
    @pytest.mark.parametrize(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a skip if pyarrow is not installed
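One standard way to do this with pytest (a sketch; pandas' test suite also has its own skip_if_no decorator in pandas.util._test_decorators, which may be the preferred convention here):

```python
import pytest

# Skips every test in this module when pyarrow is not installed.
pa = pytest.importorskip("pyarrow")
```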
Notes
-----
The conversion is zero-copy, and the resulting DataFrame uses
Not sure that the term "zero-copy" is well understood, so maybe elaborate?
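For reference, a small illustration of what "zero-copy" means here (assuming pyarrow is installed; the to_pandas call mirrors the snippet above): the resulting columns wrap the Table's existing Arrow buffers rather than being NumPy arrays copied from them.

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})
# No data is duplicated: each column is an ArrowExtensionArray backed
# by the same Arrow data that the Table holds.
df = table.to_pandas(types_mapper=pd.ArrowDtype)
print(df.dtypes)  # a    int64[pyarrow]
```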
Thanks all for the feedback, good points. I'll address the comments if there is consensus to move forward with this. @phofl thinks we could maybe just let people use PyArrow methods instead. I opened #51799 to discuss a new API that would allow PyArrow itself to implement the pandas methods instead.
To clarify, I am not sure about this. I just don't know if these functions add much value or are more likely to cause confusion. One benefit is that we can select a good default value for types_mapper, but I can't see much apart from that.
Yeah, would be likely to accept your PDEP; this is a bit ad hoc.
Cool. Let's close this in favor of #51799, which seems like a better approach.
A mention in the docs is missing, and I guess we can add pyarrow.Table to the type annotations of what DataFrame accepts, but this should be mostly done otherwise.