
ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow #51769

Conversation

datapythonista (Member)

A mention in the docs is missing, and I guess we can add pyarrow.Table to the type annotations of what DataFrame accepts, but this should be mostly done otherwise.

@Dr-Irv (Contributor) left a comment


Might want to add something to the docstring for pd.DataFrame()

@@ -665,6 +666,17 @@ def __init__(
NDFrame.__init__(self, data)
return

try:
@phofl (Member), Mar 3, 2023

I'd prefer pointing users to to_pandas for now. This solution does not respect pandas_metadata, which seems unfortunate. It also adds quite significant overhead when pyarrow is not installed (arr is a NumPy array):

%timeit pd.DataFrame(arr)
86.6 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit pd.DataFrame(arr)
7.49 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I'd prefer something like pd.DataFrame.from_arrow or similar

datapythonista (Member, Author)

Interesting finding. I'm not sure it's necessarily a problem, since the overhead is constant: it doesn't grow with the size or type of the data being loaded, and seems to be a constant 80 microseconds (on my machine it takes the same time when pyarrow is not installed, but double yours when it is). I understand the Python interpreter looks in more places when the module is not immediately found, but that's about 6 or 12 times more than when it is found.

In absolute terms a constant 80 microseconds doesn't seem that bad, but since it's 10 times more than a normal use case, it seems like a fair point to avoid. At least until PyArrow is a required dependency (or never, since we should probably deprecate having a constructor that loads everything in favor of a factory approach).

For the metadata, can you clarify which metadata is not being respected?

@phofl (Member), Mar 3, 2023

The following should preserve the index:

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
DataFrame(table)

this returns

   a  __index_level_0__
0  1                  1
1  2                  2
2  3                  3

to_pandas handles this automatically; we have the same problem in read_parquet and friends, which is why I put up #51766

datapythonista (Member, Author)

Yep, very good point. I didn't think about the index; I see there is some more logic in the transformation: https://github.com/apache/arrow/blob/68e89ac9ee5bf2c0c68493eb9405554162fdfab8/python/pyarrow/pandas_compat.py#L797

I guess it doesn't make sense to replicate all that logic in our code, since PyArrow already has it. I guess DataFrame.from_arrow as a wrapper around Table.to_pandas (using the same parameters) is what makes the most sense, no?

Member

Yep, agreed. Users can use set_index and rename if they want to add an index or rename columns anyway, so there's no real need for the constructor functionality

datapythonista (Member, Author)

Now that I think about it, I'm not sure whether it's better to use from_arrow or from_pyarrow. My bet is that another Arrow implementation for Python is likely to be developed eventually. Polars must have reimplemented PyArrow in a way, and whether that code is moved out of Polars, or another project is created, I guess it'll end up happening.

Maybe I'm overthinking, but at that point, is it more natural to have something like DataFrame.from_pyarrow and DataFrame.from_arrow2, or just one DataFrame.from_arrow that supports both? Small preference for the former, but happy to hear other opinions.

@jreback (Contributor) left a comment

this is not in line with other construction methods

in particular the need to import

would be +1 if we eventually make pyarrow a required dependency, but -1 now

@datapythonista datapythonista changed the title ENH: Support Arrow tables in DataFrame constructor ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow Mar 4, 2023
@datapythonista (Member, Author)

Thanks all for the feedback. Changed the approach here, to use DataFrame.from_pyarrow and DataFrame.to_pyarrow.

There are a couple of things that could still be added:

  • Showing those methods in the PyArrow user guide
  • Exposing the parameters in pyarrow.Table.from_pandas to DataFrame.to_pyarrow

Also, there is a test failure, but it seems like an error in assert_frame_equal (or at least the error message doesn't make any sense). It would be great to know there are no major objections to this approach before moving forward with this PR.

1 2
2 3
"""
return table.to_pandas(types_mapper=ArrowDtype)
Contributor

you should do an isinstance check here; e.g. if this is called with a non-Arrow table you would get a hard-to-understand error



class TestPyArrow:
@pytest.mark.parametrize(
Contributor

need a skip if pyarrow is not installed


Notes
-----
The conversion is zero-copy, and the resulting DataFrame uses
Contributor

Not sure that the term "zero-copy" is well understood, so maybe elaborate?

@datapythonista (Member, Author)

Thanks all for the feedback, good points. I'll address the comments if there is consensus to move forward with this. @phofl thinks we could maybe just let people use PyArrow methods instead. I opened #51799 to discuss a new API that would allow PyArrow itself to implement the pandas methods instead.

@phofl (Member) commented Mar 5, 2023

To clarify, I am not sure about this. I just don't know whether these functions add much value, or are more likely to cause confusion.

One benefit is that we can select a good default value for types_mapper, but I can't see much apart from that

@jreback (Contributor) commented Mar 5, 2023

yeah, would be likely to accept your PDEP; this is a bit ad hoc

@datapythonista (Member, Author)

Cool. Let's close this in favor of #51799, which seems a better approach.

Labels: Arrow (pyarrow functionality)
Linked issue: Provide a way to convert Arrow tables to Arrow-backed dataframes
4 participants