RaggedArray serialization #720

Closed · jonmmease opened this issue Mar 2, 2019 · 3 comments

@jonmmease (Collaborator)

The new RaggedArray pandas ExtensionArray for aggregating variable-length lines was added in #687. One remaining issue is that we don't currently have any support for serializing RaggedArray instances to disk.

Ideally it would be possible to save a pandas or dask DataFrame containing RaggedArrays to a parquet file. Perhaps this could be done using a parquet BYTE_ARRAY column, with some column metadata indicating the extension array type.
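
To make that concrete, here's a rough sketch of the BYTE_ARRAY idea using pyarrow directly; the lines column name and ragged_dtype metadata key are invented for illustration, and this is just a sketch, not an actual implementation:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Two variable-length rows, as a RaggedArray would store them.
ragged_rows = [np.array([0.0, 1.0]), np.array([2.0, 3.0, 4.0])]

# Serialize each row to raw bytes -> a parquet BYTE_ARRAY column.
binary_col = pa.array([row.tobytes() for row in ragged_rows], type=pa.binary())

# Stash the element dtype in the file-level metadata so a reader knows to
# reconstruct ragged arrays rather than plain bytes.
table = pa.table({"lines": binary_col}).replace_schema_metadata(
    {"ragged_dtype": "float64"})
pq.write_table(table, "ragged.parquet")

# Round trip: decode each BYTE_ARRAY cell using the recorded dtype.
read = pq.read_table("ragged.parquet")
dtype = read.schema.metadata[b"ragged_dtype"].decode()
rows = [np.frombuffer(cell.as_py(), dtype=dtype) for cell in read.column("lines")]
```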

@TomAugspurger, have you thought much yet about ExtensionArray serialization support? My guess is that there would need to be a separate serialization approach for different storage formats. It would be nice if we could teach pandas and dask about extension arrays stored as raw BYTE_ARRAYs in parquet files, so that the existing to_parquet/read_parquet methods could be used directly.

cc @jbednar

@TomAugspurger

We haven't made any real progress on serialization: pandas-dev/pandas#20612. Is parquet the most pressing format for you?

It's a moderately hard problem, since some "columns" (e.g. IPArray) may want to be written as multiple columns on disk, and we'll need to work with both engines (pyarrow and fastparquet) to implement this correctly. It's not clear to me who will be responsible for what, but I think the engines should never see an extension array. Rather, pandas would:

  1. Scan the table for any EAs, dispatching to ExtensionArray._to_parquet(), which would return one or more "columns" appropriate for the backend to write (a sketch of this dispatch follows below).
  2. Encode the fact that these columns are actually an extension array in the file metadata, and use that metadata to faithfully round-trip them.
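
To make step 1 concrete, here's a minimal sketch of that dispatch. Both the _to_parquet() hook and the decompose_for_parquet() helper are hypothetical names for the proposal above, not actual pandas APIs:

```python
import pandas as pd

def decompose_for_parquet(df):
    """Split extension-array columns into engine-writable pieces plus
    metadata recording how to reassemble them on read."""
    plain_columns, ea_metadata = {}, {}
    for name, col in df.items():
        if isinstance(col.dtype, pd.api.extensions.ExtensionDtype):
            # Hypothetical hook: the EA returns one or more plain "columns",
            # e.g. {"values": ..., "offsets": ...} for a ragged array.
            for part, values in col.array._to_parquet().items():
                plain_columns[f"{name}.{part}"] = values
            # The engine writes these parts; pandas records the original
            # dtype in the file metadata for a faithful round trip.
            ea_metadata[name] = str(col.dtype)
        else:
            plain_columns[name] = col.to_numpy()
    return plain_columns, ea_metadata
```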

@jorisvandenbossche

For parquet (and just speaking about the pyarrow engine), I have been working on this topic; see also pandas-dev/pandas#20612.

If you define a conversion of your ExtensionArray to an Arrow array type (using __arrow_array__), writing to parquet should already work. In the case of RaggedArray, if I understand the design correctly, it should more or less map directly to a ListArray in Arrow (a flat array of values plus offsets)?
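
As a toy illustration of that mapping (RaggedLike here is a stand-in for this sketch, not datashader's actual RaggedArray):

```python
import numpy as np
import pyarrow as pa

class RaggedLike:
    """Variable-length rows stored as a flat values buffer plus offsets."""
    def __init__(self, flat_values, offsets):
        self.flat_values = np.asarray(flat_values, dtype="float64")
        self.offsets = np.asarray(offsets, dtype="int32")

    def __arrow_array__(self, type=None):
        # pyarrow calls this hook when converting the object to an Arrow
        # array; flat values + offsets map directly onto a ListArray.
        return pa.ListArray.from_arrays(pa.array(self.offsets),
                                        pa.array(self.flat_values))

arr = RaggedLike([0.0, 1.0, 2.0, 3.0, 4.0], offsets=[0, 2, 5])
print(pa.array(arr))  # -> list<double>: [[0, 1], [2, 3, 4]]
```

With that hook in place, the pandas-to-Arrow conversion can produce the list column itself, so the parquet engine never needs to know anything about the extension type.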

Reading the parquet file also works, but converting back to pandas (with the appropriate extension array) is still a work in progress. See pandas-dev/pandas#20612 (comment) (and https://issues.apache.org/jira/browse/ARROW-2428 linked from there) for more details on the current ideas. Feedback on that is very welcome.

@ppwadhwa (Collaborator) commented Feb 8, 2021

I believe that serialization is already supported in current datashader releases.
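
A quick way to check, assuming datashader's RaggedArray (from datashader.datatypes) and the pyarrow parquet engine are installed; whether the extension dtype is preserved on read depends on the installed pandas/pyarrow versions:

```python
import pandas as pd
from datashader.datatypes import RaggedArray

df = pd.DataFrame({"lines": RaggedArray([[0.0, 1.0], [2.0, 3.0, 4.0]])})
df.to_parquet("ragged.parquet", engine="pyarrow")

round_tripped = pd.read_parquet("ragged.parquet", engine="pyarrow")
print(round_tripped.dtypes)  # expect the Ragged extension dtype to survive
```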

ppwadhwa closed this as completed Feb 8, 2021