RaggedArray serialization #720

Closed · jonmmease opened this issue Mar 2, 2019 · 3 comments

@jonmmease (Collaborator)

The new RaggedArray pandas ExtensionArray for aggregating variable-length lines was added in #687. One remaining issue is that we don't currently have any support for serializing RaggedArray instances to disk.

Ideally it would be possible to save a pandas or dask DataFrame containing RaggedArrays to a parquet file. Perhaps this could be done using a parquet BYTE_ARRAY column, with some column metadata indicating the extension array type.
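
To make that concrete, here's a rough sketch of the BYTE_ARRAY idea using pyarrow directly; the lines column name and ragged_dtype metadata key are invented for illustration, and this is just a sketch, not an actual implementation:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Two variable-length rows, as a RaggedArray would store them.
ragged_rows = [np.array([0.0, 1.0]), np.array([2.0, 3.0, 4.0])]

# Serialize each row to raw bytes -> a parquet BYTE_ARRAY column.
binary_col = pa.array([row.tobytes() for row in ragged_rows], type=pa.binary())

# Stash the element dtype in the file-level metadata so a reader knows to
# reconstruct ragged arrays rather than plain bytes.
table = pa.table({"lines": binary_col}).replace_schema_metadata(
    {"ragged_dtype": "float64"})
pq.write_table(table, "ragged.parquet")

# Round trip: decode each BYTE_ARRAY cell using the recorded dtype.
read = pq.read_table("ragged.parquet")
dtype = read.schema.metadata[b"ragged_dtype"].decode()
rows = [np.frombuffer(cell.as_py(), dtype=dtype) for cell in read.column("lines")]
```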

@TomAugspurger, have you thought much yet about ExtensionArray serialization support? My guess is that there would need to be a separate serialization approach for different storage formats. It would be nice if we could teach pandas and dask about extension arrays stored as raw BYTE_ARRAYs in parquet files, so that the existing to_parquet/read_parquet methods could be used directly.

cc @jbednar

@TomAugspurger

We haven't made any real progress on serialization: pandas-dev/pandas#20612. Is parquet the most pressing format for you?

It's a moderately hard problem, since some "columns" (e.g. IPArray) may want to be written as multiple columns on disk, and we'll need to work with both engines (pyarrow and fastparquet) to implement this correctly. It's not clear to me who will be responsible for what, but I think the engines should never see an extension array. Rather, pandas would:

  1. Scan the table for any EAs, dispatching to ExtensionArray._to_parquet(), which would return one or more "columns" appropriate for the backend to write (a sketch of this dispatch follows below).
  2. Encode the fact that these columns are actually an extension array in the file metadata, and use that metadata to faithfully round-trip them.
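
To make step 1 concrete, here's a minimal sketch of that dispatch. Both the _to_parquet() hook and the decompose_for_parquet() helper are hypothetical names for the proposal above, not actual pandas APIs:

```python
import pandas as pd

def decompose_for_parquet(df):
    """Split extension-array columns into engine-writable pieces plus
    metadata recording how to reassemble them on read."""
    plain_columns, ea_metadata = {}, {}
    for name, col in df.items():
        if isinstance(col.dtype, pd.api.extensions.ExtensionDtype):
            # Hypothetical hook: the EA returns one or more plain "columns",
            # e.g. {"values": ..., "offsets": ...} for a ragged array.
            for part, values in col.array._to_parquet().items():
                plain_columns[f"{name}.{part}"] = values
            # The engine writes these parts; pandas records the original
            # dtype in the file metadata for a faithful round trip.
            ea_metadata[name] = str(col.dtype)
        else:
            plain_columns[name] = col.to_numpy()
    return plain_columns, ea_metadata
```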

@jorisvandenbossche

For parquet (and just speaking about the pyarrow engine), I have been working on this topic; see also pandas-dev/pandas#20612.

If you define a conversion of your ExtensionArray to an Arrow array type (using __arrow_array__), writing to parquet should already work. In the case of RaggedArray, if I understand the design correctly, it should more or less map directly to a ListArray in Arrow (a flat array of values plus offsets)?
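
As a toy illustration of that mapping (RaggedLike here is a stand-in for this sketch, not datashader's actual RaggedArray):

```python
import numpy as np
import pyarrow as pa

class RaggedLike:
    """Variable-length rows stored as a flat values buffer plus offsets."""
    def __init__(self, flat_values, offsets):
        self.flat_values = np.asarray(flat_values, dtype="float64")
        self.offsets = np.asarray(offsets, dtype="int32")

    def __arrow_array__(self, type=None):
        # pyarrow calls this hook when converting the object to an Arrow
        # array; flat values + offsets map directly onto a ListArray.
        return pa.ListArray.from_arrays(pa.array(self.offsets),
                                        pa.array(self.flat_values))

arr = RaggedLike([0.0, 1.0, 2.0, 3.0, 4.0], offsets=[0, 2, 5])
print(pa.array(arr))  # -> list<double>: [[0, 1], [2, 3, 4]]
```

With that hook in place, the pandas-to-Arrow conversion can produce the list column itself, so the parquet engine never needs to know anything about the extension type.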

Reading the parquet file also works, but converting back to pandas (with the appropriate extension array) is still a work in progress. See pandas-dev/pandas#20612 (comment) (and https://issues.apache.org/jira/browse/ARROW-2428 linked from there) for more details on the current ideas. Feedback on that is very welcome.

@ppwadhwa (Collaborator) commented Feb 8, 2021

I believe that serialization is already supported in current datashader releases.
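
A quick way to check, assuming datashader's RaggedArray (from datashader.datatypes) and the pyarrow parquet engine are installed; whether the extension dtype is preserved on read depends on the installed pandas/pyarrow versions:

```python
import pandas as pd
from datashader.datatypes import RaggedArray

df = pd.DataFrame({"lines": RaggedArray([[0.0, 1.0], [2.0, 3.0, 4.0]])})
df.to_parquet("ragged.parquet", engine="pyarrow")

round_tripped = pd.read_parquet("ragged.parquet", engine="pyarrow")
print(round_tripped.dtypes)  # expect the Ragged extension dtype to survive
```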

ppwadhwa closed this as completed Feb 8, 2021