RaggedArray serialization #720
We haven't made any real progress on serialization: pandas-dev/pandas#20612. Is parquet the most pressing format for you? It's a moderately hard problem, since some "columns" (e.g. `IPArray`) may want to be written as multiple columns on disk. And we'll need to work with both engines (pyarrow and fastparquet) to implement this correctly. It's not clear to me who will be responsible for what, but I think that the engines should never see an extension array; rather, pandas would convert the extension array into something the engines understand before handing it off.
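The multi-column point above can be sketched without any real `IPArray`: a 128-bit address does not fit a single 64-bit parquet column, so a writer would need to split each value into two physical columns and rejoin them on read. The helper names below are made up purely for illustration.

```python
# Hypothetical illustration (not cyberpandas' actual code): an extension
# array holding 128-bit IP addresses would want to be written as *two*
# 64-bit columns on disk, then recombined when reading back.
def split_ip(value):
    """Split a 128-bit integer address into (hi, lo) 64-bit halves."""
    return value >> 64, value & ((1 << 64) - 1)

def join_ip(hi, lo):
    """Recombine the two 64-bit halves into the original 128-bit value."""
    return (hi << 64) | lo

addr = (2001 << 112) | 1  # a made-up 128-bit-sized integer
hi, lo = split_ip(addr)
assert join_ip(hi, lo) == addr
```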
For parquet (and just speaking about the pyarrow engine), I have been working on this topic; see also pandas-dev/pandas#20612. If you define a conversion of your `ExtensionArray` to an arrow array type, writing should already be possible. Reading the parquet file also works, but converting back to pandas (with the appropriate extension array) is still work in progress. See pandas-dev/pandas#20612 (comment) (and https://issues.apache.org/jira/browse/ARROW-2428 linked from there) for more details on the current ideas. Feedback on that is very welcome.
I believe that serialization is already supported in current datashader releases.
The new `RaggedArray` pandas `ExtensionArray` for aggregating variable-length lines was added in #687. One remaining issue is that we don't currently have any support for serializing `RaggedArray` instances to disk.

Ideally it would be possible to save a pandas or dask `DataFrame` containing `RaggedArray`s to a parquet file. Perhaps this could be done using a parquet `BYTE_ARRAY` column, with some column metadata indicating the extension array type.

@TomAugspurger, have you thought much yet about `ExtensionArray` serialization support? My guess is that there would need to be a separate serialization approach for each storage format. It would be nice if we could teach pandas and dask about extension arrays stored as raw `BYTE_ARRAY`s in parquet files, so that the existing `to_parquet`/`read_parquet` methods could be used directly.

cc @jbednar
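A minimal, dependency-free sketch of the idea above: a `RaggedArray`-style column can be flattened into two ordinary columns (the concatenated values plus per-row start offsets) that any existing parquet writer can already store, with the extension type recorded separately in column metadata. The helper names are hypothetical, not datashader's actual implementation.

```python
# Illustrative sketch: flatten a ragged column into writer-friendly buffers.
def flatten_ragged(rows):
    """Return (flat values, offsets) for a list of variable-length rows."""
    flat, offsets = [], [0]
    for row in rows:
        flat.extend(row)
        offsets.append(len(flat))
    return flat, offsets

def unflatten_ragged(flat, offsets):
    """Rebuild the variable-length rows from (flat values, offsets)."""
    return [flat[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

rows = [[0.0, 1.0], [2.0], [3.0, 4.0, 5.0]]
flat, offsets = flatten_ragged(rows)
assert flat == [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
assert offsets == [0, 2, 3, 6]
assert unflatten_ragged(flat, offsets) == rows
```

This flat-values-plus-offsets layout is also how parquet's own list encoding and Arrow's variable-size list arrays represent nested data, which is why the mapping is natural.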