Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pickle protocol 5 with NumPy object arrays #3871

Merged

Conversation

jakirkham
Copy link
Member

Depends on PR ( #3868 )

Builds off the work in PR ( #3784 ) to support pickle protocol 5. Supports out-of-band pickling when serializing NumPy object arrays. This can be useful when serializing ragged arrays for example or really any NumPy object arrays containing data that supports out-of-band pickling.

@@ -23,7 +23,10 @@ def itemsize(dt):
def serialize_numpy_ndarray(x):
if x.dtype.hasobject:
header = {"pickle": True}
frames = [pickle.dumps(x)]
frames = [None]
buffer_callback = lambda f: frames.append(memoryview(f))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we converting into a memoryview here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer_callback is called on values of type PickleBuffer. Here's an example:

In [1]: import pickle                                                           

In [2]: import numpy as np                                                      

In [3]: a = np.array([np.arange(3), np.arange(4, 6)], dtype=object)             

In [4]: l = []; d = pickle.dumps(a, protocol=5, buffer_callback=l.append)       

In [5]: l                                                                       
Out[5]: [<pickle.PickleBuffer at 0x115a64740>, <pickle.PickleBuffer at 0x115a647c0>]

Though PickleBuffer is a wrapper that expects to be passed some bytes-like object, which it wraps zero-copy. In the example above these are the nested NumPy arrays.

However as the rest of the serialization code expects frames to be bytes or memoryviews. We would either need to retool it to support PickleBuffer (with special casing when it is not available) or we can simply convert it to a memoryview, which is also zero-copy.

We've gone for the latter path of converting to a memoryview similar to what was done in PR ( #3784 ). Note this is done without loss of generality.

@jakirkham jakirkham force-pushed the use_pickle_protocol_5_numpy_object_array branch from 1c8e168 to 406dd3b Compare June 9, 2020 17:03
@jakirkham jakirkham changed the title WIP: Use pickle protocol 5 with NumPy object arrays Use pickle protocol 5 with NumPy object arrays Jun 9, 2020
@jakirkham jakirkham marked this pull request as ready for review June 9, 2020 17:11
@jakirkham jakirkham force-pushed the use_pickle_protocol_5_numpy_object_array branch from a65c5ef to e07d26b Compare June 9, 2020 18:57
@jakirkham
Copy link
Member Author

This should be ready on my end. Please let me know if anything else is needed 🙂

@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2020

One more dumb question (sorry for drawing this out): how does this work if the user doesn't have a more modern version of Python? If pickle5 isn't installed and we're using an older version of Python then what happens here?

@jakirkham
Copy link
Member Author

jakirkham commented Jun 9, 2020

Not at all. The discussion is welcome. 🙂 Was more noting this is not in draft status any more.

In the case where pickle protocol 5 is not supported, we would skip collecting buffers when calling pickle.dumps or cloudpickle.dumps and would skip calling the buffer_callback (as there would be no buffers). As a result we would have only one frame here, which results from calling pickle.dumps or cloudpickle.dumps as before. When it comes to unpickle the data, we would have the first frame as an argument to loads and an empty list/tuple for buffers so would just calling pickle.loads without buffers.

TL;DR it behaves the same as the code does before this change.

@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2020

Ah, ok I was thrown by seeing pickle.dumps(...) without realizing that this was our wrapping around pickle. Grand. Thanks for the detailed explanation @jakirkham .

@mrocklin mrocklin merged commit 6ee7f89 into dask:master Jun 9, 2020
@jakirkham jakirkham deleted the use_pickle_protocol_5_numpy_object_array branch June 9, 2020 23:25
@jakirkham
Copy link
Member Author

jakirkham commented Jun 9, 2020

Yeah can see how that would be confusing. Yep, we are importing it here.

Of course! Thanks for the review 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants