[ENH]: serialization schema cleanup #10799
Going through and doing the minimum thing to add missing serialization support, some of the metadata slots seemed inconsistent. In particular, sometimes properties of nested objects are copied into the parent header, and sometimes not. I think it makes sense to clean up and have a model of:
Perhaps something like this was considered and rejected? A much larger change would be to set up all of the serializable objects to support the same interface.
A small consideration that might tip the balance: using underscores makes the keys map directly to valid Python identifiers, e.g.

```python
# Using underscores in the keys:
frame_count = metadata["frame_count"]
type_serialized = metadata["type_serialized"]

# vs. mapping dashes to underscores:
frame_count = metadata["frame-count"]
type_serialized = metadata["type-serialized"]
```
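Conversely, if hyphens were standardised on, the mismatch could be papered over at the access site. A minimal sketch, with a hypothetical `normalise_keys` helper (not part of cuDF) that rewrites dash-separated keys into identifier-friendly form:

```python
def normalise_keys(header):
    # Hypothetical helper: rewrite dash-separated header keys to
    # underscore form so they can double as Python identifiers
    # (e.g. for keyword-argument unpacking).
    return {key.replace("-", "_"): value for key, value in header.items()}


header = {"frame-count": 3, "type-serialized": b"..."}
normalised = normalise_keys(header)
assert normalised["frame_count"] == 3
```

This only shifts the inconsistency to one place rather than removing it, which is part of why a single convention is preferable.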
I would definitely welcome more insight from a Dask expert. Some thoughts and questions:
It might help to consider whether we could change our classes so that only
Dask uses hyphenated keys (e.g. "type-serialized"). Not sure I follow what else is being proposed here.
@jakirkham I think the two main questions for you are:
Yes. When adding support for cuDF serialization, we found all sorts of objects went over the wire. Any we missed supporting surfaced as errors in benchmarks. So we added them all. I think what I'm missing is what we are trying to fix here.
On the contrary. I'm proposing:

```python
header = {"properties": {}}
frames = []
sub_obj = self.child  # object we're serializing has a child to be serialized
sub_header, sub_frames = sub_obj.serialize()
header["properties"]["child"] = sub_header
frames.extend(sub_frames)
```

At the moment, depending on the particular object, sometimes this is done during serialization, and sometimes some of the information is carried redundantly in the parent header.
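The nesting model above can be made concrete with toy classes. This is an illustrative sketch only; `Leaf` and `Parent` are hypothetical stand-ins for cuDF objects, not actual cuDF types:

```python
class Leaf:
    """Hypothetical leaf object: serializes to one header and one frame."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def serialize(self):
        header = {"frame_count": 1, "properties": {}}
        return header, [self.payload]


class Parent:
    """Hypothetical parent: nests its child's header under "properties"
    and concatenates the child's frames onto its own flat frame list."""

    def __init__(self, child: Leaf):
        self.child = child

    def serialize(self):
        header = {"properties": {}}
        frames = []
        sub_header, sub_frames = self.child.serialize()
        header["properties"]["child"] = sub_header
        frames.extend(sub_frames)
        header["frame_count"] = len(frames)
        return header, frames


header, frames = Parent(Leaf(b"data")).serialize()
assert header["properties"]["child"]["frame_count"] == 1
assert frames == [b"data"]
```

The key property is that frames stay in one flat list while headers nest, so no metadata is duplicated between parent and child.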
This is not problematic. Deserialization takes a (nested) metadata descriptor and a list of frames and returns a deserialized object and a (partially) consumed list of frames. So a helper function:

```python
def unpack(header, frames):
    typ = pickle.loads(header["type-serialized"])
    count = header["frame_count"]
    obj = typ.deserialize(header, frames[:count])
    return obj, frames[count:]
```

works to unfold part of a nested definition. So suppose we were deserializing a column with a categorical dtype:

```python
dtype_header = header["properties"]["dtype"]
dtype, frames = unpack(dtype_header, frames)
# continue with deserialization of other properties
```
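The `unpack` helper can be exercised end-to-end with a toy class. `Blob` here is hypothetical, standing in for any serializable object; note it deliberately mirrors the helper's mixed "type-serialized"/"frame_count" key spelling, which is the very inconsistency this issue proposes to standardise:

```python
import pickle


class Blob:
    """Hypothetical serializable object carrying one frame."""

    def __init__(self, payload):
        self.payload = payload

    def serialize(self):
        header = {"type-serialized": pickle.dumps(type(self)),
                  "frame_count": 1}
        return header, [self.payload]

    @classmethod
    def deserialize(cls, header, frames):
        return cls(frames[0])


def unpack(header, frames):
    typ = pickle.loads(header["type-serialized"])
    count = header["frame_count"]
    obj = typ.deserialize(header, frames[:count])
    return obj, frames[count:]


# Two objects serialized into one concatenated frame stream:
h1, f1 = Blob(b"a").serialize()
h2, f2 = Blob(b"b").serialize()
frames = f1 + f2

# unpack consumes exactly the first object's frames and returns the rest:
obj, rest = unpack(h1, frames)
assert obj.payload == b"a" and rest == [b"b"]
```

The returned remainder list is what lets callers peel nested properties off a shared frame list one at a time.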
One way to square that circle (though it is a big API-breaking change) is to split the munging of data for
EDIT: that's not possible due to API constraints (as pointed out below by @shwina).
The advantage of everything supporting the same interface is that you don't need to do any special-casing: you just recurse, calling serialize until the base case is hit. If you don't have this, then any dtype-carrying object that needs to be serialized has to special-case the dtype.
I think this would work, since the wire format effectively sends all the frames out of band and then reconstructs on the other end. The column metadata can include enough information to rebuild/validate the buffer.
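As an aside, the standard library already models this split between small metadata and out-of-band frames: pickle protocol 5 collects large buffers separately via `buffer_callback` and reattaches them with `buffers=` on load. A minimal sketch (illustrative only; cuDF's actual wire format is its own header/frames protocol):

```python
import pickle

# A large writable buffer we want to ship without copying it into
# the metadata stream.
payload = bytearray(b"x" * 1024)

# Serialize: out-of-band buffers accumulate in `frames`, while `meta`
# stays a small byte string that merely references them.
frames = []
meta = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=frames.append)

# ...meta and frames travel separately over the wire...

# Deserialize: supply the frames back in the same order.
restored = pickle.loads(meta, buffers=frames)
assert bytes(restored) == bytes(payload)
```

This is the same header-plus-frames shape the thread describes, which is why a uniform recursive serialize interface composes so naturally with it.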
Initially, I was adding support for serialization that was missing on struct columns (that was #10784). As part of that, the schema for the metadata headers seemed a bit inconsistent. So I am looking at whether it is worth investing effort in cleaning that up a bit.
Just a drive-by comment here:
I'm guessing because code basically relies on
Yes -- and also there are classmethods defined on
This issue has been labeled.
I think anything we do here will need to be in tandem with proposed serialisation changes in dask/distributed that are being contemplated. So I'll revisit this then.
@wence- has anything changed in dask since the last comment to move the needle here?
I'm a bit out of the loop. I think they moved to allowing pickle/unpickle. But that doesn't fundamentally change things, since that works with gpu-backed data but necessitates a device-host transfer.
I think this is probably not worth it for the code churn, FWIW.
Followup from #10784. Hyphens and underscores are used inconsistently when separating names in metadata keys in `serialize`; go through and standardise on one choice (hyphens seem more popular).