
[ENH]: serialization schema cleanup #10799

Closed
wence- opened this issue May 5, 2022 · 17 comments
Assignees: wence-
Labels: Python (Affects Python cuDF API), question (Further information is requested)

Comments

@wence-
Contributor

wence- commented May 5, 2022

Follow-up from #10784. Hyphens and underscores are used inconsistently to separate words in metadata keys in serialize; go through and standardise on one choice (hyphens seem more popular).

@wence- wence- self-assigned this May 9, 2022
@wence-
Contributor Author

wence- commented May 9, 2022

While going through and doing the minimal work to add frame_count slots to all serializable objects, a further thought occurred to me: as well as the inconsistency in key names, there is also inconsistency in the metadata schema.

Some of the metadata slots (in particular type-serialized) were picked to align with dask's naming; as far as I can tell this is not the case for the others, and doesn't really need to be, but cc @jakirkham, whom git-annotate suggests might know.

Sometimes properties of nested objects are copied into the parent header, and sometimes not. I think it makes sense to clean this up and settle on a model of:

{"type-serialized": my-type,
  "properties": {attribute: value, nested_object: nested_stuff},
  "frame-count": number_of_frames_consumed_in_deserialization}

Perhaps something like this was considered and rejected?

A much larger change would be to set up all of the serializable objects as dataclass-like things with constructors that just set attributes; then the schema for serialization is completely clear, and lots of this code can be removed. A downside is that serialization is not as extensible in an ad-hoc manner, and I am not sure that all cudf classes can get away with simple default constructors.
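
To make the trade-off concrete, here is a minimal sketch of that idea (purely illustrative: the base class is hypothetical, and buffer-valued fields, which would append to frames, are elided):

import pickle
from dataclasses import dataclass, fields

@dataclass
class PlainSerializable:
    # Hypothetical base class: the dataclass fields *are* the schema, so no
    # per-class serialization code is needed.
    def serialize(self):
        header = {"type-serialized": pickle.dumps(type(self)), "properties": {}}
        frames = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, PlainSerializable):
                sub_header, sub_frames = value.serialize()
                header["properties"][f.name] = sub_header
                frames.extend(sub_frames)
            else:
                header["properties"][f.name] = value
        header["frame-count"] = len(frames)
        return header, frames

@dataclass
class Child(PlainSerializable):
    name: str

@dataclass
class Parent(PlainSerializable):
    child: Child
    count: int

header, frames = Parent(Child("x"), 3).serialize()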

@wence- wence- added the question Further information is requested label May 9, 2022
@bdice
Contributor

bdice commented May 9, 2022

A small consideration that might tip the balance: using underscores makes the keys map directly to valid Python identifiers, e.g.

# Using underscores in the keys:
frame_count = metadata["frame_count"]
type_serialized = metadata["type_serialized"]
# vs. mapping dashes to underscores:
frame_count = metadata["frame-count"]
type_serialized = metadata["type-serialized"]
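
Conversely, if hyphens win, a small (hypothetical) shim restores the direct mapping:

def as_identifiers(metadata):
    # Hypothetical helper: translate hyphenated keys to valid identifiers.
    return {key.replace("-", "_"): value for key, value in metadata.items()}

frame_count = as_identifiers(metadata)["frame_count"]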

@vyasr
Contributor

vyasr commented May 9, 2022

I would definitely welcome more insight from a Dask expert. Some thoughts and questions:

  • Are you proposing that nested headers are always copied into their parents? Does that mean that we always duplicate data?
  • How would nesting work when nested objects that aren't buffers actually return frames? In particular, our CategoricalDtype serializes into both header information and a frame consisting of the categories (just curious if you've considered how this impacts your proposal, because I haven't).
  • You're right that simple dataclass constructors won't work for most of our classes like DataFrame or Series. Those must be constructible from a wide range of objects to match what pandas supports.

It might help to consider whether we could change our classes so that only Frame and BaseIndex are Serializable. ColumnBase would still need its current implementation, but we could simplify dtype and Buffer logic. We shouldn't need non-API classes to conform to the dask.distributed API, and it leads to some incongruities:

  • dtypes include "frames" when serializing even though they don't actually have frames of data. The exception is categoricals, which are tricky because they are the only dtype that stores data on device (the categories). We would probably need a special case to support those.
  • Buffers ultimately just insert "self" into the frames, so every Column could really just insert its underlying buffer into the frames list and call that good. Buffer is an implementation detail of cudf and shouldn't need to conform to the dask API. We really just use the CAI protocol to save and load them anyway. That would remove one additional serialize implementation.
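
A rough sketch of that last point, using a hypothetical column class (attribute names are assumptions, not the current cudf implementation):

class Column:  # hypothetical stand-in for a cudf column
    def __init__(self, data, dtype, size):
        self.data, self.dtype, self.size = data, dtype, size

    def serialize(self):
        # Put the underlying device buffer straight into the frames list
        # rather than delegating to a Buffer.serialize(); the header keeps
        # enough metadata to rebuild/validate the buffer on the other end.
        header = {"dtype": self.dtype.str, "size": self.size, "frame-count": 1}
        return header, [self.data]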

@jakirkham
Member

Dask uses -s (i.e. hyphens)

Not sure I follow what else is being proposed here

@vyasr
Contributor

vyasr commented May 9, 2022

@jakirkham I think the two main questions for you are:

  • Does it matter at all for cudf to use the same field names as dask (such as type-serialized)? We think not, but would like confirmation.
  • What objects actually need to support the serialization protocols? Is it just DataFrame, Series, and all the types of indexes? That is, is there any need for cudf.Buffer or each dtype to support the protocols? Of course, we need a way to serialize that information (i.e. we need to know whether a Series contains strings or ints), but here we're just talking about the necessary APIs for each type of object.

@jakirkham
Member

Yes "type-serialized" matters. It is a special field in Dask

When adding support for cuDF serialization, we found all sorts of objects went over the wire. Any we missed supporting surfaced as errors in benchmarks. So we added them all

I think what I'm missing is what we are trying to fix here

@wence- wence- changed the title [ENH]: Consistently use hyphens in serialize metadata keys [ENH]: serialization schema cleanup May 10, 2022
@wence-
Contributor Author

wence- commented May 10, 2022

I would definitely welcome more insight from a Dask expert. Some thoughts and questions:

  • Are you proposing that nested headers are always copied into their parents? Does that mean that we always duplicate data?

On the contrary. I'm proposing:

header = {"properties": {}}
frames = []
sub_obj = self.child # object we're serializing has a child to be serialized
sub_header, sub_frames = sub_obj.serialize()
header["properties"]["child"] = sub_header
frames.extend(sub_frames)

At the moment, depending on the particular object, sometimes this is done during serialization, and sometimes some information is carried redundantly in the header itself. Moreover, when adding a new slot to the "top-level" header key space, one has to read (or know) non-local code to find out whether there are any reserved keys. For example, Serializable.device_serialize (which calls the class-implemented serialize) overwrites "type-serialized", "is-cuda", and "lengths", while Serializable.host_serialize (called via pickle) additionally overwrites "writeable".
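
Roughly, the wrapping looks like this (a from-memory sketch, so treat the details as approximate):

import pickle

def device_serialize(self):
    header, frames = self.serialize()  # the class-implemented serialize
    # The base class then silently claims these top-level keys, so every
    # implementer of serialize() has to know not to use them:
    header["type-serialized"] = pickle.dumps(type(self))
    header["is-cuda"] = [hasattr(f, "__cuda_array_interface__") for f in frames]
    header["lengths"] = [f.nbytes for f in frames]
    return header, frames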

  • How would nesting work when nested objects that aren't buffers actually return frames? In particular, our CategoricalDtype serializes into both header information and a frame consisting of the categories (just curious if you've considered how this impacts your proposal, because I haven't).

This is not problematic. Deserialization takes a (nested) metadata descriptor and a list of frames, and returns the deserialized object along with whatever remains of the (partially consumed) list of frames. So a helper function:

import pickle

def unpack(header, frames):
    # Rebuild one object from its header, consuming only its own frames and
    # returning the remainder for the caller to continue with.
    typ = pickle.loads(header["type-serialized"])
    count = header["frame-count"]
    obj = typ.deserialize(header, frames[:count])
    return obj, frames[count:]

works to unfold one piece of a nested definition. So, suppose we were deserializing a column with a categorical dtype:

dtype_header = header["properties"]["dtype"]
dtype, frames = unpack(dtype_header, frames)
# continue with deserialization of other properties
  • You're right that simple dataclass constructors won't work for most of our classes like DataFrame or Series. Those must be constructible from a wide range of objects to match what pandas supports.

One way to square that circle (though it would be a big API-breaking change) is to split the munging of data for __init__ out into a free function. That is, the API offers:

def DataFrame(*args, **kwargs):
    # Normalise the many accepted input forms (hypothetical helper):
    processed_args = _munge_arguments(*args, **kwargs)
    return impl.DataFrame(processed_args)

It may well not be worth it, however.

EDIT: that's not possible due to API constraints (as pointed out below by @shwina).

It might help to consider whether we could change our classes so that only Frame and BaseIndex are Serializable. ColumnBase would still need its current implementation, but we could simplify dtype and Buffer logic. We shouldn't need non-API classes to conform to the dask.distributed API, and it leads to some incongruities:

  • dtypes include "frames" when serializing even though they don't actually have frames of data. The exception is categoricals, which are tricky because they are the only dtype that stores data on device (the categories). We would probably need a special case to support those.

The advantage of everything supporting the same interface is that you don't need to do any special-casing: you just recurse, calling serialize until the base case is hit. Without that, any dtype-carrying object that needs to be serialized has to special-case with if isinstance(dtype, CategoricalDtype), I think.
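
Concretely, the contrast looks something like this (a sketch with assumed attribute names):

# With a uniform interface, the parent just recurses, whatever the dtype:
dtype_header, dtype_frames = self.dtype.serialize()

# Without it, every dtype-carrying object repeats the special case:
if isinstance(self.dtype, cudf.CategoricalDtype):
    dtype_header, dtype_frames = self.dtype.serialize()
else:
    dtype_header, dtype_frames = {"dtype": self.dtype.str}, []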

  • Buffers ultimately just insert "self" into the frames, so every Column could really just insert its underlying buffer into the frames list and call that good. Buffer is an implementation detail of cudf and shouldn't need to conform to the dask API. We really just use the CAI protocol to save and load them anyway. That would remove one additional serialize implementation.

I think this would work, since the wire format effectively sends all the frames out of band and then reconstructs on the other end. The column metadata can include enough information to rebuild/validate the buffer.
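
For example, deserialization could rebuild and sanity-check the buffer from the column metadata alone (continuing the hypothetical Column sketch from earlier in the thread):

import numpy as np

class Column:  # hypothetical, matching the earlier serialize() sketch
    def __init__(self, data, dtype, size):
        self.data, self.dtype, self.size = data, dtype, size

    @classmethod
    def deserialize(cls, header, frames):
        # The single frame is the device buffer sent out of band; the header
        # carries enough metadata to validate and reinterpret it.
        (frame,) = frames
        dtype = np.dtype(header["dtype"])
        assert frame.nbytes == header["size"] * dtype.itemsize
        return cls(data=frame, dtype=dtype, size=header["size"])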

@wence-
Contributor Author

wence- commented May 10, 2022

I think what I'm missing is what we are trying to fix here

Initially, I was adding serialization support that was missing for struct columns (that was #10784). As part of that, the schema for the metadata headers seemed a bit inconsistent, so I am looking at whether it is worth investing effort in cleaning it up a bit.

@shwina
Contributor

shwina commented May 11, 2022

Just a drive-by comment here: DataFrame needs to be a type (as does Index). Unfortunately, we cannot make those into factory free-functions.

@wence-
Contributor Author

wence- commented May 11, 2022

Just a drive-by comment here: DataFrame needs to be a type (as does Index). Unfortunately, we cannot make those into factory free-functions.

I'm guessing because code basically relies on isinstance(foo, cudf.DataFrame) and the like?

@shwina
Contributor

shwina commented May 11, 2022

Yes -- and also there are classmethods defined on cudf.DataFrame that are in the public API; e.g., cudf.DataFrame.from_pandas().

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Sep 8, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added Python Affects Python cuDF API. and removed inactive-30d labels Nov 21, 2022
@wence-
Contributor Author

wence- commented Nov 28, 2022

I think anything we do here will need to happen in tandem with the serialisation changes being contemplated in dask/distributed. So I'll revisit this then.

@vyasr
Contributor

vyasr commented May 17, 2024

@wence- has anything changed in dask since the last comment to move the needle here?

@wence-
Contributor Author

wence- commented Jul 31, 2024

I'm a bit out of the loop. I think that they moved to allowing pickle/unpickle. But that doesn't fundamentally change things (since that works with gpu-backed data but necessitates a device-host transfer).

@wence-
Contributor Author

wence- commented Sep 30, 2024

I think this is probably not worth it for the code churn, FWIW.

@wence- wence- closed this as completed Sep 30, 2024