Add serialization methods for List and StructDtype #8441
Merged
rapids-bot merged 10 commits into rapidsai:branch-21.08 from charlesbluca:add-dtype-serialization on Jun 21, 2021
Commits
27ac11d Add serialization methods for list and struct dtypes (charlesbluca)
99192fd Refactor dtype serialization to always use type-serialized (charlesbluca)
52c9054 Use list dtype serialization for list column serialization (charlesbluca)
b899fb2 Add tests for categorical columns, struct dtypes (charlesbluca)
07174e3 Add handling for pyarrow -> cudf decimal type (charlesbluca)
af342ff Have dtypes extend Serializable (charlesbluca)
7cb557a Move struct dtype test (charlesbluca)
b42cb79 Merge remote-tracking branch 'upstream/branch-21.08' into add-dtype-s… (charlesbluca)
74c6661 Merge remote-tracking branch 'upstream/branch-21.08' into add-dtype-s… (charlesbluca)
bf948a1 Relocate serialization functions to correpsonding column tests (charlesbluca)

Something seems weird about this. We're making all of our extension dtypes serializable, but I believe we end up needing to override `serialize` and `deserialize` for all of them (`ListDtype`, `StructDtype`, `CategoricalDtype`). To me that suggests either the parent class needs to be generalized to be able to do at least some of the common work between these child classes, or that this inheritance relationship just isn't quite right.

I am weakly -1 on doing this as part of this PR. Maybe it makes more sense to add the `serialize`/`deserialize` methods in this PR and then, in a separate PR, refactor the common code out either into `Serializable` or into something that goes in between `Serializable` and `_BaseDtype`.

Originally, `Serializable` was an abstract base class, which forced all derived classes to implement `serialize` and `deserialize`. For performance reasons, we disabled that and made it a regular class. Now, derived classes must implement `serialize` and `deserialize`, but that is "only" by convention.

That being said, there's still very much value in inheriting from `Serializable`, as we get the methods `host_serialize`, `device_serialize`, `host_deserialize`, and `device_deserialize` "for free" by the inheritance.

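A rough illustration of the pattern described above may help here. This is only a sketch under stated assumptions, not cudf's actual code: `SerializableSketch`, `ListDtypeSketch`, and the `frame_count` header key are hypothetical names, and the real `host_serialize`/`host_deserialize` helpers do more than this. It shows a plain (non-abstract) base class whose subclasses supply `serialize`/`deserialize` by convention and pick up the host-side helpers through inheritance.

```python
class SerializableSketch:
    """Hypothetical stand-in for a Serializable-style base class."""

    def serialize(self):
        # Not enforced by an ABC: subclasses override this by convention.
        raise NotImplementedError("subclasses implement serialize()")

    @classmethod
    def deserialize(cls, header, frames):
        raise NotImplementedError("subclasses implement deserialize()")

    def host_serialize(self):
        # Inherited helper: builds on whatever serialize() the subclass provides.
        header, frames = self.serialize()
        header["frame_count"] = len(frames)  # hypothetical bookkeeping key
        return header, [bytes(f) for f in frames]

    @classmethod
    def host_deserialize(cls, header, frames):
        if len(frames) != header["frame_count"]:
            raise ValueError("unexpected number of frames")
        return cls.deserialize(header, frames)


class ListDtypeSketch(SerializableSketch):
    """Metadata-only dtype: everything fits in the header, no frames."""

    def __init__(self, element_type):
        self.element_type = element_type

    def serialize(self):
        return {"element_type": self.element_type}, []

    @classmethod
    def deserialize(cls, header, frames):
        return cls(header["element_type"])
```

With this layout, `ListDtypeSketch("int64").host_serialize()` works without the subclass defining `host_serialize` itself, which is the "for free" part of the argument.
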
I think I agree with @brandon-b-miller's objection to making this change, but not the reasons. `Serializable` declares an interface, but leaves it up to subclasses to implement it. Whether or not certain subclasses (e.g. all dtypes) can share parts (or all) of that implementation isn't really relevant to whether or not the inheritance pattern makes sense. All that inheriting from `Serializable` does is indicate that if subclasses implement `serialize` and `deserialize`, it will be possible to do `pickle.dumps(obj)`.

All of the `*_(?:de)?serialize` methods just exist to provide hooks into `Serializable.__reduce_ex__`, the method that actually enables serialization. My issue with using `Serializable` for dtype objects is that these hooks are all predicated on the assumption that a subclass of `Serializable` can be decomposed into some `header` of metadata and a collection of `frames`, which isn't the case for dtypes. If you look at the contents of the methods implemented by `Serializable`, they're encoding a bunch of metadata that IMO isn't really appropriate for a dtype, but rather for typed memory buffers (e.g. the length of the array or whether it's stored in device memory).

That being the case, I think that it would be simpler and more appropriate to directly implement the pickling protocol (ideally via `__getstate__` and `__setstate__`, but if not then via `__reduce*` methods) rather than trying to leverage `Serializable`. To @brandon-b-miller's point, if some of that logic can be shared between dtypes it would also be great to do that by implementing it at the level of `_BaseDtype`.

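For readers unfamiliar with the alternative proposed above, here is a minimal sketch of implementing the pickling protocol directly on a metadata-only dtype. `StructDtypeSketch` and `pickle_roundtrip` are hypothetical names used for illustration; nothing here is taken from cudf, and the real `StructDtype` carries more than a dict of field names.

```python
import pickle


class StructDtypeSketch:
    """Hypothetical metadata-only dtype that pickles via its own state."""

    def __init__(self, fields):
        # Mapping of field name -> type name; purely metadata.
        self.fields = dict(fields)

    def __eq__(self, other):
        return isinstance(other, StructDtypeSketch) and self.fields == other.fields

    def __getstate__(self):
        # Only the metadata is needed; there are no buffers or frames.
        return {"fields": self.fields}

    def __setstate__(self, state):
        self.fields = dict(state["fields"])


def pickle_roundtrip(obj):
    """Serialize and restore an object through the standard pickle hooks."""
    return pickle.loads(pickle.dumps(obj))
```

Calling `pickle_roundtrip(StructDtypeSketch({"a": "int64"}))` would go through `__getstate__` and `__setstate__` without any header/frames decomposition. The trade-off raised in the thread still applies: this covers pickling only, not a frame-based wire protocol.
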
I do see some merit to @brandon-b-miller's point of making subclasses of `Serializable` that generalize some of the common work that's happening in the serialization functions, though I haven't really inspected those functions outside of the dtypes to see if there's a lot of intersection there. Were you thinking of something like `SerializableDtype`, `SerializableFrame`, etc.?

To @vyasr's point, I feel like implementing the pickling protocol for the dtypes themselves could result in redundant code, since it would essentially entail copying `Serializable.__reduce_ex__` in `_BaseDtype`. Is there a downside to having host/device deserialization implemented for dtypes other than the fact that those functions aren't really appropriate?

I also feel like that scenario gives more motivation for making subclasses of `Serializable`, as we could have subclasses that include/exclude the functions we consider inappropriate for their derived classes (such as the host/device serialization).

`Serializable` is less about making objects picklable and more about serializing objects according to the Dask serialization protocol. The `*serialize` methods are absolutely required here in order for dtype objects to be able to be sent efficiently across the wire.

It's true though that most dtypes really are composed only of metadata. The exception is `CategoricalDtype`, which, for compatibility with Pandas, also encapsulates a column of categories (residing on the device).

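To make the distinction concrete, the sketch below splits a categorical-like dtype into a metadata `header` plus one data frame. It is an assumption-laden illustration: the function names are hypothetical, and a host-side `array.array` stands in for the device-resident categories column, which in cudf would be a GPU buffer handled by the column's own serialization.

```python
from array import array


def serialize_categorical(categories, ordered):
    """Split a categorical-like dtype into (header, frames).

    `categories` is an array.array standing in for a device column;
    `ordered` is plain metadata.
    """
    header = {"ordered": ordered, "typecode": categories.typecode}
    frames = [categories.tobytes()]  # the only part that is real data
    return header, frames


def deserialize_categorical(header, frames):
    """Rebuild the categories array and the `ordered` flag."""
    categories = array(header["typecode"])
    categories.frombytes(frames[0])
    return categories, header["ordered"]
```

A list or struct dtype handled the same way would return an empty `frames` list, which is why the header/frames split carries real weight only for the categorical case.
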
Having read a little more I'm comfortable 👍 -ing here.

Got it, I see now that we're registering `Serializable`'s methods to `dask.distributed` in `cudf/comm/serialize.py`. It does seem like we could simplify the specifics of the serialization protocol for dtypes since they are (almost) entirely metadata and not data, but I think moving forward with this approach is fine for now.