-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialization of StructColumn
#10765
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10765 +/- ##
================================================
+ Coverage 86.40% 86.45% +0.04%
================================================
Files 143 143
Lines 22444 22473 +29
================================================
+ Hits 19393 19428 +35
+ Misses 3051 3045 -6
Continue to review full report at Codecov.
|
Co-authored-by: Ashwin Srinath <[email protected]>
if hasattr(self.dtype, "str"): | ||
# Notice, "dtype" must be availabe for deserialization thus | ||
# if the dtype doesn't support `str` or if it is insufficient | ||
# for deserialization, please overwrite the serialize and/or | ||
# deserialize methods. | ||
header["dtype"] = self.dtype.str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we probably need to move this implementation to NumericalColumn.serialize
since each column type is already having their own implementation and it is just NumericalColumn
that is using ColumnBase.serialize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for deserialize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should also get rid of the asymmetry where header["dtype"]
is only defined for some types during serialization, but is always expected to exist during deserialization.
header["dtype"] = StructDtype.deserialize(*header["dtype"]) | ||
sub_frame_offset = header["sub-frame-offset"] | ||
children = [] | ||
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try to use longer names here for clarity. I don't see the connection between b
and subsets of frames. Does it stand for "buffer"?
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]): | |
for subheader, subframes in zip(header["sub-headers"], frames[sub_frame_offset:]): |
header["dtype"] = self.dtype.serialize() | ||
header["size"] = self.size | ||
|
||
header["sub-frame-offset"] = len(frames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this serialization format defined in some other reference documentation? I think this deserves some code comments or an external reference to explain what the sub-frame offset means in the serialized format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't appear to be much detail, since serialize/deserialize
are effectively just ad-hoc polymorphic the interpretation of headers and frames is just left up to the pairs for each type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that sounds in line with my understanding of the serialization. I do think that the sub-frame offset is a non-obvious part of the struct serialization method. Just a one-sentence comment would help, if this is accurate:
header["sub-frame-offset"] = len(frames) | |
# The sub-frame-offset denotes the frame index where parent column | |
# data ends and child column data begins. | |
header["sub-frame-offset"] = len(frames) |
header["sub-headers"] = sub_headers | ||
header["frame_count"] = len(frames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be consistent with the key format using dashes or underscores instead of mixing those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is canonical in matching the rest of the cudf source ("frame_count"
has some meaning outside just this PR), but it's not clear whether there is rationale between hyphens or underscores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost certainly not, but I don't think this PR needs to fix that problem. I think we'd welcome any work to improve the consistency and quality of our serialization logic, but probably fine to push that until we remove the immediate blocker for structs here.
@@ -78,6 +78,13 @@ def test_serialize_struct_dtype(fields): | |||
assert recreated == dtype | |||
|
|||
|
|||
def test_serialize_struct(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a nontrivial integration test, e.g. a struct of lists of structs of strings and floats? I think it would make this code much stronger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having read a bunch of the serialization code to understand it a little more, I wonder if there's a way to design it such that this style of ad-hoc polymorphism is not needed, and instead the ColumnBase
class can serialize/deserialize all column types in a principled way, or if that just turns into yak-shaving.
I think if the ad-hoc polymorphism needs to stay, then the superclass should probably bail out in a sensible way if children
are non-empty.
header["sub-headers"] = sub_headers | ||
header["frame_count"] = len(frames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is canonical in matching the rest of the cudf source ("frame_count"
has some meaning outside just this PR), but it's not clear whether there is rationale between hyphens or underscores.
header["dtype"] = self.dtype.serialize() | ||
header["size"] = self.size | ||
|
||
header["sub-frame-offset"] = len(frames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't appear to be much detail, since serialize/deserialize
are effectively just ad-hoc polymorphic the interpretation of headers and frames is just left up to the pairs for each type.
I think we could definitely write our serialization more generically in
|
I had a go at this in #10784, it seems like categorical columns don't need to be treated specially. |
Closing in favor of #10784 |
Fixes #10766