Serialization of StructColumn #10765

Closed
madsbk wants to merge 6 commits into branch-22.06 from struct_serialize

Conversation

madsbk
Member

@madsbk madsbk commented May 2, 2022

Fixes #10766

@github-actions github-actions bot added the Python (Affects Python cuDF API) label May 2, 2022
@madsbk madsbk added the 2 - In Progress (Currently a work in progress), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels and removed the Python (Affects Python cuDF API) label May 2, 2022
@github-actions github-actions bot added the Python (Affects Python cuDF API) label May 2, 2022
@codecov

codecov bot commented May 2, 2022

Codecov Report

Merging #10765 (9286aa7) into branch-22.06 (027c34a) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head 9286aa7 differs from pull request most recent head 6726774. Consider uploading reports for the commit 6726774 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10765      +/-   ##
================================================
+ Coverage         86.40%   86.45%   +0.04%     
================================================
  Files               143      143              
  Lines             22444    22473      +29     
================================================
+ Hits              19393    19428      +35     
+ Misses             3051     3045       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 89.49% <100.00%> (+0.05%) ⬆️
python/cudf/cudf/core/column/struct.py 97.22% <100.00%> (+0.79%) ⬆️
python/cudf/cudf/core/dataframe.py 93.74% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 92.91% <0.00%> (+0.83%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 027c34a...6726774.

@madsbk madsbk marked this pull request as ready for review May 2, 2022 19:20
@madsbk madsbk requested a review from a team as a code owner May 2, 2022 19:20
@madsbk madsbk requested review from trxcllnt and bdice May 2, 2022 19:20
Comment on lines 1043 to 1048
if hasattr(self.dtype, "str"):
    # Note that "dtype" must be available for deserialization. Thus, if
    # the dtype doesn't support `str`, or if `str` is insufficient for
    # deserialization, please overwrite the serialize and/or deserialize
    # methods.
    header["dtype"] = self.dtype.str
Contributor

Hmm, we probably need to move this implementation to NumericalColumn.serialize, since each column type already has its own implementation and NumericalColumn is the only one still using ColumnBase.serialize.

Contributor

Same for deserialize

Contributor

That should also get rid of the asymmetry where header["dtype"] is only defined for some types during serialization, but is always expected to exist during deserialization.
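For what it's worth, the reason the `.str` path is enough for numerical columns is that plain numpy dtypes round-trip through their string form; a tiny standalone illustration (not cudf code):

```python
import numpy as np

# What ColumnBase.serialize records today for a numerical column ...
original = np.dtype("float32")
header = {"dtype": original.str}      # e.g. "<f4"

# ... and what deserialize can rebuild from it.
restored = np.dtype(header["dtype"])
assert restored == original
```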

header["dtype"] = StructDtype.deserialize(*header["dtype"])
sub_frame_offset = header["sub-frame-offset"]
children = []
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]):
Contributor

I would try to use longer names here for clarity. I don't see the connection between b and subsets of frames. Does it stand for "buffer"?

Suggested change
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]):
for subheader, subframes in zip(header["sub-headers"], frames[sub_frame_offset:]):

header["dtype"] = self.dtype.serialize()
header["size"] = self.size

header["sub-frame-offset"] = len(frames)
Contributor

Is this serialization format defined in some other reference documentation? I think this deserves some code comments or an external reference to explain what the sub-frame offset means in the serialized format.

Contributor

There doesn't appear to be much detail anywhere; since serialize/deserialize are effectively ad-hoc polymorphic, the interpretation of headers and frames is just left up to the serialize/deserialize pair for each type.

Contributor
@bdice bdice May 3, 2022

Okay, that sounds in line with my understanding of the serialization. I do think that the sub-frame offset is a non-obvious part of the struct serialization method. Just a one-sentence comment would help, if this is accurate:

Suggested change
header["sub-frame-offset"] = len(frames)
# The sub-frame-offset denotes the frame index where parent column
# data ends and child column data begins.
header["sub-frame-offset"] = len(frames)

Comment on lines 169 to 170
header["sub-headers"] = sub_headers
header["frame_count"] = len(frames)
Contributor

Can we be consistent with the key format using dashes or underscores instead of mixing those?

Contributor

This is canonical in that it matches the rest of the cudf source ("frame_count" has meaning outside just this PR), but it's not clear whether there is any rationale for choosing between hyphens and underscores.

Contributor

Almost certainly not, but I don't think this PR needs to fix that problem. We'd welcome any work to improve the consistency and quality of our serialization logic, but it's probably fine to defer that until after we remove the immediate blocker for structs here.

@@ -78,6 +78,13 @@ def test_serialize_struct_dtype(fields):
assert recreated == dtype


def test_serialize_struct():
Contributor

Should we add a nontrivial integration test, e.g. a struct of lists of structs of strings and floats? I think it would make this code much stronger.
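
Something along these lines could serve as that test; this is only a sketch, and the construction of the nested data plus the `serialize()`/`deserialize()` round trip and the `to_arrow()` comparison are assumptions about the available helpers rather than a worked-out patch:

```python
import cudf


def test_serialize_nested_struct():
    # Struct of lists of structs of strings and floats, as suggested above.
    data = [
        {"a": [{"b": "cat", "c": 1.0}], "d": "x"},
        {"a": [{"b": "dog", "c": 2.5}, {"b": "fish", "c": 3.0}], "d": "y"},
    ]
    expect = cudf.Series(data)
    got = type(expect).deserialize(*expect.serialize())
    assert got.to_arrow().equals(expect.to_arrow())
```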

Contributor
@wence- wence- left a comment

Having read a bunch of the serialization code to understand it a little more, I wonder if there's a way to design it such that this style of ad-hoc polymorphism is not needed, and instead the ColumnBase class can serialize/deserialize all column types in a principled way, or if that just turns into yak-shaving.

I think if the ad-hoc polymorphism needs to stay, then the superclass should probably bail out in a sensible way if children are non-empty.
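
A minimal sketch of the kind of guard meant by that last sentence (the exact wording and placement are mine, not a concrete patch):

```python
class ColumnBase:  # simplified stand-in for cudf's ColumnBase
    def serialize(self):
        if self.children:
            # The generic path below only knows how to pack this column's own
            # data and mask buffers; columns with children (struct, list, ...)
            # must override serialize/deserialize themselves.
            raise NotImplementedError(
                f"{type(self).__name__} has children and must override "
                "serialize()"
            )
        ...  # existing generic serialization of data/mask
```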

@vyasr
Contributor

vyasr commented May 4, 2022

> Having read a bunch of the serialization code to understand it a little more, I wonder if there's a way to design it such that this style of ad-hoc polymorphism is not needed, and instead the ColumnBase class can serialize/deserialize all column types in a principled way, or if that just turns into yak-shaving.
>
> I think if the ad-hoc polymorphism needs to stay, then the superclass should probably bail out in a sensible way if children are non-empty.

I think we could definitely write our serialization more generically in ColumnBase. I did some analogous work to clean up the serialization logic across different types of Frame in #9305. At the column level it's a little nastier, and it's probably a task that would make sense to push to a future PR. The three main issues that we would need to address in that PR are:

  • We would need to implement a default method for serializing dtypes that aren't extension dtypes. Part of the dichotomy that we currently observe is because our extension dtypes define the serialize method, whereas numpy/pandas dtypes do not. We don't control all dtype objects, so we'll need a suitably dispatched function for this since we can't rely on duck typing. I believe (but would have to check) that our default serialization of the dtype is just str(dtype), so standardizing this shouldn't be too hard (a rough sketch of such a helper follows this list).
  • The ColumnBase implementation should always serialize children. For data types that don't have children this would be a no-op, but child columns are a fundamental concept in the Arrow model, so we should be able to standardize that.
  • CategoricalColumn is meaningfully different from our other column types because it is not represented by a libcudf dtype, but is instead composed of two separate cuDF Python columns. Unless and until that changes, we would need to retain the special-casing of this class while still hopefully matching the layout of the other column serializations as much as possible.
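
For the first point, one possible shape for that dispatched helper, sketched under the assumption (stated above) that str(dtype) is enough for the dtypes we don't control; the registration for cudf's extension dtypes is commented out because the exact base class to register against is an assumption on my part:

```python
from functools import singledispatch

import numpy as np


@singledispatch
def serialize_dtype(dtype):
    # Default: numpy/pandas dtypes we don't control round-trip via their
    # string form.
    return {"kind": "numpy", "repr": str(dtype)}


# Hypothetical registration for cudf extension dtypes, which already define
# their own serialize() method:
# @serialize_dtype.register(cudf.core.dtypes._BaseDtype)
# def _(dtype):
#     return {"kind": "extension", "repr": dtype.serialize()}


def deserialize_dtype(header):
    if header["kind"] == "numpy":
        return np.dtype(header["repr"])
    ...  # otherwise dispatch back to the extension dtype's deserialize()
```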

@wence-
Contributor

wence- commented May 4, 2022

> I think we could definitely write our serialization more generically in ColumnBase. I did some analogous work to clean up the serialization logic across different types of Frame in #9305. At the column level it's a little nastier, and it's probably a task that would make sense to push to a future PR. The three main issues that we would need to address in that PR are:
>
> • We would need to implement a default method for serializing dtypes that aren't extension dtypes. [...]
> • The ColumnBase implementation should always serialize children. [...]
> • CategoricalColumn is meaningfully different from our other column types because it is not represented by a libcudf dtype, but is instead composed of two separate cuDF Python columns. [...]

I had a go at this in #10784; it seems like categorical columns don't need to be treated specially.

@madsbk
Member Author

madsbk commented May 4, 2022

Closing in favor of #10784

@madsbk madsbk closed this May 4, 2022
@madsbk madsbk deleted the struct_serialize branch August 8, 2022 14:45
Labels
2 - In Progress: Currently a work in progress
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Serialization of StructColumn fails
6 participants