Serialization of StructColumn #10765

Closed
madsbk wants to merge 6 commits into branch-22.06 from struct_serialize

Conversation

madsbk
Member

@madsbk madsbk commented May 2, 2022

Fixes #10766

@github-actions github-actions bot added the Python (Affects Python cuDF API) label May 2, 2022
@madsbk madsbk added the 2 - In Progress (Currently a work in progress), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels and removed the Python (Affects Python cuDF API) label May 2, 2022
@github-actions github-actions bot added the Python (Affects Python cuDF API) label May 2, 2022
@codecov

codecov bot commented May 2, 2022

Codecov Report

Merging #10765 (9286aa7) into branch-22.06 (027c34a) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head 9286aa7 differs from pull request most recent head 6726774. Consider uploading reports for the commit 6726774 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10765      +/-   ##
================================================
+ Coverage         86.40%   86.45%   +0.04%     
================================================
  Files               143      143              
  Lines             22444    22473      +29     
================================================
+ Hits              19393    19428      +35     
+ Misses             3051     3045       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 89.49% <100.00%> (+0.05%) ⬆️
python/cudf/cudf/core/column/struct.py 97.22% <100.00%> (+0.79%) ⬆️
python/cudf/cudf/core/dataframe.py 93.74% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 92.91% <0.00%> (+0.83%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 027c34a...6726774.

@madsbk madsbk marked this pull request as ready for review May 2, 2022 19:20
@madsbk madsbk requested a review from a team as a code owner May 2, 2022 19:20
@madsbk madsbk requested review from trxcllnt and bdice May 2, 2022 19:20
Comment on lines 1043 to 1048
if hasattr(self.dtype, "str"):
    # Note that "dtype" must be available for deserialization. Thus, if
    # the dtype doesn't support `str`, or if `str` is insufficient for
    # deserialization, please overwrite the serialize and/or deserialize
    # methods.
    header["dtype"] = self.dtype.str
Contributor

Hmm, we probably need to move this implementation to NumericalColumn.serialize, since each column type already has its own implementation and NumericalColumn is the only one still using ColumnBase.serialize.

Contributor

Same for deserialize

Contributor

That should also get rid of the asymmetry where header["dtype"] is only defined for some types during serialization, but is always expected to exist during deserialization.
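For what it's worth, the reason the `.str` path is enough for numerical columns is that plain numpy dtypes round-trip through their string form; a tiny standalone illustration (not cudf code):

```python
import numpy as np

# What ColumnBase.serialize records today for a numerical column ...
original = np.dtype("float32")
header = {"dtype": original.str}      # e.g. "<f4"

# ... and what deserialize can rebuild from it.
restored = np.dtype(header["dtype"])
assert restored == original
```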

header["dtype"] = StructDtype.deserialize(*header["dtype"])
sub_frame_offset = header["sub-frame-offset"]
children = []
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]):
Contributor

I would try to use longer names here for clarity. I don't see the connection between b and subsets of frames. Does it stand for "buffer"?

Suggested change
for h, b in zip(header["sub-headers"], frames[sub_frame_offset:]):
for subheader, subframes in zip(header["sub-headers"], frames[sub_frame_offset:]):

header["dtype"] = self.dtype.serialize()
header["size"] = self.size

header["sub-frame-offset"] = len(frames)
Contributor

Is this serialization format defined in some other reference documentation? I think this deserves some code comments or an external reference to explain what the sub-frame offset means in the serialized format.

Contributor

There doesn't appear to be much detail anywhere; since serialize/deserialize are effectively ad-hoc polymorphic, the interpretation of headers and frames is just left up to the serialize/deserialize pair for each type.

Contributor
@bdice bdice May 3, 2022

Okay, that sounds in line with my understanding of the serialization. I do think that the sub-frame offset is a non-obvious part of the struct serialization method. Just a one-sentence comment would help, if this is accurate:

Suggested change
header["sub-frame-offset"] = len(frames)
# The sub-frame-offset denotes the frame index where parent column
# data ends and child column data begins.
header["sub-frame-offset"] = len(frames)

Comment on lines 169 to 170
header["sub-headers"] = sub_headers
header["frame_count"] = len(frames)
Contributor

Can we be consistent with the key format using dashes or underscores instead of mixing those?

Contributor

This is canonical in that it matches the rest of the cudf source ("frame_count" has meaning outside just this PR), but it's not clear whether there is any rationale for choosing between hyphens and underscores.

Contributor

Almost certainly not, but I don't think this PR needs to fix that problem. We'd welcome any work to improve the consistency and quality of our serialization logic, but it's probably fine to defer that until after we remove the immediate blocker for structs here.

@@ -78,6 +78,13 @@ def test_serialize_struct_dtype(fields):
assert recreated == dtype


def test_serialize_struct():
Contributor

Should we add a nontrivial integration test, e.g. a struct of lists of structs of strings and floats? I think it would make this code much stronger.
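
Something along these lines could serve as that test; this is only a sketch, and the construction of the nested data plus the `serialize()`/`deserialize()` round trip and the `to_arrow()` comparison are assumptions about the available helpers rather than a worked-out patch:

```python
import cudf


def test_serialize_nested_struct():
    # Struct of lists of structs of strings and floats, as suggested above.
    data = [
        {"a": [{"b": "cat", "c": 1.0}], "d": "x"},
        {"a": [{"b": "dog", "c": 2.5}, {"b": "fish", "c": 3.0}], "d": "y"},
    ]
    expect = cudf.Series(data)
    got = type(expect).deserialize(*expect.serialize())
    assert got.to_arrow().equals(expect.to_arrow())
```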

Contributor
@wence- wence- left a comment

Having read a bunch of the serialization code to understand it a little more, I wonder if there's a way to design it such that this style of ad-hoc polymorphism is not needed, and instead the ColumnBase class can serialize/deserialize all column types in a principled way, or if that just turns into yak-shaving.

I think if the ad-hoc polymorphism needs to stay, then the superclass should probably bail out in a sensible way if children are non-empty.
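
A minimal sketch of the kind of guard meant by that last sentence (the exact wording and placement are mine, not a concrete patch):

```python
class ColumnBase:  # simplified stand-in for cudf's ColumnBase
    def serialize(self):
        if self.children:
            # The generic path below only knows how to pack this column's own
            # data and mask buffers; columns with children (struct, list, ...)
            # must override serialize/deserialize themselves.
            raise NotImplementedError(
                f"{type(self).__name__} has children and must override "
                "serialize()"
            )
        ...  # existing generic serialization of data/mask
```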

@vyasr
Contributor

vyasr commented May 4, 2022

> Having read a bunch of the serialization code to understand it a little more, I wonder if there's a way to design it such that this style of ad-hoc polymorphism is not needed, and instead the ColumnBase class can serialize/deserialize all column types in a principled way, or if that just turns into yak-shaving.
>
> I think if the ad-hoc polymorphism needs to stay, then the superclass should probably bail out in a sensible way if children are non-empty.

I think we could definitely write our serialization more generically in ColumnBase. I did some analogous work to clean up the serialization logic across different types of Frame in #9305. At the column level it's a little nastier, and it's probably a task that would make sense to push to a future PR. The three main issues that we would need to address in that PR are:

  • We would need to implement a default method for serializing dtypes that aren't extension dtypes. Part of the dichotomy that we currently observe is because our extension dtypes define the serialize method, whereas numpy/pandas dtypes do not. We don't control all dtype objects, so we'll need a suitably dispatched function for this since we can't rely on duck typing. I believe (but would have to check) that our default serialization of the dtype is just str(dtype), so standardizing this shouldn't be too hard (a rough sketch of such a helper follows this list).
  • The ColumnBase implementation should always serialize children. For data types that don't have children this would be a no-op, but child columns are a fundamental concept in the Arrow model, so we should be able to standardize that.
  • CategoricalColumn is meaningfully different from our other column types because it is not represented by a libcudf dtype, but is instead composed of two separate cuDF Python columns. Unless and until that changes, we would need to retain the special-casing of this class while still hopefully matching the layout of the other column serializations as much as possible.
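
For the first point, one possible shape for that dispatched helper, sketched under the assumption (stated above) that str(dtype) is enough for the dtypes we don't control; the registration for cudf's extension dtypes is commented out because the exact base class to register against is an assumption on my part:

```python
from functools import singledispatch

import numpy as np


@singledispatch
def serialize_dtype(dtype):
    # Default: numpy/pandas dtypes we don't control round-trip via their
    # string form.
    return {"kind": "numpy", "repr": str(dtype)}


# Hypothetical registration for cudf extension dtypes, which already define
# their own serialize() method:
# @serialize_dtype.register(cudf.core.dtypes._BaseDtype)
# def _(dtype):
#     return {"kind": "extension", "repr": dtype.serialize()}


def deserialize_dtype(header):
    if header["kind"] == "numpy":
        return np.dtype(header["repr"])
    ...  # otherwise dispatch back to the extension dtype's deserialize()
```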

@wence-
Contributor

wence- commented May 4, 2022

> I think we could definitely write our serialization more generically in ColumnBase. I did some analogous work to clean up the serialization logic across different types of Frame in #9305. At the column level it's a little nastier, and it's probably a task that would make sense to push to a future PR. The three main issues that we would need to address in that PR are:
>
> • We would need to implement a default method for serializing dtypes that aren't extension dtypes. [...]
> • The ColumnBase implementation should always serialize children. [...]
> • CategoricalColumn is meaningfully different from our other column types because it is not represented by a libcudf dtype, but is instead composed of two separate cuDF Python columns. [...]

I had a go at this in #10784; it seems like categorical columns don't need to be treated specially.

@madsbk
Member Author

madsbk commented May 4, 2022

Closing in favor of #10784

@madsbk madsbk closed this May 4, 2022
@madsbk madsbk deleted the struct_serialize branch August 8, 2022 14:45
Labels
2 - In Progress: Currently a work in progress
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Serialization of StructColumn fails
6 participants