Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic serialization of all column types #10784

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented May 4, 2022

Prior to this change, not all column types were serializable, with
serialization implemented in an ad-hoc way for each type in turn. The
inheritance is such that we can just implement generic serialization on the
base column class (no subclass has a specialized constructor), so do
that (closes #10766).

To support this implement serialization for all dtypes. These must
individually implement the Serializable interface since in contrast to
columns every dtype has its own constructor (closes #10785).

@github-actions github-actions bot added the Python Affects Python cuDF API. label May 4, 2022
@madsbk madsbk added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 4, 2022
@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #10784 (2171cd7) into branch-22.06 (8d861ce) will decrease coverage by 0.01%.
The diff coverage is 98.40%.

❗ Current head 2171cd7 differs from pull request most recent head 94bc231. Consider uploading reports for the commit 94bc231 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10784      +/-   ##
================================================
- Coverage         86.40%   86.38%   -0.02%     
================================================
  Files               143      143              
  Lines             22448    22442       -6     
================================================
- Hits              19396    19387       -9     
- Misses             3052     3055       +3     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/lists.py 90.68% <ø> (-1.40%) ⬇️
python/cudf/cudf/core/column/string.py 88.66% <ø> (-0.44%) ⬇️
python/cudf/cudf/core/indexed_frame.py 91.70% <ø> (ø)
python/cudf/cudf/core/dataframe.py 93.73% <96.29%> (+0.03%) ⬆️
python/cudf/cudf/core/buffer.py 88.52% <100.00%> (+0.19%) ⬆️
python/cudf/cudf/core/column/categorical.py 89.37% <100.00%> (-0.61%) ⬇️
python/cudf/cudf/core/column/column.py 89.76% <100.00%> (+0.32%) ⬆️
python/cudf/cudf/core/column/decimal.py 90.60% <100.00%> (-0.50%) ⬇️
python/cudf/cudf/core/dtypes.py 97.19% <100.00%> (-0.11%) ⬇️
python/cudf/cudf/testing/_utils.py 94.05% <100.00%> (+0.06%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d861ce...94bc231. Read the comment docs.

python/cudf/cudf/core/dtypes.py Show resolved Hide resolved
python/cudf/cudf/tests/test_serialize.py Show resolved Hide resolved
python/cudf/cudf/tests/test_serialize.py Show resolved Hide resolved
python/cudf/cudf/core/dtypes.py Show resolved Hide resolved
python/cudf/cudf/core/column/column.py Show resolved Hide resolved
python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
wence- added 3 commits May 4, 2022 10:16
We can't defer to the superclass serialization routines since the
closed-ness of the interval will be unhandled. Closes rapidsai#10785.
Will remove the need for ad-hoc handling of dtypes when serializing
columns.
@wence- wence- force-pushed the wence/column-serialize-cleanup branch from c827bf5 to f159f63 Compare May 4, 2022 17:17
@vyasr
Copy link
Contributor

vyasr commented May 5, 2022

@wence- thanks for this! I didn't get around to taking a look today, but I should be able to tomorrow.

Handle children and data types in a principled way, removing need
for special-casing in subclasses.
@wence- wence- force-pushed the wence/column-serialize-cleanup branch from f159f63 to 2cb389a Compare May 5, 2022 08:35
@wence- wence- marked this pull request as ready for review May 5, 2022 08:36
@wence- wence- requested a review from a team as a code owner May 5, 2022 08:36
@wence- wence- requested review from vyasr and skirui-source May 5, 2022 08:36
@wence- wence- changed the title WIP: Column serialize cleanup Column serialize cleanup May 5, 2022
@wence-
Copy link
Contributor Author

wence- commented May 5, 2022

I think this is now ready for review. Some points to note:

  1. The applied "non-breaking" label is incorrect; this changes the header format for serialized columns which means that on-disk pickle files are not loadable across a merge of this branch. I can institute backwards-compat if that is required: what's the policy on how best to do this? If more generally serialization is seen as a public API that needs backwards compatibility it is probably worthwhile versioning the metadata so that it's easy to detect moving through breaking change boundaries.
  2. frame_count is now used everywhere in the header metadata for dtypes (needed for this PR) and columns that are serialized; but not yet in other objects (like dataframes etc...). I think it makes sense to push changes there to a subsequent PR.

@wence- wence- force-pushed the wence/column-serialize-cleanup branch from 2cb389a to 573395e Compare May 5, 2022 09:10
@madsbk
Copy link
Member

madsbk commented May 5, 2022

rerun tests

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Thank you @wence-! I'd let @vyasr take a look before merging.

Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @wence-. My only suggestion is that we rename frame_count to frame-count.

@wence-
Copy link
Contributor Author

wence- commented May 5, 2022

Great work @wence-. My only suggestion is that we rename frame_count to frame-count.

I'd prefer to do this in a separate round that just does renaming (since it would pervade further than just these changes).

@wence-
Copy link
Contributor Author

wence- commented May 5, 2022

I'm just going to flag again (since I do not have permissions to change the labels), that this is a breaking change if the serialize interface is considered to be a stable mechanism for storage across releases. As it stands, someone pickling a dataframe in 22.04 and attempting to load it in 22.06 (assuming this PR makes it in) will not be able to in most cases.

I can add backwards compat and deprecation warnings, but would like to know if that's necessary first...

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor comments here, but everything LGTM and the only open question the backwards compatibility layer. The last couple of times that I changed serialization I introduced a backwards compatibility layer that was removed in the next release. @galipremsagar do you think it was valuable? It shouldn't matter for Dask purposes since this would only help data that is serialized and stored over longer periods of time. Do you think that's worth doing again? My 2c: it's a nice-to-have but it's not critical to have.

python/cudf/cudf/core/dtypes.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dtypes.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dtypes.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_serialize.py Show resolved Hide resolved
@galipremsagar
Copy link
Contributor

Some very minor comments here, but everything LGTM and the only open question the backwards compatibility layer. The last couple of times that I changed serialization I introduced a backwards compatibility layer that was removed in the next release. @galipremsagar do you think it was valuable? It shouldn't matter for Dask purposes since this would only help data that is serialized and stored over longer periods of time. Do you think that's worth doing again? My 2c: it's a nice-to-have but it's not critical to have.

I think this one is okay to be a breaking change. Pickling & un-pickling is the only thing that will probably break and that is not guaranteed across versions in general, though it's the opposite for IO readers and writers. So okay to go ahead and not have a backward compatible change here.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vyasr vyasr added breaking Breaking change 5 - Ready to Merge Testing and reviews complete, ready to merge and removed non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels May 6, 2022
@vyasr
Copy link
Contributor

vyasr commented May 6, 2022

@wence- in addition to my suggestions for the assertions, my only other request is that you update the PR description. That description goes into the commit message, and since you are not planning to address the frame-count question in this PR it would be better to remove that checklist before we merge.

@wence- wence- changed the title Column serialize cleanup Generic serialization of all column types May 6, 2022
@wence-
Copy link
Contributor Author

wence- commented May 6, 2022

@wence- in addition to my suggestions for the assertions, my only other request is that you update the PR description. That description goes into the commit message, and since you are not planning to address the frame-count question in this PR it would be better to remove that checklist before we merge.

Thanks, done.

@vyasr
Copy link
Contributor

vyasr commented May 6, 2022

rerun tests

@madsbk
Copy link
Member

madsbk commented May 10, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9def28c into rapidsai:branch-22.06 May 10, 2022
@wence- wence- deleted the wence/column-serialize-cleanup branch May 10, 2022 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Can't round-trip IntervalDtype through serialization [BUG] Serialization of StructColumn fails
5 participants