Generic serialization of all column types #10784
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10784 +/- ##
================================================
- Coverage 86.40% 86.38% -0.02%
================================================
Files 143 143
Lines 22448 22442 -6
================================================
- Hits 19396 19387 -9
- Misses 3052 3055 +3
We can't defer to the superclass serialization routines since the closed-ness of the interval will be unhandled. Closes rapidsai#10785.
Will remove the need for ad-hoc handling of dtypes when serializing columns.
c827bf5 to f159f63
@wence- thanks for this! I didn't get around to taking a look today, but I should be able to tomorrow. |
Handle children and data types in a principled way, removing need for special-casing in subclasses.
f159f63 to 2cb389a
I think this is now ready for review. Some points to note:
|
2cb389a to 573395e
rerun tests |
Great work @wence-. My only suggestion is that we rename frame_count to frame-count.
I'd prefer to do this in a separate round that just does renaming (since it would pervade further than just these changes). |
I'm just going to flag again (since I do not have permissions to change the labels), that this is a breaking change if the serialize interface is considered to be a stable mechanism for storage across releases. As it stands, someone pickling a dataframe in 22.04 and attempting to load it in 22.06 (assuming this PR makes it in) will not be able to in most cases. I can add backwards compat and deprecation warnings, but would like to know if that's necessary first... |
Some very minor comments here, but everything LGTM and the only open question is the backwards compatibility layer. The last couple of times that I changed serialization I introduced a backwards compatibility layer that was removed in the next release. @galipremsagar do you think it was valuable? It shouldn't matter for Dask purposes since this would only help data that is serialized and stored over longer periods of time. Do you think that's worth doing again? My 2c: it's a nice-to-have but it's not critical to have.
I think this one is okay to be a breaking change. Pickling & un-pickling is the only thing that will probably break and that is not guaranteed across versions in general, though it's the opposite for IO readers and writers. So okay to go ahead and not have a backward compatible change here. |
LGTM
@wence- in addition to my suggestions for the assertions, my only other request is that you update the PR description. That description goes into the commit message, and since you are not planning to address the |
Thanks, done. |
rerun tests |
@gpucibot merge |
Prior to this change, not all column types were serializable, with
serialization implemented in an ad-hoc way for each type in turn. The
inheritance is such that we can just implement generic serialization on the
base column class (no subclass has a specialized constructor), so do
that (closes #10766).
To support this, implement serialization for all dtypes. These must
individually implement the Serializable interface since, in contrast to
columns, every dtype has its own constructor (closes #10785).
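The shape of the change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `REGISTRY` lookup, the `IntDtype`, `IntervalDtype`, `Column`, and `IntervalColumn` classes, and the header keys other than `frame_count` are all hypothetical stand-ins. It shows one generic `serialize`/`deserialize` pair on the base column class that handles the dtype and children uniformly, while each dtype implements its own pair because its constructor takes different arguments.

```python
from typing import Any, Dict, List, Tuple

Header = Dict[str, Any]
Frames = List[bytes]
REGISTRY: Dict[str, type] = {}  # hypothetical name -> class lookup


def register(cls):
    """Record a class under its name so a header can refer to it."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class IntDtype:
    """Hypothetical dtype: constructor takes a bit width."""

    def __init__(self, bits: int):
        self.bits = bits

    def serialize(self) -> Tuple[Header, Frames]:
        return {"type": "IntDtype", "bits": self.bits}, []

    @classmethod
    def deserialize(cls, header: Header, frames: Frames) -> "IntDtype":
        return cls(header["bits"])


@register
class IntervalDtype:
    """Hypothetical dtype: constructor also needs the closed-ness."""

    def __init__(self, subtype: IntDtype, closed: str):
        self.subtype, self.closed = subtype, closed

    def serialize(self) -> Tuple[Header, Frames]:
        sub_header, sub_frames = self.subtype.serialize()
        header = {"type": "IntervalDtype", "closed": self.closed,
                  "subtype": sub_header}
        return header, sub_frames

    @classmethod
    def deserialize(cls, header: Header, frames: Frames) -> "IntervalDtype":
        sub = header["subtype"]
        return cls(REGISTRY[sub["type"]].deserialize(sub, frames),
                   header["closed"])


@register
class Column:
    """Base column: subclasses share this constructor signature."""

    def __init__(self, data: bytes, dtype, children=()):
        self.data, self.dtype, self.children = data, dtype, tuple(children)

    def serialize(self) -> Tuple[Header, Frames]:
        dtype_header, dtype_frames = self.dtype.serialize()
        frames = list(dtype_frames)
        header: Header = {"type": type(self).__name__, "dtype": dtype_header,
                          "dtype_frame_count": len(frames), "children": []}
        frames.append(self.data)
        for child in self.children:  # children are columns too, so recurse
            child_header, child_frames = child.serialize()
            header["children"].append(child_header)
            frames.extend(child_frames)
        header["frame_count"] = len(frames)
        return header, frames

    @classmethod
    def deserialize(cls, header: Header, frames: Frames) -> "Column":
        n = header["dtype_frame_count"]
        dtype_header = header["dtype"]
        dtype = REGISTRY[dtype_header["type"]].deserialize(dtype_header,
                                                           frames[:n])
        data, pos = frames[n], n + 1
        children = []
        for child_header in header["children"]:
            k = child_header["frame_count"]
            child_cls = REGISTRY[child_header["type"]]
            children.append(child_cls.deserialize(child_header,
                                                  frames[pos:pos + k]))
            pos += k
        return cls(data, dtype, children)


@register
class IntervalColumn(Column):
    """No serialization code needed: the base class handles it."""
```

In this sketch, round-tripping an `IntervalColumn` with two integer children goes through `Column.serialize` and `Column.deserialize` only, and the closed-ness survives because it travels in the dtype's own header rather than in column-specific code.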