Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove abc inheritance from Serializable #8254

Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented May 14, 2021

Currently the Serializable class provides serialize and deserialize as abstractmethods via the mechanisms afforded by inheritance from abc.ABC. Since this class is purely internal to cudf and is not describing an abstract interface in a manner useful to consumers of our code, the benefits of the abstract base class concept are outweighed by the performance and maintenance costs. In particular, isinstance checks on subclasses of abc.ABC are much more expensive than for normal classes (due to an expensive implementation of __instancecheck__), and (for better or worse) our code base currently makes use of these checks extensively. In addition, in certain places we can benefit from the use of custom metaclasses in cudf, but their usage becomes more cumbersome with ABC because metaclasses then also have to inherit from ABCMeta (which brings along any associated complexities). This PR removes that inheritance, replacing it with a much simpler approach that simply implements serialize and deserialize as raising NotImplementedError.

@vyasr vyasr added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. Performance Performance related issue non-breaking Non-breaking change labels May 14, 2021
@vyasr vyasr self-assigned this May 14, 2021
@vyasr vyasr requested a review from a team as a code owner May 14, 2021 21:40
@vyasr vyasr requested review from shwina and brandon-b-miller May 14, 2021 21:40
@vyasr vyasr added the improvement Improvement / enhancement to an existing function label May 14, 2021
@vyasr
Copy link
Contributor Author

vyasr commented May 14, 2021

Here are some back of the envelope benchmarks run using IPython's timeit magic. Note that the last three rows (creating a cudf.Index) are in microseconds, whereas the first six (simple isinstance checks) are in nanoseconds.

Old New
isinstance(1, cudf.DataFrame) 468 ns ± 8.86 ns 133 ns ± 1.53 ns
isinstance(1, cudf.Series) 459 ns ± 3.12 ns 134 ns ± 3.08 ns
isinstance(1, cudf.Index 480 ns ± 0.889 ns 129 ns ± 2.3 ns
isinstance(df, cudf.DataFrame) 115 ns ± 0.236 ns 129 ns ± 1.06 ns
isinstance(ser, cudf.Series) 772 ns ± 3.3 ns 156 ns ± 1.33 ns
isinstance(idx, cudf.Index) 776 ns ± 16.1 ns 155 ns ± 2.88 ns
cudf.Index(df) 347 µs ± 1.64 µs 307 µs ± 5.09 µs
cudf.Index(ser) 416 µs ± 205 ns 388 µs ± 3.59 µs
cudf.Index(idx) 417 µs ± 1.2 µs 386 µs ± 1.53 µs

@codecov
Copy link

codecov bot commented May 15, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@92a432d). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #8254   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      105           
  Lines                   ?    17934           
  Branches                ?        0           
===============================================
  Hits                    ?    14866           
  Misses                  ?     3068           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92a432d...e0b86d6. Read the comment docs.

@vyasr vyasr changed the title Remove abc inheritance from Serializable Remove abc inheritance from Serializable May 17, 2021
@vyasr vyasr changed the title Remove abc inheritance from Serializable Remove abc inheritance from Serializable May 17, 2021
@shwina
Copy link
Contributor

shwina commented May 18, 2021

LGTM. @jakirkham could you please take a quick look too when you have a moment? :)

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jakirkham
Copy link
Member

Just to note, cuML uses Serializable via Buffer

cc @dantegd

@jakirkham
Copy link
Member

Are there use cases where this isinstance overhead is noticeable? Do we have an end user workflow where making this change has an impact?

@vyasr
Copy link
Contributor Author

vyasr commented May 18, 2021

Are there use cases where this isinstance overhead is noticeable? Do we have an end user workflow where making this change has an impact?

@jakirkham Yes, unfortunately. In an ideal world we would rely on ducktyping everywhere, but that's definitely not the case internally for various reasons. As a result, lots of different code paths are impacted here. I included a benchmark above of the difference in constructing an Index from different types, here's the relevant data again:

Old New
cudf.Index(df) 347 µs ± 1.64 µs 307 µs ± 5.09 µs
cudf.Index(ser) 416 µs ± 205 ns 388 µs ± 3.59 µs
cudf.Index(idx) 417 µs ± 1.2 µs 386 µs ± 1.53 µs

That was just intended as a representative example. The same type of difference will be observed in a large number of different operations. Take this simple binop example (run via IPython):

In [1]: import cudf
In [2]: x = cudf.Series([1, 2, 3])
In [3]: %timeit x + x

The before/after numbers there are 158 µs ± 961 ns/139 µs ± 619 ns per loop. If I change that to a Series of size 1 million instead of 3, the difference is comparable (as expected): 531 µs ± 5.86 µs/508 µs ± 4.17 µs. Obviously these aren't massive changes, but they can add up when the effect is so pervasive throughout the code base. As long as we're refactoring the code with performance in mind we may as well avoid introducing unnecessary overheads like this.

In addition, removing ABCMeta from the metaclass hierarchy of all our classes makes it simpler to add custom metaclasses if necessary. Custom metaclasses are another thing that I'd love to avoid, but there's at least one case where matching pandas semantics requires us to do some things that would be more cleanly accomplished using a custom metaclass than by any other approach (including what we currently do) and it's easier to not have to worry about how overriding methods like __instancecheck__ will behave when creating a metaclass that inherits from ABCMeta.

@shwina
Copy link
Contributor

shwina commented May 20, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7427049 into rapidsai:branch-21.06 May 20, 2021
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
@vyasr vyasr deleted the feature/remove_serializable_abc branch January 14, 2022 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants