[REVIEW] Refactor memory_usage to improve performance #10537
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##   branch-22.06    #10537      +/-   ##
===============================================
  Coverage            ?        86.33%
===============================================
  Files               ?           140
  Lines               ?         22300
  Branches            ?             0
===============================================
  Hits                ?         19252
  Misses              ?          3048
  Partials            ?             0
===============================================
```

Continue to review the full report at Codecov.
If this optimization is really noticeable, then I would suggest the following additional changes:
- Remove all inheritance logic from this method, since the amount of shared logic is going to be minimal. Just implement it separately in each of the 4 leaf classes: `DataFrame`, `Series`, `Index`, and `MultiIndex`.
- Don't even bother constructing the list of column names for anything except `DataFrame`. You don't need it.
- `MultiIndex` definition: `sum([col.memory_usage for col in self._data.columns])`
- `GenericIndex` definition: `self._column.memory_usage`
- `Series` implementation: `self._column.memory_usage + (self._index.memory_usage if index else 0)`
- `DataFrame` implementation: basically what currently exists, except inlining the relevant logic from the current `Frame` and `IndexedFrame` implementations.
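The suggested per-class split can be sketched as follows. These are hypothetical, heavily simplified stand-in classes (`Column`, `GenericIndex`, `MultiIndex`, `Series`), not cudf's actual implementations; they only illustrate how each leaf type computes its usage directly without shared `Frame` logic.

```python
class Column:
    """Stand-in for a cudf column; memory_usage is its byte count."""

    def __init__(self, nbytes):
        self.memory_usage = nbytes


class GenericIndex:
    """A single-column index: usage is just the column's usage."""

    def __init__(self, column):
        self._column = column

    def memory_usage(self):
        return self._column.memory_usage


class MultiIndex:
    """A multi-column index: sum the columns, no names needed."""

    def __init__(self, columns):
        self._columns = columns

    def memory_usage(self):
        return sum(col.memory_usage for col in self._columns)


class Series:
    """One data column plus an index, counted only when requested."""

    def __init__(self, column, index):
        self._column = column
        self._index = index

    def memory_usage(self, index=True):
        return self._column.memory_usage + (
            self._index.memory_usage() if index else 0
        )
```

The point of the split is that only `DataFrame` ever needs a per-column name/size breakdown; the other three classes can return a single integer with no intermediate containers.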
@vyasr Updated the implementations accordingly but kept the inheritance in place for the sake of docstrings. If you want me to remove that as well and duplicate the docstrings in the actual methods, let me know; I'm fine with either approach.
LGTM. One minor suggestion.
Co-authored-by: Bradley Dice <[email protected]>
Thanks for that catch @bdice. @galipremsagar could you post new benchmarks so that we know what we're getting out of this?
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'd': [111, 123, 123]})

In [3]: df = df.set_index(['a', 'd'])

In [4]: x = df.index

# branch-22.06
In [5]: %timeit x.memory_usage()
12 µs ± 48.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit df.memory_usage()
232 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Now
In [7]: %timeit x.memory_usage()
10.3 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit df.memory_usage()
217 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
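For reference, `%timeit` measurements like the ones above can be reproduced outside IPython with the stdlib `timeit` module. This is a sketch using a hypothetical `FakeIndex` stand-in so it runs without cudf; substitute a real cudf `Index` to repeat the benchmark.

```python
import timeit


class FakeIndex:
    """Hypothetical stand-in for a cudf Index; returns a fixed byte count."""

    def memory_usage(self):
        return 24


x = FakeIndex()
n = 100_000
# timeit.timeit returns the total wall time for n calls, in seconds.
total = timeit.timeit(x.memory_usage, number=n)
print(f"{total / n * 1e6:.3f} µs per call")
```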
I'm proposing some alternative solutions offline since I don't think this is quite addressing the underlying issue.
Discussed offline with @vyasr and decided to optimize `sizeof_cudf_dataframe` instead:

```python
In [4]: df = cudf.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'd': [111, 123, 123]})

# branch-22.06
In [4]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
377 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# this PR
In [6]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
1.8 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
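Dask's `sizeof` is a type dispatcher: libraries register a fast, type-specific byte-count function (such as `sizeof_cudf_dataframe`) so Dask never has to walk Python objects generically. The idea can be sketched with the stdlib's `functools.singledispatch`; `MyFrame` here is a hypothetical stand-in, not a real cudf or dask_cudf class.

```python
import sys
from functools import singledispatch


@singledispatch
def sizeof(obj):
    # Generic fallback: the interpreter's own size estimate.
    return sys.getsizeof(obj)


class MyFrame:
    """Hypothetical container holding bytes-like payloads."""

    def __init__(self, buffers):
        self._buffers = buffers


@sizeof.register
def _(obj: MyFrame):
    # Fast path registered for the specific type: sum known buffer
    # sizes directly instead of recursing over Python objects.
    return sum(len(buf) for buf in obj._buffers)
```

Making the registered fast path cheap is what produced the ~200x win here: the dispatch itself is trivial, so the cost is entirely in the registered function's body.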
Awesome, updating the registered Dask function should hopefully have a much bigger impact on Dask workloads than the relatively minor optimizations that we can make to `memory_usage`.
rerun tests
Had a couple of questions below. Likely just missing context.
```diff
-if deep:
-    warnings.warn(
-        "The deep parameter is ignored and is only included "
-        "for pandas compatibility."
-    )
-return {name: col.memory_usage for name, col in self._data.items()}
+raise NotImplementedError
```
Why is this not implemented now?
We have moved all their implementations to their respective `Series`/`DataFrame`/`MultiIndex`/`Index` `.memory_usage` methods directly. Kept these parent class methods for the sake of docstrings.
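The pattern under discussion can be sketched like this: the base class keeps `memory_usage` only as a docstring carrier and raises, while each public subclass overrides it. Class names and the dict-of-sizes input are illustrative stand-ins, not cudf's actual signatures.

```python
class Frame:
    def memory_usage(self, index=True, deep=False):
        """Return the memory usage of the object.

        The docstring lives on the base class so subclasses can
        inherit it, but the base implementation is never callable.
        """
        raise NotImplementedError


class DataFrame(Frame):
    def __init__(self, column_sizes):
        self._column_sizes = column_sizes  # {name: nbytes}

    def memory_usage(self, index=True, deep=False):
        # The public leaf class supplies the real implementation.
        return dict(self._column_sizes)
```

Since users only ever hold instances of the public leaf classes, the base `NotImplementedError` is effectively unreachable in normal use, which is the point raised in the question above.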
Ok so a user is unlikely to run into this error since they would be using one of these other public facing objects. Is that right?
Yup, that's correct.
@gpucibot merge
Refactored `Frame.memory_usage` to return a tuple of lists: `(column_names, memory_usages)`. The motivation for this change is to remove redundant steps, i.e., `dict` (`Frame.memory_usage`) -> unpack & `dict` (`DataFrame.memory_usage`) -> unpack `dict` (in `Series.init`). Removing the `dict` returned by `Frame.memory_usage` results in a 5-10% faster execution of the external API. Not a huge speedup on its own, but dask memory-usage calculations that go through `sizeof_cudf_dataframe` are now 200x faster, and it quickly adds up when `dask_cudf` is used.
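The redundant-dict chain described above can be illustrated with a minimal sketch. The function names and the `(name, nbytes)` pair input are hypothetical simplifications, not cudf's actual internals.

```python
def memory_usage_old(columns):
    """Old shape: build a dict that every caller immediately unpacks
    back into names and sizes (and Series rebuilt a dict again)."""
    return {name: nbytes for name, nbytes in columns}


def memory_usage_new(columns):
    """New shape: return the two parallel lists directly, skipping
    the intermediate dict construction and unpacking."""
    names, sizes = [], []
    for name, nbytes in columns:
        names.append(name)
        sizes.append(nbytes)
    return names, sizes
```

Callers that want a dict (like `DataFrame.memory_usage`'s public output) can still build one with `dict(zip(names, sizes))`, while hot paths like a Dask `sizeof` function can consume the sizes list (or a plain sum) without paying for a dict at all.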