[REVIEW] Refactor `memory_usage` to improve performance #10537

galipremsagar · 2022-03-29T23:19:51Z

Refactored Frame.memory_usage to return a tuple of lists: (column_names, memory_usages).
Motivation for this change is to remove redundent steps i.e., dict(Frame.memory_usage)->unpack & dict (DataFrame.memory_usage)->unpack dict(in Series.init). Choosing to remove dict being returned by Frame.memory_usage will now result in a 5-10% faster execution of external API. Not a huge speedup but dask memory usage calculations that go through
sizeof_cudf_dataframe are now 200x faster.it quickly adds up when dask_cudf is used.

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

# THIS PR
In [3]: %timeit df.memory_usage()
198 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# branch-22.06
In [3]: %timeit df.memory_usage()
219 µs ± 726 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# branch-22.06
In [4]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
377 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# this PR
In [6]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
1.8 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

codecov · 2022-03-30T00:32:19Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.06@2d8d913). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.06   #10537   +/-   ##
===============================================
  Coverage                ?   86.33%           
===============================================
  Files                   ?      140           
  Lines                   ?    22300           
  Branches                ?        0           
===============================================
  Hits                    ?    19252           
  Misses                  ?     3048           
  Partials                ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d8d913...87589b1. Read the comment docs.

vyasr

If this optimization is really noticeable, then I would suggest the following additional changes:

Remove all inheritance logic from this method, since the amount of shared logic is going to be minimal. Just implement it separately in each of the 4 leaf classes: DataFrame, Series, Index, and MultiIndex.
Don't even bother constructing the list of column names for anything except DataFrame. You don't need it.
MultiIndex definition: sum([col.memory_usage for col in self._data.columns])
GenericIndex definition: self._column.memory_usage
Series implementation: self._column.memory_usage + (self._index.memory_usage if index else 0)
DataFrame implementation: Basically what currently exists except inlining the relevant logic from the current Frame and IndexedFrame implementations.

galipremsagar · 2022-03-30T17:20:16Z

@vyasr Updated the implementations accordingly but kept the inheritances in place for the sake of docstrings. If you want me to remove that as well and duplicate the docstrings in the actual methods let me know, I'm fine with either of the approaches.

bdice

LGTM. One minor suggestion.

python/cudf/cudf/core/multiindex.py

Co-authored-by: Bradley Dice <[email protected]>

vyasr · 2022-03-30T19:09:36Z

Thanks for that catch @bdice.

@galipremsagar could you post new benchmarks so that we know what we're getting out of this?

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/core/series.py

Co-authored-by: Vyas Ramasubramani <[email protected]>

galipremsagar · 2022-03-30T19:19:06Z

@galipremsagar could you post new benchmarks so that we know what we're getting out of this?

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

In [3]: df = df.set_index(['a', 'd'])

In [4]: x = df.index

# branch-22.06

In [5]: %timeit x.memory_usage()
12 µs ± 48.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit df.memory_usage()
232 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



# Now
In [7]: %timeit x.memory_usage()
10.3 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [6]: %timeit df.memory_usage()
217 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

vyasr · 2022-03-30T19:58:24Z

I'm proposing some alternative solutions offline since I don't think this is quite addressing the underlying issue.

galipremsagar · 2022-03-30T21:07:57Z

Discussed offline with @vyasr and decided to optimize sizeof_cudf_dataframe:

In [4]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

# branch-22.06
In [4]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
377 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# this PR
In [6]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
1.8 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

vyasr

Awesome updating the registered Dask function should hopefully have a much bigger impact on Dask workloads than the relatively minor optimizations that we can make to memory_usage.

python/cudf/cudf/core/dataframe.py

galipremsagar · 2022-03-30T22:49:56Z

rerun tests

jakirkham

Had a couple questions below. Likely just missing context

jakirkham · 2022-03-30T22:22:00Z

python/cudf/cudf/core/frame.py

-        if deep:
-            warnings.warn(
-                "The deep parameter is ignored and is only included "
-                "for pandas compatibility."
-            )
-        return {name: col.memory_usage for name, col in self._data.items()}
+        raise NotImplementedError


Why is this not implemented now?

We have moved all their implementations to their respective Series/DataFrame/MultiIndex/Index .memory_usage methods directly. Kept these parent class methods for the sake of docstrings.

Ok so a user is unlikely to run into this error since they would be using one of these other public facing objects. Is that right?

Yup, that's correct.

python/cudf/cudf/core/indexed_frame.py

galipremsagar · 2022-03-31T00:46:46Z

@gpucibot merge

refactor memory_usage

e600e47

galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 29, 2022

galipremsagar requested a review from a team as a code owner March 29, 2022 23:19

galipremsagar self-assigned this Mar 29, 2022

galipremsagar requested review from vyasr and charlesbluca March 29, 2022 23:19

vyasr requested changes Mar 30, 2022

View reviewed changes

galipremsagar added 2 commits March 30, 2022 07:32

Merge remote-tracking branch 'upstream/branch-22.06' into misc_mem_usage

7cb0384

address reviews

f232a74

galipremsagar requested a review from vyasr March 30, 2022 17:20

bdice approved these changes Mar 30, 2022

View reviewed changes

python/cudf/cudf/core/multiindex.py Outdated Show resolved Hide resolved

Update python/cudf/cudf/core/multiindex.py

816fd54

Co-authored-by: Bradley Dice <[email protected]>

vyasr reviewed Mar 30, 2022

View reviewed changes

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved

vyasr reviewed Mar 30, 2022

View reviewed changes

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved

galipremsagar and others added 2 commits March 30, 2022 14:13

Update python/cudf/cudf/core/series.py

15c8528

Co-authored-by: Vyas Ramasubramani <[email protected]>

Update python/cudf/cudf/core/dataframe.py

84951cb

Co-authored-by: Vyas Ramasubramani <[email protected]>

micro-optimize

9df42a5

galipremsagar added 2 commits March 30, 2022 13:57

Merge remote-tracking branch 'upstream/branch-22.06' into misc_mem_usage

b84058a

optimize sizeof

87589b1

galipremsagar requested a review from a team as a code owner March 30, 2022 21:06

vyasr approved these changes Mar 30, 2022

View reviewed changes

shwina reviewed Mar 30, 2022

View reviewed changes

python/cudf/cudf/core/dataframe.py Show resolved Hide resolved

shwina approved these changes Mar 30, 2022

View reviewed changes

galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 30, 2022

jakirkham reviewed Mar 30, 2022

View reviewed changes

rapids-bot bot merged commit 1355191 into rapidsai:branch-22.06 Mar 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] Refactor `memory_usage` to improve performance #10537

[REVIEW] Refactor `memory_usage` to improve performance #10537

galipremsagar commented Mar 29, 2022 •

edited

Loading

codecov bot commented Mar 30, 2022 •

edited

Loading

vyasr left a comment

galipremsagar commented Mar 30, 2022

bdice left a comment

vyasr commented Mar 30, 2022

galipremsagar commented Mar 30, 2022 •

edited

Loading

vyasr commented Mar 30, 2022

galipremsagar commented Mar 30, 2022 •

edited

Loading

vyasr left a comment

galipremsagar commented Mar 30, 2022

jakirkham left a comment

jakirkham Mar 30, 2022

galipremsagar Mar 30, 2022

jakirkham Mar 30, 2022

galipremsagar Mar 30, 2022

galipremsagar commented Mar 31, 2022

[REVIEW] Refactor memory_usage to improve performance #10537

[REVIEW] Refactor memory_usage to improve performance #10537

Conversation

galipremsagar commented Mar 29, 2022 • edited Loading

codecov bot commented Mar 30, 2022 • edited Loading

Codecov Report

vyasr left a comment

Choose a reason for hiding this comment

galipremsagar commented Mar 30, 2022

bdice left a comment

Choose a reason for hiding this comment

vyasr commented Mar 30, 2022

galipremsagar commented Mar 30, 2022 • edited Loading

vyasr commented Mar 30, 2022

galipremsagar commented Mar 30, 2022 • edited Loading

vyasr left a comment

Choose a reason for hiding this comment

galipremsagar commented Mar 30, 2022

jakirkham left a comment

Choose a reason for hiding this comment

jakirkham Mar 30, 2022

Choose a reason for hiding this comment

galipremsagar Mar 30, 2022

Choose a reason for hiding this comment

jakirkham Mar 30, 2022

Choose a reason for hiding this comment

galipremsagar Mar 30, 2022

Choose a reason for hiding this comment

galipremsagar commented Mar 31, 2022

[REVIEW] Refactor `memory_usage` to improve performance #10537

[REVIEW] Refactor `memory_usage` to improve performance #10537

galipremsagar commented Mar 29, 2022 •

edited

Loading

codecov bot commented Mar 30, 2022 •

edited

Loading

galipremsagar commented Mar 30, 2022 •

edited

Loading

galipremsagar commented Mar 30, 2022 •

edited

Loading