Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std #7453

Merged
merged 12 commits into rapidsai:branch-0.19 from rjzamora:var-stable
Feb 26, 2021

Conversation

rjzamora (Member)

Closes #7402

This PR improves the numerical stability of the var (and indirectly std) methods in DataFrame and Series. As discussed in #7402, the existing (naive) approach is problematic for large numbers with relatively small var/std.

Note that follow-up work may be needed to improve the algorithm(s) in groupby.
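
For context, a minimal sketch (not from this PR) of the instability: with the naive single-pass formula E[x^2] - (E[x])^2, both terms are huge for large values, and subtracting them cancels most of the significant digits. A two-pass formulation that subtracts the mean first avoids this:

import numpy as np

x = np.random.normal(loc=1e8, scale=1.0, size=1_000_000)

# Naive single-pass formula: both terms are ~1e16 in float64, so the
# subtraction cancels nearly all significant digits (the result can
# even come out negative).
naive_var = (x ** 2).mean() - x.mean() ** 2

# Stable two-pass formula: compute the mean first, then average the
# squared deviations from it.
stable_var = ((x - x.mean()) ** 2).mean()

print(naive_var, stable_var)  # naive is wildly off; stable is ~1.0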

@rjzamora rjzamora added dask Dask issue non-breaking Non-breaking change labels Feb 25, 2021
@@ -280,6 +281,7 @@ def var(
split_every=False,
dtype=None,
out=None,
naive=False,
Member Author

We can remove the naive option once we can confirm that the new approach is more stable and avoids any performance degradation (so far, it seems performance is better anyway).

@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 25, 2021
@wphicks left a comment

This approach looks great to me! Clean and clearly-implemented. I had one question, but it's primarily for my own education. I'm happy to approve this as soon as that incorrect dependencies error is understood or fixed.

@@ -434,6 +419,78 @@ class Index(Series, dd.core.Index):
_partition_type = cudf.Index


def _naive_var(ddf, meta, skipna, ddof, split_every, out):
num = ddf._get_numeric_data()
x = 1.0 * num.sum(skipna=skipna, split_every=split_every)

Naive question: What does the 1.0 * do for us here?

Member

My guess (though Rick, please feel free to correct me) is that this is needed to cast to a float.

Member Author

Good question :)

This _naive_var function is a copy-paste of the original code, so I didn't actually write it. However, my assumption was (as John suggested) that the 1.0 is meant to ensure that all results are cast to float.


Ah gotcha! I figured that might be it. Given that it's a copy-paste, I'm good with it, but in general I would call for an explicit float call if we're not already performing a necessary arithmetic operation. Thanks!

Member

Is this actually a single value? I suspect (though don't know for sure) that this is a Series or DataFrame, in which case calling float(...) on it won't work.


Ah, sorry; then in that case I'd advocate for a more explicit dtype conversion. Regardless, I'm not hung up on this for this particular PR. It's the sort of thing I'd insist on for new code, but I'm not too fussed about a copy-paste; it can always be cleaned up in a specific code clean-up PR.

Member

Yeah, it's unclear whether a scalar is also a possibility, in which case this is a little handy for letting us succinctly cover all those cases, but I'm not attached to the logic here either (and can see the argument for being more explicit).

Anyway, it sounds like we can table this for now. Would you like to open an issue so we can revisit later?
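
For reference, a minimal sketch (not from the PR) of the promotion behaviour being discussed; pandas is used here for portability, and cudf follows the same dtype-promotion rules:

import pandas as pd

s = pd.Series([1, 2, 3])                       # int64
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # int64 columns

print(type(1.0 * 5))      # <class 'float'>: scalars are promoted
print((1.0 * s).dtype)    # float64: the Series is promoted elementwise
print((1.0 * df).dtypes)  # float64 for every column

# float(...) only works in the scalar case; an explicit alternative for
# Series/DataFrame inputs would be .astype("float64").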

@jakirkham jakirkham added 2 - In Progress Currently a work in progress bug Something isn't working labels Feb 25, 2021

codecov bot commented Feb 25, 2021

Codecov Report

Merging #7453 (0ceeade) into branch-0.19 (43b44e1) will increase coverage by 0.36%.
The diff coverage is 84.69%.


@@               Coverage Diff               @@
##           branch-0.19    #7453      +/-   ##
===============================================
+ Coverage        81.80%   82.17%   +0.36%     
===============================================
  Files              101      101              
  Lines            16695    17118     +423     
===============================================
+ Hits             13658    14066     +408     
- Misses            3037     3052      +15     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 89.35% <ø> (+0.09%) ⬆️
python/dask_cudf/dask_cudf/core.py 71.67% <79.41%> (-2.60%) ⬇️
python/cudf/cudf/core/column_accessor.py 95.47% <95.65%> (+2.53%) ⬆️
python/cudf/cudf/core/dataframe.py 90.58% <100.00%> (+0.12%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/utils/docutils.py 97.36% <0.00%> (-2.64%) ⬇️
python/cudf/cudf/core/abc.py 87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
... and 46 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 439 to 441
# TODO: x.sum()/n seems to be faster than x.mean()
# on Quadro RTX 8000 - Need to compare on V/A100
avg = x.mean(skipna=skipna)
Member Author

Interesting result: I found that the new approach is slightly slower on my local RTX 8000-based machine unless I use x.sum()/n in place of x.mean(). I plan to get an environment set up on a V100-based machine soon. I will see if I get different behaviour on datacenter-grade hardware and update here.

Collaborator

Should we hold off on merging until you report back on this?


Interesting! And very surprising. If I can help dig into that one at all, please feel free to ping me.

Member Author

> Should we hold off on merging until you report back on this?

Not sure - I should have an update in the next few hours (just set up a new env and am building cudf now). Should we remove the "naive" path entirely if the new approach has good performance? Either way, if mean is still slower than sum/n I will certainly welcome help from @wphicks :)

Member

We may want to be careful here. x.mean() by default is not the same as x.sum()/len(x) if x has nulls and skipna is True (default)
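
A small illustration of the difference (hypothetical data; pandas shown for portability, and cudf follows the same skipna semantics):

import pandas as pd

x = pd.Series([1.0, None, 3.0])

print(x.mean())             # 2.0: skipna=True divides by the non-null count (2)
print(x.sum() / len(x))     # 1.33...: len(x) counts the null row too
print(x.sum() / x.count())  # 2.0: count() gives the correct denominator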

Member

Also Nick's point above may explain some of the performance difference (handling nulls vs. not)

Member Author

> We may want to be careful here. x.mean() by default is not the same as x.sum()/len(x) if x has nulls and skipna is True (default)

Ah - The case I was testing did not have nulls, but this is a good point.

Member Author

I can confirm that the "sum/n is faster than mean" statement was only true because n was a scalar value. If skipna=True, we actually need the full count for n, because there may be null values. In the skipna=True case, mean is faster (as expected).

Currently, the new var approach seems to be a bit slower than the original, so it may make sense to keep the naive=True option for now.
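
To make that concrete, a hedged sketch (hypothetical data, not the PR's code) of why the scalar-n shortcut only works in the null-free case, and what the skipna=True path has to compute instead:

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0, 5.0]})

# With skipna=True the denominator must be a per-column non-null count,
# not the scalar len(df):
n = df.count()                      # a: 3
by_hand = df.sum(skipna=True) / n   # two reductions plus a division
fused = df.mean(skipna=True)        # a single reduction with the same result

assert (by_hand == fused).all()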

@wphicks left a comment

LGTM!

@pentschev (Member) left a comment

I don't have much to add, other than that I did some local testing and the results look great now! Thanks a lot, @rjzamora!

@rjzamora rjzamora marked this pull request as ready for review February 26, 2021 14:56
@rjzamora rjzamora requested a review from a team as a code owner February 26, 2021 14:56
@jakirkham jakirkham added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Feb 26, 2021
@jakirkham (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f79a841 into rapidsai:branch-0.19 Feb 26, 2021
@rjzamora rjzamora deleted the var-stable branch February 27, 2021 21:26
Successfully merging this pull request may close these issues:

[BUG] std incorrectly calculated in dask describe calls for specific dataframe values (#7402)