Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std #7453

Merged
merged 12 commits into rapidsai:branch-0.19 from rjzamora:var-stable
Feb 26, 2021

Conversation

rjzamora (Member)

Closes #7402

This PR improves the numerical stability of the var (and indirectly std) methods in DataFrame and Series. As discussed in #7402, the existing (naive) approach is problematic for large numbers with relatively small var/std.

Note that follow-up work may be needed to improve the algorithm(s) in groupby.
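
For context, a minimal sketch (not from this PR) of the instability: with the naive single-pass formula E[x^2] - (E[x])^2, both terms are huge for large values, and subtracting them cancels most of the significant digits. A two-pass formulation that subtracts the mean first avoids this:

import numpy as np

x = np.random.normal(loc=1e8, scale=1.0, size=1_000_000)

# Naive single-pass formula: both terms are ~1e16 in float64, so the
# subtraction cancels nearly all significant digits (the result can
# even come out negative).
naive_var = (x ** 2).mean() - x.mean() ** 2

# Stable two-pass formula: compute the mean first, then average the
# squared deviations from it.
stable_var = ((x - x.mean()) ** 2).mean()

print(naive_var, stable_var)  # naive is wildly off; stable is ~1.0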

@rjzamora rjzamora added dask Dask issue non-breaking Non-breaking change labels Feb 25, 2021
@@ -280,6 +281,7 @@ def var(
split_every=False,
dtype=None,
out=None,
naive=False,
Member Author

We can remove the naive option once we can confirm that the new approach is more stable and avoids any performance degradation (so far, it seems performance is better anyway).

@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 25, 2021
@wphicks left a comment

This approach looks great to me! Clean and clearly-implemented. I had one question, but it's primarily for my own education. I'm happy to approve this as soon as that incorrect dependencies error is understood or fixed.

@@ -434,6 +419,78 @@ class Index(Series, dd.core.Index):
_partition_type = cudf.Index


def _naive_var(ddf, meta, skipna, ddof, split_every, out):
num = ddf._get_numeric_data()
x = 1.0 * num.sum(skipna=skipna, split_every=split_every)

Naive question: What does the 1.0 * do for us here?

Member

My guess (though Rick, please feel free to correct me) is that this is needed to cast to a float.

Member Author

Good question :)

This _naive_var function is a copy-paste of the original code, so I didn't actually write it. However, my assumption was (as John suggested) that the 1.0 is meant to ensure that all results are cast to float.


Ah gotcha! I figured that might be it. Given that it's a copy-paste, I'm good with it, but in general I would call for an explicit float call if we're not already performing a necessary arithmetic operation. Thanks!

Member

Is this actually a single value? I suspect (though don't know for sure) that this is a Series or DataFrame, in which case calling float(...) on it won't work.


Ah, sorry; then in that case I'd advocate for a more explicit dtype conversion. Regardless, I'm not hung up on this for this particular PR. It's the sort of thing I'd insist on for new code, but I'm not too fussed about a copy-paste; it can always be cleaned up in a specific code clean-up PR.

Member

Yeah, it's unclear whether a scalar is also a possibility, in which case this is a little handy for letting us succinctly cover all those cases, but I'm not attached to the logic here either (and can see the argument for being more explicit).

Anyway, it sounds like we can table this for now. Would you like to open an issue so we can revisit later?
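
For reference, a minimal sketch (not from the PR) of the promotion behaviour being discussed; pandas is used here for portability, and cudf follows the same dtype-promotion rules:

import pandas as pd

s = pd.Series([1, 2, 3])                       # int64
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # int64 columns

print(type(1.0 * 5))      # <class 'float'>: scalars are promoted
print((1.0 * s).dtype)    # float64: the Series is promoted elementwise
print((1.0 * df).dtypes)  # float64 for every column

# float(...) only works in the scalar case; an explicit alternative for
# Series/DataFrame inputs would be .astype("float64").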

@jakirkham jakirkham added 2 - In Progress Currently a work in progress bug Something isn't working labels Feb 25, 2021

codecov bot commented Feb 25, 2021

Codecov Report

Merging #7453 (0ceeade) into branch-0.19 (43b44e1) will increase coverage by 0.36%.
The diff coverage is 84.69%.


@@               Coverage Diff               @@
##           branch-0.19    #7453      +/-   ##
===============================================
+ Coverage        81.80%   82.17%   +0.36%     
===============================================
  Files              101      101              
  Lines            16695    17118     +423     
===============================================
+ Hits             13658    14066     +408     
- Misses            3037     3052      +15     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 89.35% <ø> (+0.09%) ⬆️
python/dask_cudf/dask_cudf/core.py 71.67% <79.41%> (-2.60%) ⬇️
python/cudf/cudf/core/column_accessor.py 95.47% <95.65%> (+2.53%) ⬆️
python/cudf/cudf/core/dataframe.py 90.58% <100.00%> (+0.12%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/utils/docutils.py 97.36% <0.00%> (-2.64%) ⬇️
python/cudf/cudf/core/abc.py 87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
... and 46 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 439 to 441
# TODO: x.sum()/n seems to be faster than x.mean()
# on Quadro RTX 8000 - Need to compare on V/A100
avg = x.mean(skipna=skipna)
Member Author

Interesting result: I found that the new approach is slightly slower on my local RTX 8000-based machine unless I use x.sum()/n in place of x.mean(). I plan to get an environment set up on a V100-based machine soon. I will see if I get different behaviour on datacenter-grade hardware and update here.

Collaborator

Should we hold off on merging until you report back on this?


Interesting! And very surprising. If I can help dig into that one at all, please feel free to ping me.

Member Author

> Should we hold off on merging until you report back on this?

Not sure - I should have an update in the next few hours (just set up a new env and am building cudf now). Should we remove the "naive" path entirely if the new approach has good performance? Either way, if mean is still slower than sum/n I will certainly welcome help from @wphicks :)

Member

We may want to be careful here. x.mean() by default is not the same as x.sum()/len(x) if x has nulls and skipna is True (default)
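
A small illustration of the difference (hypothetical data; pandas shown for portability, and cudf follows the same skipna semantics):

import pandas as pd

x = pd.Series([1.0, None, 3.0])

print(x.mean())             # 2.0: skipna=True divides by the non-null count (2)
print(x.sum() / len(x))     # 1.33...: len(x) counts the null row too
print(x.sum() / x.count())  # 2.0: count() gives the correct denominator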

Member

Also Nick's point above may explain some of the performance difference (handling nulls vs. not)

Member Author

> We may want to be careful here. x.mean() by default is not the same as x.sum()/len(x) if x has nulls and skipna is True (default)

Ah - The case I was testing did not have nulls, but this is a good point.

Member Author

I can confirm that the "sum/n is faster than mean" statement was only true because n was a scalar value. If skipna=True, we actually need the full count for n, because there may be null values. In the skipna=True case, mean is faster (as expected).

Currently, the new var approach seems to be a bit slower than the original, so it may make sense to keep the naive=True option for now.
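
To make that concrete, a hedged sketch (hypothetical data, not the PR's code) of why the scalar-n shortcut only works in the null-free case, and what the skipna=True path has to compute instead:

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0, 5.0]})

# With skipna=True the denominator must be a per-column non-null count,
# not the scalar len(df):
n = df.count()                      # a: 3
by_hand = df.sum(skipna=True) / n   # two reductions plus a division
fused = df.mean(skipna=True)        # a single reduction with the same result

assert (by_hand == fused).all()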

@wphicks left a comment

LGTM!

@pentschev (Member) left a comment

I don't have much to add, other than that I did some local testing and the results look great now! Thanks a lot, @rjzamora!

@rjzamora rjzamora marked this pull request as ready for review February 26, 2021 14:56
@rjzamora rjzamora requested a review from a team as a code owner February 26, 2021 14:56
@jakirkham jakirkham added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Feb 26, 2021
@jakirkham (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f79a841 into rapidsai:branch-0.19 Feb 26, 2021
@rjzamora rjzamora deleted the var-stable branch February 27, 2021 21:26
Successfully merging this pull request may close these issues:

[BUG] std incorrectly calculated in dask describe calls for specific dataframe values (#7402)