Improve stability of dask_cudf.DataFrame.var and dask_cudf.DataFrame.std #7453
Conversation
```diff
@@ -280,6 +281,7 @@ def var(
         split_every=False,
         dtype=None,
         out=None,
+        naive=False,
```
We can remove the `naive` option once we can confirm that the new approach is more stable and avoids any performance degradation (so far, it seems performance is better anyway).
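For context, a minimal sketch of how the new keyword would be exercised (the data and partition count here are illustrative assumptions, not from the PR):

```python
import cudf
import dask_cudf

# Large values with a small spread, the case the PR targets.
df = cudf.DataFrame({"a": [1.0e9 + i for i in range(8)]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

stable = ddf.var().compute()            # new, more stable path (the default)
naive = ddf.var(naive=True).compute()   # original one-pass path, kept for comparison
```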
This approach looks great to me! Clean and clearly implemented. I had one question, but it's primarily for my own education. I'm happy to approve this as soon as that incorrect `dependencies` error is understood or fixed.
```diff
@@ -434,6 +419,78 @@ class Index(Series, dd.core.Index):
     _partition_type = cudf.Index


+def _naive_var(ddf, meta, skipna, ddof, split_every, out):
+    num = ddf._get_numeric_data()
+    x = 1.0 * num.sum(skipna=skipna, split_every=split_every)
```
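For readers without the full diff: a hedged sketch of how `_naive_var` plausibly continues, reconstructed from the discussion below (the one-pass formula and the `ddof` handling are assumptions, not verbatim PR code):

```python
# Hedged continuation sketch (not verbatim PR code):
x2 = 1.0 * (num ** 2).sum(skipna=skipna, split_every=split_every)
n = num.count(split_every=split_every)   # per-column non-null counts

# One-pass ("naive") variance: (sum(x^2) - sum(x)^2 / n) / (n - ddof).
# Subtracting two large, nearly equal quantities is the source of instability.
result = (x2 - x * x / n) / (n - ddof)
```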
Naive question: what does the `1.0 *` do for us here?
My guess (though Rick, please feel free to correct me) is that this is needed to cast to a `float`.
Good question :) This `_naive_var` function is a copy-paste of the original code, so I didn't actually write it. However, my assumption was (as John suggested) that the `1.0` is meant to ensure that all results are cast to `float`.
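A quick illustration of that promotion, using a plain cudf Series (the values are arbitrary):

```python
import cudf

s = cudf.Series([1, 2, 3])   # dtype: int64
print((1.0 * s).dtype)       # float64 -- multiplying by a float promotes the dtype
```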
Ah gotcha! I figured that might be it. Given that it's a copy-paste, I'm good with it, but in general I would call for an explicit `float` call if we're not already performing a necessary arithmetic operation. Thanks!
Is this actually a single value? I suspect (though don't know) that this is probably a Series or DataFrame, in which case calling `float(...)` on it won't work.
Ah, sorry; then in that case I'd advocate for a more explicit dtype conversion. Regardless, I'm not hung up on this for this particular PR. It's the sort of thing I'd insist on for new code, but I'm not too fussed about a copy-paste; it can always be cleaned up in a specific code clean-up PR.
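For reference, a more explicit spelling of that conversion might look like this (a hypothetical rewrite, not code from the PR):

```python
# Explicit dtype conversion instead of relying on `1.0 *` promotion:
x = num.sum(skipna=skipna, split_every=split_every).astype("float64")
```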
Yeah, it's unclear if a scalar is also a possibility, in which case this is a little handy for allowing us to succinctly cover all those cases, but I'm not attached to the logic here either (and can see the argument for being more explicit).

Anyway, it sounds like we can table this for now. Would you like to open an issue so we can revisit later?
Codecov Report

```diff
@@            Coverage Diff             @@
##           branch-0.19    #7453      +/-   ##
===============================================
+ Coverage        81.80%   82.17%    +0.36%
===============================================
  Files              101      101
  Lines            16695    17118      +423
===============================================
+ Hits             13658    14066      +408
- Misses            3037     3052       +15
===============================================
```

Continue to review the full report at Codecov.
python/dask_cudf/dask_cudf/core.py (outdated)

```python
# TODO: x.sum()/n seems to be faster than x.mean()
# on Quadro RTX 8000 - Need to compare on V/A100
avg = x.mean(skipna=skipna)
```
Interesting result: I found that the new approach is slightly slower on my local RTX 8000-based machine unless I use `x.sum()/n` in place of `x.mean()`. I plan to get an environment set up on a V100-based machine soon. I will see if I get different behaviour on datacenter-grade hardware and update here.
Should we hold off on merging until you report back on this?
Interesting! And very surprising. If I can help dig into that one at all, please feel free to ping me.
> Should we hold off on merging until you report back on this?

Not sure - I should have an update in the next few hours (I just set up a new env and am building cudf now). Should we remove the "naive" path entirely if the new approach has good performance? Either way, if `mean` is still slower than `sum/n`, I will certainly welcome help from @wphicks :)
We may want to be careful here. `x.mean()` by default is not the same as `x.sum()/len(x)` if `x` has nulls and `skipna` is True (the default).
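A small cudf illustration of that difference (arbitrary values; note the null):

```python
import cudf

x = cudf.Series([1.0, None, 3.0])

print(x.mean())          # 2.0   -> sums non-null values, divides by the non-null count
print(x.sum() / len(x))  # 1.33  -> divides by all rows, including the null
```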
Also, Nick's point above may explain some of the performance difference (handling nulls vs. not).
> We may want to be careful here. x.mean() by default is not the same as x.sum()/len(x) if x has nulls and skipna is True (default)

Ah, the case I was testing did not have nulls, but this is a good point.
I can confirm that the "sum/n is faster than mean" statement was only true because `n` was a scalar value. If `skipna=True`, we actually need the full `count` for `n`, because there may be null values. In the `skipna=True` case, `mean` is faster (as expected).

Currently, the new `var` approach seems to be a bit slower than the original, so it may make sense to keep the `naive=True` option for now.
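To make the count-based denominator concrete, a minimal cudf sketch (column names and values are made up):

```python
import cudf

df = cudf.DataFrame({"a": [1.0, None, 3.0], "b": [2.0, 4.0, None]})

n = df.count()       # per-column non-null counts: a=2, b=2
print(df.sum() / n)  # matches df.mean(), since mean skips nulls by default
print(df.mean())
```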
LGTM!
I don't have much to add, besides that I did some local testing and the results look great now! Thanks a lot, @rjzamora!
…ccational failures are not related to var)
@gpucibot merge
Closes #7402

This PR improves the numerical stability of the `var` (and indirectly `std`) methods in `DataFrame` and `Series`. As discussed in #7402, the existing (naive) approach is problematic for large numbers with relatively small var/std.

Note that follow-up work may be needed to improve the algorithm(s) in `groupby`.
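To illustrate the instability being fixed, here is a minimal NumPy sketch (synthetic data; the PR itself operates on dask_cudf objects). With large values and a small spread, the one-pass formula loses nearly all of its significant digits:

```python
import numpy as np

rng = np.random.default_rng(42)
x = 1e8 + rng.standard_normal(1_000_000)  # true variance is ~1.0

naive = (x ** 2).mean() - x.mean() ** 2   # one-pass E[x^2] - E[x]^2: catastrophic cancellation
stable = ((x - x.mean()) ** 2).mean()     # two-pass form: close to 1.0

print(naive, stable)  # the naive result can be far off, or even negative
```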