[REVIEW] Series covariance and Pearson correlation #2719
Dask versions of any
This is now ready for review
@beckernick approved, suggest you open an issue for a libcudf implementation if you haven't already
This PR addresses #1267
I can't reproduce this test failure
rerun tests
@beckernick You need to set the environment variable:
Yep, you're right. Will update the test
…tocol tests due to implementation
…nick/cudf into feature/python-covariance
Codecov Report
@@             Coverage Diff              @@
##           branch-0.10    #2719   +/-   ##
===============================================
+ Coverage      86.75%   86.81%   +0.05%
===============================================
  Files             49       49
  Lines           9055     9081     +26
===============================================
+ Hits            7856     7884     +28
+ Misses          1199     1197      -2
rerun tests
Summary of Changes
- Adds DataFrame-level covariance and Pearson correlation via CuPy.
A future libcuDF-only version will likely be able to support DataFrame-level covariance and correlation natively. Currently, in Python, we can't do binary ops between DataFrames and Series (#2166), nor can we do efficient transposes. We could work around the binary op issue, but we need to transpose to leverage the matrix form of the covariance formula.
As a result, we'd need to calculate each covariance value independently. Casting nans to nulls and dropping nulls is necessary for the math, but adds a few milliseconds per cov/corr call. However, since the covariance math alone takes 3-5 ms, and the number of pairwise calls grows quadratically with the number of columns, runtime would blow up once ncols > 10-30 no matter what if we don't use the matrix formula.
Instead, we can use CuPy for now.
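To illustrate the trade-off described above, here is a minimal sketch contrasting the matrix form of the covariance formula with the pairwise approach. It is not the code from this PR: the function names (`cov_matrix`, `corr_matrix`, `cov_pairwise`) are hypothetical, and NumPy is used as a CPU stand-in on the assumption that the same array operations are available in CuPy. It also assumes nulls/nans have already been dropped.

```python
import numpy as np


def cov_matrix(data):
    """Sample covariance of all column pairs using the matrix formula.

    `data` is a 2-D array with rows as observations and columns as variables.
    Assumes missing values have already been removed.
    """
    x = np.asarray(data, dtype=np.float64)
    n = x.shape[0]
    centered = x - x.mean(axis=0)           # subtract each column's mean
    return centered.T @ centered / (n - 1)  # needs a transpose, one matmul


def corr_matrix(data):
    """Pearson correlation derived from the covariance matrix."""
    cov = cov_matrix(data)
    std = np.sqrt(np.diag(cov))             # per-column standard deviations
    return cov / np.outer(std, std)


def cov_pairwise(data):
    """Same result computed one pair at a time: ncols * ncols separate calls."""
    x = np.asarray(data, dtype=np.float64)
    n, ncols = x.shape
    out = np.empty((ncols, ncols))
    for i in range(ncols):
        for j in range(ncols):
            a, b = x[:, i], x[:, j]
            out[i, j] = ((a - a.mean()) * (b - b.mean())).sum() / (n - 1)
    return out
```

The pairwise version issues one computation per column pair, so at a few milliseconds per call the total cost grows quadratically with the column count, while the matrix version does the same work in a single matrix product.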