Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Series covariance and Pearson correlation #2719

Merged
merged 19 commits into from
Oct 2, 2019
Merged

[REVIEW] Series covariance and Pearson correlation #2719

merged 19 commits into from
Oct 2, 2019

Conversation

beckernick
Copy link
Member

@beckernick beckernick commented Sep 3, 2019

Summary of Changes

  • Adds Series-level covariance and Pearson correlation as a stopgap until they are directly supported via libcuDF
    - Adds DataFrame-level covariance and Pearson correlation via CuPy.

A future libcuDF-only version will likely be able to better support DataFrame level covariance and correlation natively. Currently, in Python, we can't do binary ops between DataFrames and Series (#2166) , nor can we do do efficient transposes. We could work around the binary op issue, but we need to transpose to leverage the matrix form of the covariance formula.

As a result, we'd need to independently calculate each covariance value. Casting nans to nulls and dropping nulls is necessary for the math, but adds a few milliseconds per cov/corr call. However, since just the covariance math itself takes 3-5 ms, we would get a significant quadratic explosion if ncols > 10-30 no matter what if we don't use the matrix formula.

Instead, we can use CuPy for now.

@beckernick beckernick requested a review from a team as a code owner September 3, 2019 22:24
@beckernick beckernick self-assigned this Sep 3, 2019
@beckernick beckernick added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels Sep 3, 2019
@beckernick beckernick changed the title [WIP] Series covariance and Pearson correlation [WIP] Series and DataFrame covariance and Pearson correlation Sep 18, 2019
@beckernick beckernick changed the title [WIP] Series and DataFrame covariance and Pearson correlation [REVIEW] Series and DataFrame covariance and Pearson correlation Sep 18, 2019
@beckernick beckernick added 3 - Ready for Review Ready for review by team 2 - In Progress Currently a work in progress and removed 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team labels Sep 18, 2019
@beckernick beckernick changed the title [REVIEW] Series and DataFrame covariance and Pearson correlation [WIP] Series and DataFrame covariance and Pearson correlation Sep 18, 2019
@beckernick
Copy link
Member Author

beckernick commented Sep 18, 2019

Dask versions of any corr/cov methods will be blocked until we:

@beckernick beckernick changed the title [WIP] Series and DataFrame covariance and Pearson correlation [REVIEW] Series and DataFrame covariance and Pearson correlation Sep 18, 2019
@beckernick beckernick added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 18, 2019
@beckernick
Copy link
Member Author

This is now ready for review

@beckernick beckernick changed the title [REVIEW] Series and DataFrame covariance and Pearson correlation [REVIEW] Series covariance and Pearson correlation Sep 26, 2019
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 30, 2019
@kkraus14
Copy link
Collaborator

@beckernick approved, suggest you open an issue for a libcudf implementation if you haven't already

@beckernick
Copy link
Member Author

beckernick commented Sep 30, 2019

Will update the existing issue to reflect libcuDF implementation when we merge this. Nevermind, we didn't make an initial issue that fully covers this. Will open one when this merges and cross-reference.

This PR addresses #1267

@kkraus14
Copy link
Collaborator

____________ test_array_func_missing_cudf_series[<lambda>0-np_ar0] _____________

np_ar = array([0.85203364, 0.37148003, 0.24277552, 0.6316116 , 0.8227346 ,
       0.13154083, 0.59179169, 0.68867617, 0.682033...48, 0.30085795, 0.55301161, 0.92713472, 0.34632781,
       0.31003306, 0.79814686, 0.76295427, 0.04999537, 0.89813435])
func = <function <lambda> at 0x7ff21fcabae8>

    @pytest.mark.skipif(missing_arrfunc_cond, reason=missing_arrfunc_reason)
    @pytest.mark.parametrize("np_ar", [np.random.random(100)])
    @pytest.mark.parametrize(
        "func",
        [
            lambda x: np.cov(x, x),
            lambda x: np.dot(x, x),
            lambda x: np.linalg.norm(x),
        ],
    )
    def test_array_func_missing_cudf_series(np_ar, func):
        cudf_ser = cudf.Series(np_ar)
        with pytest.raises(TypeError):
>           func(cudf_ser)
E           Failed: DID NOT RAISE <class 'TypeError'>

cudf/tests/test_array_function.py:53: Failed

@kkraus14 kkraus14 added 0 - Waiting on Author Waiting for author to respond to review and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Sep 30, 2019
@beckernick
Copy link
Member Author

I can't reproduce this test failure

@beckernick
Copy link
Member Author

rerun tests

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 0 - Waiting on Author Waiting for author to respond to review labels Oct 1, 2019
@kkraus14
Copy link
Collaborator

kkraus14 commented Oct 1, 2019

@beckernick You need to set the environment variable: export NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1, I'm guessing it's because you're implementing cov that it's working for numpy __array_function__ np.cov

@beckernick
Copy link
Member Author

beckernick commented Oct 1, 2019

Yep, you're right. Will update the test

@codecov
Copy link

codecov bot commented Oct 1, 2019

Codecov Report

Merging #2719 into branch-0.10 will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.10    #2719      +/-   ##
===============================================
+ Coverage        86.75%   86.81%   +0.05%     
===============================================
  Files               49       49              
  Lines             9055     9081      +26     
===============================================
+ Hits              7856     7884      +28     
+ Misses            1199     1197       -2
Impacted Files Coverage Δ
python/cudf/cudf/core/series.py 93.7% <100%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a902d7c...bf591b8. Read the comment docs.

@kkraus14
Copy link
Collaborator

kkraus14 commented Oct 1, 2019

rerun tests

1 similar comment
@kkraus14
Copy link
Collaborator

kkraus14 commented Oct 2, 2019

rerun tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants