Add Pearson correlation for sort groupby (python) #9166

Merged
merged 149 commits into from
Nov 30, 2021
Changes from 141 commits
Commits
763c53a
add CORR aggregation to groupby, headers, classes, visitor(sort)
karthikeyann Aug 31, 2021
4c989a9
add group_corr.cu
karthikeyann Aug 31, 2021
015795c
add unit test temporarily
karthikeyann Aug 31, 2021
ba6e50a
create new PR for pearson groupby correlation
skirui-source Sep 2, 2021
b198a51
adding corr. func in python
skirui-source Sep 2, 2021
b7464a2
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 3, 2021
3d00307
Revert "create new PR for pearson groupby correlation"
karthikeyann Sep 6, 2021
1200437
Revert "adding corr. func in python"
karthikeyann Sep 6, 2021
178f28a
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into fea-sort…
karthikeyann Sep 6, 2021
60293cc
rename CORR to CORRELATION, added correlation_type as arg
karthikeyann Sep 6, 2021
d421d6d
add shallow_hash(column_view)
karthikeyann Sep 7, 2021
9c4a9f3
add CompoundTypes to type_lists
karthikeyann Sep 7, 2021
a3dd235
add shallow_hash tests
karthikeyann Sep 7, 2021
2365d07
add column copy test
karthikeyann Sep 7, 2021
88726a4
add shallow_equal(column_view) and tests
karthikeyann Sep 7, 2021
d52509d
update result_cache to use shallow_hash, shallow_equal
karthikeyann Sep 8, 2021
d9a8bd7
Update cpp/include/cudf/column/column_view.hpp
karthikeyann Sep 8, 2021
d96f870
added definition of correlation() in cython
skirui-source Sep 8, 2021
b6b92df
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 9, 2021
e3d6877
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 9, 2021
7e7f250
ignore data, nullmask, offset if parent size is empty
karthikeyann Sep 13, 2021
d1d5c3c
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 13, 2021
522e5a3
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 13, 2021
0005154
is_shallow_equal ignore children states for empty column. (not childr…
karthikeyann Sep 13, 2021
002b777
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 14, 2021
82b5a26
set STRUCT_AGGS to CORRELATION
skirui-source Sep 14, 2021
e692053
for empty column, ignore child pointers in shallow_hash
karthikeyann Sep 14, 2021
44372bc
rename is_shallow_equal to is_shallow_equivalent
karthikeyann Sep 14, 2021
d52fd53
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 15, 2021
d3e0053
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 15, 2021
e32935e
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 15, 2021
3aab04f
added ctypedef correlation_type. need to add tests
skirui-source Sep 15, 2021
ecc3a7d
use hash_combine for shallow hash
karthikeyann Sep 16, 2021
d2cd468
Apply suggestions from code review (jake)
karthikeyann Sep 16, 2021
fa40847
address review comments
karthikeyann Sep 17, 2021
f709b2a
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 17, 2021
6ac5725
update after PR #9185 updates
karthikeyann Sep 17, 2021
e863bc7
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into enh-grou…
karthikeyann Sep 17, 2021
f66fdd9
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into fea-shal…
karthikeyann Sep 18, 2021
e36b834
add boost license for hash_combine, move to diff header
karthikeyann Sep 18, 2021
a1ff894
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 18, 2021
1fbe3fc
Apply suggestions from code review (jake)
karthikeyann Sep 18, 2021
79ca5e5
Merge branches 'enh-groupby_cache_hashed' and 'fea-shallow_hash_colum…
karthikeyann Sep 18, 2021
fc3cc6b
include cleanup
karthikeyann Sep 18, 2021
eb2b0db
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 18, 2021
0593955
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 18, 2021
f7b6bb6
add missing include due to reorg
karthikeyann Sep 18, 2021
5b269ef
Merge branch 'enh-groupby_cache_hashed' of github.com:karthikeyann/cu…
karthikeyann Sep 18, 2021
7db9870
update groupby corr to use hashed result cache
karthikeyann Sep 18, 2021
5bb1dc4
Revert "set STRUCT_AGGS to CORRELATION"
karthikeyann Sep 18, 2021
fb98fd5
Revert "added ctypedef correlation_type. need to add tests"
karthikeyann Sep 18, 2021
324c37d
Revert "added definition of correlation() in cython"
karthikeyann Sep 18, 2021
9f19ddf
Apply suggestions from code review (jake)
karthikeyann Sep 20, 2021
ab955bb
enable result caching of child columns in correlation
karthikeyann Sep 20, 2021
98bbc94
fix duplicate {col, agg} request extract
karthikeyann Sep 20, 2021
e750cae
Merge branch 'enh-groupby_cache_hashed' of github.com:karthikeyann/cu…
karthikeyann Sep 20, 2021
b2bd176
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 20, 2021
9581525
address review comments
karthikeyann Sep 20, 2021
dcb0668
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 20, 2021
243490b
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into enh-grou…
karthikeyann Sep 20, 2021
8d71146
Merge branch 'enh-groupby_cache_hashed' of github.com:karthikeyann/cu…
karthikeyann Sep 20, 2021
1a5f367
Update cpp/src/column/column_view.cpp
karthikeyann Sep 21, 2021
8fa765c
Merge branch 'fea-shallow_hash_columnview' of github.com:karthikeyann…
karthikeyann Sep 22, 2021
3e41c64
Merge branch 'enh-groupby_cache_hashed' of github.com:karthikeyann/cu…
karthikeyann Sep 22, 2021
2821003
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
skirui-source Sep 22, 2021
1b84c25
Merge branch 'fea-sortgroupby_corr' of github.com:karthikeyann/cudf i…
skirui-source Sep 22, 2021
63af02d
add groupby correlation tests
karthikeyann Sep 24, 2021
14dd5bb
enable dict for sort groupby mean
karthikeyann Sep 24, 2021
b0fea02
update group_corr for null support
karthikeyann Sep 24, 2021
57db901
rename group_corr to group_correlation
karthikeyann Sep 24, 2021
0d1a91e
update doc
karthikeyann Sep 24, 2021
e10ca8c
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into enh-grou…
karthikeyann Sep 24, 2021
f6a56ce
Merge branch 'enh-groupby_cache_hashed' of github.com:karthikeyann/cu…
karthikeyann Sep 24, 2021
6cd47bc
minor comment corrections
karthikeyann Sep 27, 2021
4c2611b
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into fea-sort…
karthikeyann Sep 27, 2021
075ec73
add covariance, refactor correlation to use covariance
karthikeyann Sep 30, 2021
dad641e
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into fea-sort…
karthikeyann Sep 30, 2021
38e9ddc
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into fea-sort…
karthikeyann Sep 30, 2021
60532e8
create new PR
skirui-source Sep 30, 2021
6e6459d
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into fea-sort…
karthikeyann Oct 4, 2021
077a187
add more null cases for correlation tests
karthikeyann Oct 4, 2021
e3f47c1
add covariance tests
karthikeyann Oct 4, 2021
6703533
Merge branch 'branch-21.12' into fea-sortgroupby_corr
karthikeyann Oct 4, 2021
7ed3a45
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 5, 2021
dd00e0d
Merge branch 'branch-21.12' into fea-sortgroupby_corr
karthikeyann Oct 5, 2021
1483c46
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 6, 2021
482bcdf
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 6, 2021
9c5b81d
fixed merge conflict in result_cache.hpp
skirui-source Oct 6, 2021
dd21ec5
fixed merge conflict in groupby.cu
skirui-source Oct 6, 2021
bf30a22
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 6, 2021
8426f56
Apply suggestions from code review
karthikeyann Oct 8, 2021
22981de
Merge branch 'fea-sortgroupby_corr' of github.com:karthikeyann/cudf i…
skirui-source Oct 9, 2021
864bd84
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 9, 2021
9c5e821
Merge branch 'fea-sortgroupby_corr' of github.com:karthikeyann/cudf i…
skirui-source Oct 12, 2021
f7470d2
fixing multiindex to match pandas behavior
skirui-source Oct 13, 2021
d4e289c
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 13, 2021
6a96722
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 14, 2021
5e28938
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 15, 2021
7ec37e3
Merge branch 'branch-21.12' of github.com:rapidsai/cudf into pearson_…
karthikeyann Oct 18, 2021
a67a473
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 18, 2021
5b958ea
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 19, 2021
128242a
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 19, 2021
407b616
adding tests
skirui-source Oct 19, 2021
812ffe3
fixed merge conflict in group_correlation.cu
skirui-source Oct 19, 2021
a96f2d8
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 19, 2021
00b9578
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 20, 2021
c58cff3
added method parameter to corr()
skirui-source Oct 21, 2021
ed31b5a
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 27, 2021
0a0a935
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 27, 2021
56d2baa
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Oct 27, 2021
70be97b
create multiindex using groupby correlated index info
skirui-source Oct 28, 2021
f906b79
added tests - one, two, three columns cases
skirui-source Oct 28, 2021
1e0ebe5
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Oct 28, 2021
d800c89
added min_periods param. to cython layer
skirui-source Nov 3, 2021
db8b47f
create new_df from grouping keys data
skirui-source Nov 3, 2021
9f5dd11
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Nov 3, 2021
334bd03
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Nov 8, 2021
36baa30
updated copyright years
skirui-source Nov 9, 2021
54ef35b
added test for nulls and unsupoorted methods
skirui-source Nov 9, 2021
1e1431b
minor review-fixes
skirui-source Nov 10, 2021
2ecf951
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Nov 10, 2021
ab6cd95
added tests for: invalid types, empty dataframe and multiindex. All f…
skirui-source Nov 10, 2021
34d412e
added test for grouping by multiple columns, passes
skirui-source Nov 10, 2021
b642049
fixes multiindex to match pd for multiple groupings-cases
skirui-source Nov 10, 2021
0c2e17e
changes:call with ashwin-create MI for non empty results, capture run…
skirui-source Nov 10, 2021
124c576
all tests passing now
skirui-source Nov 10, 2021
6b51c82
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Nov 11, 2021
1beb0fb
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Nov 11, 2021
23bfff7
added corr() aggregation to cudf GroupBy docs
skirui-source Nov 12, 2021
ee5d30e
fixed copyright years in aggregation.pxd
skirui-source Nov 12, 2021
f3b85d1
minor review fixes- list comprehension, rm breakpoints
skirui-source Nov 12, 2021
0ea44ba
Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …
skirui-source Nov 15, 2021
b223925
apply @isvoid suggestions
skirui-source Nov 15, 2021
68dcb96
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Nov 17, 2021
20c9273
reversed copyright fix in cudf/_lib/aggregation.pxd
skirui-source Nov 17, 2021
af08150
use existing dataframe for corr() example
skirui-source Nov 17, 2021
cee2494
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Nov 17, 2021
2bdc33e
use existing dataframe for corr() example
skirui-source Nov 17, 2021
df616d0
noted that corr() is supported with decimals in the cudf docs
skirui-source Nov 17, 2021
94f1984
reversed copyright year in cudf/_lib/aggregation.pxd
skirui-source Nov 17, 2021
663a71b
addressed all reviews for groupby.py
skirui-source Nov 17, 2021
982d79d
Update python/cudf/cudf/core/groupby/groupby.py
shwina Nov 18, 2021
088eb74
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Nov 22, 2021
30e622e
Merge branch 'pearson_coeff' of github.com:skirui-source/cudf into pe…
skirui-source Nov 22, 2021
8ee70d1
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Nov 24, 2021
53465bb
addressed Vyas reviews
skirui-source Nov 30, 2021
f36ab44
updated API link with corr in api_docs/groupby.rst
skirui-source Nov 30, 2021
28d0a0a
.
skirui-source Nov 30, 2021
349c7a5
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Nov 30, 2021
10 changes: 10 additions & 0 deletions docs/cudf/source/basics/groupby.rst
@@ -127,6 +127,13 @@ Aggregations on groups is supported via the ``agg`` method:
     a
     1 4 1 2.0
     2 5 2 4.5
+    >>> df.groupby("a").corr(method="pearson")
+                b         c
+    a
+    1 b  1.000000  0.866025
+      c  0.866025  1.000000
+    2 b  1.000000  1.000000
+      c  1.000000  1.000000

 The following table summarizes the available aggregations and the types
 that support them:
@@ -169,6 +176,9 @@ that support them:
 +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+
 | unique                             | ✅        | ✅         | ✅       | ✅            |        |          |            |           |
 +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+
+| corr                               | ✅        |            |          |               |        |          |            | ✅        |
++------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+
+

 GroupBy apply
 -------------
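
As a cross-check on the documented output, the per-group Pearson coefficient can be reproduced with the standard library alone. The sketch below is not the libcudf implementation: `pearson` and `grouped_pearson` are hypothetical helper names, and null handling and `min_periods` are deliberately ignored. It uses the data from the `corr()` docstring example in `groupby.py`.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes no nulls and non-constant inputs (a constant input would
    make the denominator zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def grouped_pearson(keys, xs, ys):
    """Return {key: r} with one coefficient per distinct key."""
    groups = defaultdict(lambda: ([], []))
    for k, x, y in zip(keys, xs, ys):
        groups[k][0].append(x)
        groups[k][1].append(y)
    return {k: pearson(gx, gy) for k, (gx, gy) in groups.items()}


# Data copied from the corr() docstring example.
ids = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
val1 = [5, 4, 6, 4, 8, 7, 4, 5, 2]
val2 = [4, 5, 6, 1, 2, 9, 8, 5, 1]

result = grouped_pearson(ids, val1, val2)
for key in sorted(result):
    print(key, round(result[key], 6))
```

The three coefficients should agree with the `val1`/`val2` entries of the docstring's correlation matrix to six decimal places.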
55 changes: 53 additions & 2 deletions python/cudf/cudf/_lib/aggregation.pyx
@@ -1,6 +1,6 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.

-from enum import Enum
+from enum import Enum, IntEnum

 import numba
 import numpy as np
@@ -30,6 +30,7 @@ from cudf._lib.types import Interpolation

 cimport cudf._lib.cpp.aggregation as libcudf_aggregation
 cimport cudf._lib.cpp.types as libcudf_types
+from cudf._lib.cpp.aggregation cimport underlying_type_t_correlation_type

 import cudf

@@ -57,6 +58,22 @@ class AggregationKind(Enum):
     UNIQUE = libcudf_aggregation.aggregation.Kind.COLLECT_SET
     PTX = libcudf_aggregation.aggregation.Kind.PTX
     CUDA = libcudf_aggregation.aggregation.Kind.CUDA
+    CORRELATION = libcudf_aggregation.aggregation.Kind.CORRELATION
+
+
+class CorrelationType(IntEnum):
+    PEARSON = (
+        <underlying_type_t_correlation_type>
+        libcudf_aggregation.correlation_type.PEARSON
+    )
+    KENDALL = (
+        <underlying_type_t_correlation_type>
+        libcudf_aggregation.correlation_type.KENDALL
+    )
+    SPEARMAN = (
+        <underlying_type_t_correlation_type>
+        libcudf_aggregation.correlation_type.SPEARMAN
+    )


 cdef class Aggregation:
@@ -321,6 +338,22 @@ cdef class Aggregation:
         ))
         return agg

+    @classmethod
+    def corr(cls, method, libcudf_types.size_type min_periods):
+        cdef Aggregation agg = cls()
+        cdef libcudf_aggregation.correlation_type c_method = (
+            <libcudf_aggregation.correlation_type> (
+                <underlying_type_t_correlation_type> (
+                    CorrelationType[method.upper()]
+                )
+            )
+        )
+        agg.c_obj = move(
+            libcudf_aggregation.make_correlation_aggregation[aggregation](
+                c_method, min_periods
+            ))
+        return agg
+
 cdef class RollingAggregation:
     """A Cython wrapper for rolling window aggregations.

@@ -692,6 +725,24 @@ cdef class GroupbyAggregation:
         )
         return agg

+    @classmethod
+    def corr(cls, method, libcudf_types.size_type min_periods):
+        cdef GroupbyAggregation agg = cls()
+        cdef libcudf_aggregation.correlation_type c_method = (
+            <libcudf_aggregation.correlation_type> (
+                <underlying_type_t_correlation_type> (
+                    CorrelationType[method.upper()]
+                )
+            )
+        )
+        agg.c_obj = move(
+            libcudf_aggregation.
+            make_correlation_aggregation[groupby_aggregation](
+                c_method, min_periods
+            ))
+        return agg
+
+
 cdef class GroupbyScanAggregation:
     """A Cython wrapper for groupby scan aggregations.

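
Both `corr` classmethods in `aggregation.pyx` resolve the user-facing method string through `CorrelationType[method.upper()]`, that is, a case-insensitive lookup by enum member name. A pure-Python sketch of that lookup follows. The integer values are placeholders (the real ones come from `cudf::correlation_type`), and `resolve_method` with its `ValueError` is a hypothetical helper, not part of the PR.

```python
from enum import IntEnum


class CorrelationType(IntEnum):
    # Placeholder values for illustration only; in the PR these are
    # cast from libcudf's cudf::correlation_type enumerators.
    PEARSON = 0
    KENDALL = 1
    SPEARMAN = 2


def resolve_method(method: str) -> CorrelationType:
    """Map a method name to an enum member, ignoring case."""
    try:
        # Indexing an Enum class by string looks the member up by name.
        return CorrelationType[method.upper()]
    except KeyError:
        raise ValueError(f"unknown correlation method: {method!r}") from None


print(resolve_method("Pearson").name)
```

Without a wrapper like this, an unrecognised name surfaces as a bare `KeyError`; in the PR the Python-level `GroupBy.corr` rejects anything other than `"pearson"` before the Cython layer is reached.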
15 changes: 13 additions & 2 deletions python/cudf/cudf/_lib/cpp/aggregation.pxd
@@ -1,5 +1,5 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
-
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
+from libc.stdint cimport int32_t
 from libcpp.memory cimport unique_ptr
 from libcpp.string cimport string
 from libcpp.vector cimport vector
@@ -11,6 +11,7 @@ from cudf._lib.cpp.types cimport (
     size_type,
 )

+ctypedef int32_t underlying_type_t_correlation_type

 cdef extern from "cudf/aggregation.hpp" namespace "cudf" nogil:

@@ -38,6 +39,8 @@ cdef extern from "cudf/aggregation.hpp" namespace "cudf" nogil:
             COLLECT_SET 'cudf::aggregation::COLLECT_SET'
             PTX 'cudf::aggregation::PTX'
             CUDA 'cudf::aggregation::CUDA'
+            CORRELATION 'cudf::aggregation::CORRELATION'
+
         Kind kind

     cdef cppclass rolling_aggregation:
@@ -53,6 +56,11 @@ cdef extern from "cudf/aggregation.hpp" namespace "cudf" nogil:
         CUDA 'cudf::udf_type::CUDA'
         PTX 'cudf::udf_type::PTX'

+    ctypedef enum correlation_type:
+        PEARSON 'cudf::correlation_type::PEARSON'
+        KENDALL 'cudf::correlation_type::KENDALL'
+        SPEARMAN 'cudf::correlation_type::SPEARMAN'
+
     cdef unique_ptr[T] make_sum_aggregation[T]() except +

     cdef unique_ptr[T] make_product_aggregation[T]() except +
@@ -106,3 +114,6 @@ cdef extern from "cudf/aggregation.hpp" namespace "cudf" nogil:
         udf_type type,
         string user_defined_aggregator,
         data_type output_type) except +
+
+    cdef unique_ptr[T] make_correlation_aggregation[T](
+        correlation_type type, size_type min_periods) except +
4 changes: 2 additions & 2 deletions python/cudf/cudf/_lib/groupby.pyx
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.

 from collections import defaultdict

@@ -54,7 +54,7 @@ _CATEGORICAL_AGGS = {"COUNT", "SIZE", "NUNIQUE", "UNIQUE"}
 _STRING_AGGS = {"COUNT", "SIZE", "MAX", "MIN", "NUNIQUE", "NTH", "COLLECT",
                 "UNIQUE"}
 _LIST_AGGS = {"COLLECT"}
-_STRUCT_AGGS = set()
+_STRUCT_AGGS = {"CORRELATION"}
 _INTERVAL_AGGS = set()
 _DECIMAL_AGGS = {"COUNT", "SUM", "ARGMIN", "ARGMAX", "MIN", "MAX", "NUNIQUE",
                  "NTH", "COLLECT"}
120 changes: 119 additions & 1 deletion python/cudf/cudf/core/groupby/groupby.py
@@ -1,6 +1,7 @@
 # Copyright (c) 2020-2021, NVIDIA CORPORATION.

 import collections
+import itertools
 import pickle
 import warnings

@@ -13,7 +14,8 @@
 from cudf._typing import DataFrameOrSeries
 from cudf.api.types import is_list_like
 from cudf.core.abc import Serializable
-from cudf.core.column.column import arange
+from cudf.core.column.column import arange, as_column
+from cudf.core.index import _index_from_data
 from cudf.utils.utils import GetAttrGetItemMixin, cached_property


@@ -69,6 +71,8 @@ def __init__(
         """
         self.obj = obj
         self._as_index = as_index
+        self._by = by
+        self._level = level
         self._sort = sort
         self._dropna = dropna

@@ -777,6 +781,120 @@ def median(self):
         """Get the column-wise median of the values in each group."""
         return self.agg("median")

+    def corr(self, method="pearson", min_periods=1):
+        """
+        Compute pairwise correlation of columns, excluding NA/null values.
+
+        Parameters
+        ----------
+        method: {"pearson" (default), "kendall", "spearman"} or callable
+            Currently only the Pearson correlation coefficient is supported.
+
+        min_periods: int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            Correlation matrix.
+
+        Examples
+        --------
+        >>> import cudf
+        >>> gdf = cudf.DataFrame({
+        ...     "id": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
+        ...     "val1": [5, 4, 6, 4, 8, 7, 4, 5, 2],
+        ...     "val2": [4, 5, 6, 1, 2, 9, 8, 5, 1],
+        ...     "val3": [4, 5, 6, 1, 2, 9, 8, 5, 1]})
+        >>> gdf
+           id  val1  val2  val3
+        0   a     5     4     4
+        1   a     4     5     5
+        2   a     6     6     6
+        3   b     4     1     1
+        4   b     8     2     2
+        5   b     7     9     9
+        6   c     4     8     8
+        7   c     5     5     5
+        8   c     2     1     1
+        >>> gdf.groupby("id").corr(method="pearson")
+                     val1      val2      val3
+        id
+        a  val1  1.000000  0.500000  0.500000
+           val2  0.500000  1.000000  1.000000
+           val3  0.500000  1.000000  1.000000
+        b  val1  1.000000  0.385727  0.385727
+           val2  0.385727  1.000000  1.000000
+           val3  0.385727  1.000000  1.000000
+        c  val1  1.000000  0.714575  0.714575
+           val2  0.714575  1.000000  1.000000
+           val3  0.714575  1.000000  1.000000
+
+        """
+
+        if method.lower() not in ["pearson"]:
+            raise NotImplementedError(
+                "Only pearson correlation is currently supported"
+            )
+
+        # create expanded dataframe consisting of all combinations of the
+        # struct columns-pairs to be correlated
+        # i.e. (('col1', 'col1'), ('col1', 'col2'), ('col2', 'col2'))
+        _cols = self.grouping.values.columns.tolist()
+
+        new_df_data = {}
+        for x, y in itertools.combinations_with_replacement(_cols, 2):
+            new_df_data[(x, y)] = cudf.DataFrame._from_data(
+                {"x": self.obj._data[x], "y": self.obj._data[y]}
+            ).to_struct()
+        new_gb = cudf.DataFrame._from_data(new_df_data).groupby(
+            by=self.grouping.keys
+        )
+
+        try:
+            gb_corr = new_gb.agg(lambda x: x.corr(method, min_periods))
+        except RuntimeError as e:
+            if "Unsupported groupby reduction type-agg combination" in str(e):
+                raise TypeError(
+                    "Correlation accepts only numerical column-pairs"
+                )
+            raise
+
+        # ensure that column-pair labels are arranged in ascending order
+        cols_list = [
+            (_cols[j], _cols[i]) if i > j else (_cols[i], _cols[j])
+            for j, y in enumerate(_cols)
+            for i, x in enumerate(_cols)
+        ]
+        cols_split = [
+            cols_list[i : i + len(_cols)]
+            for i in range(0, len(cols_list), len(_cols))
+        ]
+
+        # interleave: combine the correlation results for each column-pair
+        # into a single column
+        res = cudf.DataFrame._from_data(
+            {
+                x: gb_corr.loc[:, i].interleave_columns()
+                for i, x in zip(cols_split, _cols)
+            }
+        )
+
+        # create a multiindex for the groupby correlated dataframe,
+        # to match pandas behavior
+        unsorted_idx = gb_corr.index.repeat(len(_cols))
+        idx_sort_order = unsorted_idx._get_sorted_inds()
+        sorted_idx = unsorted_idx._gather(idx_sort_order)
+        if len(gb_corr):
+            # TODO: Should the operation below be done on the CPU instead?
+            sorted_idx._data[None] = as_column(
+                cudf.Series(_cols).tile(len(gb_corr.index))
+            )
+        res.index = _index_from_data(sorted_idx._data)
+
+        return res
+
     def var(self, ddof=1):
         """Compute the column-wise variance of the values in each group.

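
The column-pair bookkeeping in `GroupBy.corr` can be hard to follow from the diff alone. A minimal standard-library sketch of the idea: only the upper-triangle pairs (diagonal included) are correlated, and the full symmetric matrix is then filled in by looking up each cell under its position-ordered label. The names `computed_pairs`, `grid`, and `matrix` are illustrative, and the numbers are made-up stand-ins for correlation results.

```python
import itertools

cols = ["val1", "val2", "val3"]

# Upper-triangle pairs, diagonal included: the only pairs that get computed.
computed_pairs = list(itertools.combinations_with_replacement(cols, 2))

# Stand-in results: one made-up number per computed pair.
values = {pair: float(k) for k, pair in enumerate(computed_pairs)}

# Full n x n grid of labels. Each label is ordered by column position
# (earlier column first), so it always matches a key in computed_pairs.
n = len(cols)
grid = [
    [(cols[min(i, j)], cols[max(i, j)]) for i in range(n)]
    for j in range(n)
]

# Fill the symmetric matrix by lookup; nothing below the diagonal is
# computed separately.
matrix = [[values[label] for label in row] for row in grid]

for name, row in zip(cols, matrix):
    print(name, row)
```

For `n` columns this computes `n * (n + 1) / 2` pairs instead of `n * n`. Note that "ascending order" in the PR comment refers to column position in `_cols`, not to lexicographic order of the names.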