
[FEA] Groupby correlation (Pearson) #8691

Closed
beckernick opened this issue Jul 8, 2021 · 10 comments · Fixed by #9154 or #9166
Labels
feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API)

Comments

@beckernick
Member

beckernick commented Jul 8, 2021

I'd like to be able to calculate the correlation of my value columns on a per-group basis (groupby correlation). As an example, a data scientist might hypothesize that the sales or usage patterns of two products are more or less correlated on certain days of the week than others, which could be valuable information. To run that analysis, they'd like to do something like a groupby correlation where the key is the day of the week and the value columns are the sales (usage) patterns of each product.

As noted in #1267 (comment), supporting correlation (and, implicitly, covariance) in the groupby machinery may require additional design. Unlike an aggregation such as sum, which operates on a single column, correlation operates on two columns, so the aggregation takes more than one input. In Spark, the corr function takes two inputs and returns the per-group correlation of the input columns. In Pandas, corr returns the full pairwise correlation matrix using all columns in the dataframe.

Today, Spark only supports Pearson correlation, which is the default in pandas (though pandas supports additional methods).

Examples below.

Pandas:

import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10,3,4,2,-3,9,10],
    "b": [10,23,-4,2,-3,9,19],
    "c": [10,-23,-4,21,-3,19,19],
})

print(df.groupby("key").corr())
              a         b         c
key                                
0   a  1.000000  0.077471  0.185581
    b  0.077471  1.000000 -0.604482
    c  0.185581 -0.604482  1.000000
1   a  1.000000  0.920285  0.997609
    b  0.920285  1.000000  0.891042
    c  0.997609  0.891042  1.000000

Spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

sdf.groupby("key").agg(F.corr("a", "b")).show()  # could just as easily be Spark SQL
+---+------------------+
|key|        corr(a, b)|
+---+------------------+
|  0|0.0774711279979589|
|  1|0.9202846173114504|
+---+------------------+
beckernick added the feature request, libcudf, and Python labels on Jul 8, 2021
@jrhemstad
Contributor

Like @beckernick mentioned, this may require some thinking about if/how the groupby C++ interface needs to change. Currently, groupby::aggregate takes a list of aggregation_requests, each of which is just a column and a list of aggregations to perform on that column:

struct aggregation_request {
  column_view values;                                      ///< The elements to aggregate
  std::vector<std::unique_ptr<aggregation>> aggregations;  ///< Desired aggregations
};

Correlation is unique in that it doesn't operate on just a single column, but instead on two columns.

Initial ideas:

  • Make aggregation_request take multiple columns?
    • This just kind of feels wrong. If a request allows multiple columns, there's no easy way to map an aggregation to the number of columns it expects
  • Make the column you're correlating against an argument to the pearson_correlation aggregation. e.g., if we wanted to do a groupby pearson correlation of col_a with col_b it might look like:
auto agg = make_pearson_agg(col_b);
auto req = aggregation_request(col_a, {agg}); // form a request to do a pearson agg of col_a against b
groupby(keys).aggregate(req);

The only thing I don't like about this is that there's an implicit requirement that col_a and col_b have the same size/type, and especially the same row order, which isn't explicitly enforced.

I'll keep thinking on it as there may be better ideas yet.

@revans2
Contributor

revans2 commented Jul 9, 2021

Another option might be to have the correlation take a non-nullable struct column. If you were just operating on column views it would be very fast because it is just metadata pointing to the child columns.

@jrhemstad
Contributor

I like the idea of making the values column a struct column. That solves my complaint above about enforcing the columns be linked.

beckernick added this to the Time Series Analysis milestone on Jul 27, 2021
skirui-source self-assigned this on Aug 26, 2021
@karthikeyann
Contributor

karthikeyann commented Aug 27, 2021

Assuming the input view is a non-nullable struct column (depth=1), the output is also a non-nullable struct column of size num_groups*num_children. (All child columns are of double type.)

Pearson correlation needs MEAN, COUNT_VALID, STD. Struct support is not available yet for these aggregations.
Also, each final result column is interleaved.

One idea is to create a flattened iterator of this non-nullable struct view and use reduce_by_key instead of working on each child column individually.
I am working on this idea (sort groupby).
The final interleaving can be done by a scatter.
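
For what it's worth, here is a tiny CPU-side analog of that flattening idea (pandas standing in for a thrust-style reduce_by_key; this is not libcudf code, and the column names just reuse the example frame from the issue description): flattening the struct children into one sequence keyed by (group key, child) lets a single keyed reduction produce every per-child mean in one pass.

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
    "c": [10, -23, -4, 21, -3, 19, 19],
})

# Flatten the struct children (a, b, c) into one long sequence keyed by
# (group key, child name)...
flat = df.melt(id_vars="key", value_vars=["a", "b", "c"],
               var_name="child", value_name="value")

# ...so one keyed reduction covers every child column in a single pass.
# On the GPU this would be a reduce_by_key over the flattened iterator;
# a scatter would then interleave the results into the output layout.
means = flat.groupby(["key", "child"])["value"].mean()
print(means)
```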

karthikeyann self-assigned this on Aug 27, 2021
@jrhemstad
Contributor

@karthikeyann I don't understand your comment.

Are you saying there is an issue with the proposed solution of passing the aggregated values as a struct column?

@karthikeyann
Contributor

No, the struct column sounds perfect for this purpose. I am just mentioning the idea I am working on to implement correlation (and welcoming any feedback).

Input: struct{col_a(size), col_b(size), ...N}
Output: struct{col_a(group_size*N), ...N};

Because of the libcudf limitation that MEAN, COUNT_VALID, and STD don't support struct columns yet, we can't directly call MEAN on this struct view.
So the comment above is an idea to overcome this limitation (and also maximize parallelism).

@jrhemstad
Contributor

jrhemstad commented Aug 27, 2021

I am confused. Here's what I understand:

The input is a struct column with exactly two children: X and Y (the two populations to correlate).

The output should then be just a singular value per group.

We shouldn't need to be computing mean/count/std on structs. I would first compute the covariance of X,Y per group (which decomposes into computing the mean of X and Y), then the stddev per group in X and Y, then finally put it all together into the Pearson Correlation.
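
A quick sanity check of that decomposition with plain pandas (not libcudf code), reusing the example frame from the issue description: per group, corr(a, b) equals cov(a, b) / (std(a) * std(b)).

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
})
g = df.groupby("key")

# Per-group covariance and standard deviations (all with ddof=1), then
# Pearson correlation = cov(a, b) / (std(a) * std(b)).
cov_ab = g.apply(lambda grp: grp["a"].cov(grp["b"]))
corr_manual = cov_ab / (g["a"].std() * g["b"].std())
corr_direct = g.apply(lambda grp: grp["a"].corr(grp["b"]))

print(corr_manual)  # key 0: 0.077471, key 1: 0.920285
print(corr_direct)  # matches the decomposed computation
```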

@jrhemstad
Contributor

jrhemstad commented Aug 27, 2021

I see what the confusion is. Pandas in its infinite wisdom will do the cross-product of all pair-wise correlations among the aggregated columns (including redundantly having (a,b) and (b,a)). That's not something we're going to try and support directly in libcudf.

In the cuDF Python layer, that aggregation call can be translated into a set of pair-wise correlation aggregation requests (a,a), (a,b), (a,c), etc., with the necessary massaging done to get the ordering to match the Pandas result. (Personally, I'd trim out the (self,self) aggregations since those will always be 1.)
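
As an illustration of that translation (a sketch only, not the actual cuDF Python internals; pairwise_corr and groupby_corr_matrix are hypothetical helpers, with pandas standing in for the libcudf requests), the call can be expanded into upper-triangle pairs and the results reassembled into the pandas-style block matrix:

```python
import itertools
import numpy as np
import pandas as pd

def pairwise_corr(df, key, x, y):
    # Hypothetical stand-in for one libcudf groupby correlation request
    # on the struct column {x, y}.
    return df.groupby(key).apply(lambda grp: grp[x].corr(grp[y]))

def groupby_corr_matrix(df, key, value_cols):
    # Request only the upper triangle: (x, x) is always 1 and (y, x)
    # mirrors (x, y).
    pair_results = {
        (x, y): pairwise_corr(df, key, x, y)
        for x, y in itertools.combinations(value_cols, 2)
    }
    # Massage the per-pair results into the pandas-style block matrix.
    blocks = {}
    for grp_key in df[key].unique():
        mat = pd.DataFrame(np.ones((len(value_cols), len(value_cols))),
                           index=value_cols, columns=value_cols)
        for (x, y), series in pair_results.items():
            mat.loc[x, y] = mat.loc[y, x] = series.loc[grp_key]
        blocks[grp_key] = mat
    return pd.concat(blocks, names=[key, None])

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
    "c": [10, -23, -4, 21, -3, 19, 19],
})
print(groupby_corr_matrix(df, "key", ["a", "b", "c"]))
```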

@karthikeyann
Contributor

karthikeyann commented Aug 27, 2021

Thanks. That clarifies the confusion.

If the Python layer sends multiple requests with combinations of all columns (e.g. {a,b}, {b,c}, {c,a}), MEAN and COUNT_VALID are not computed through the cache for the child columns (a limitation of the cache in sort groupby).

We could directly call group_mean() and group_count_valid(), but then MEAN and COUNT_VALID are computed every time (in this example, twice); they're not cached.

@jrhemstad
Contributor

Refactoring the groupby cache to make use of #9140 and #9139 should resolve those concerns.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2021
Add sort-groupby covariance and Pearson correlation in libcudf 
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as the 2 child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`)

```
covariance = Sum((x - mean_x) * (y - mean_y)) / (group_size - ddof)
Pearson correlation = covariance / x_stddev / y_stddev
```

Both x and y values should be non-null.
mean, stddev, and count should be calculated only on the rows where both columns are non-null.

mean, stddev, and count of the child columns are cached.
One limitation: when the two nullable columns have non-identical null masks, the cached results (mean, stddev, count) over the common valid rows cannot be reused, because the bitmask_and result null mask goes out of scope and a new null mask is created for another pair of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec

URL: #9154
karthikeyann reopened this on Oct 18, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 26, 2021

Addresses part of #8691
Add min_periods and ddof parameters to libcudf groupby covariance and Pearson correlation (python needs this)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9492
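
For context, here is a small pandas analog (not libcudf code; the exact libcudf semantics are the ones added in #9492) of what the two parameters control: ddof adjusts the covariance divisor (group_size - ddof), and min_periods suppresses results for groups with too few complete (x, y) pairs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": [0, 0, 0, 0, 1, 1, 1],
    "a": [10.0, 3.0, 4.0, 2.0, -3.0, 9.0, 10.0],
    "b": [10.0, 23.0, -4.0, 2.0, -3.0, 9.0, np.nan],
})
g = df.groupby("key")

# ddof: sample covariance (ddof=1, the default) vs. population covariance (ddof=0).
print(g.cov(ddof=1))
print(g.cov(ddof=0))

# min_periods: group 1 has only two complete (a, b) pairs, so requiring at
# least three pairs turns its correlation into NaN.
print(g.corr())
print(g.corr(min_periods=3))
```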