
[FEA] Groupby correlation (Pearson) #8691

Closed
beckernick opened this issue Jul 8, 2021 · 10 comments · Fixed by #9154 or #9166
Labels
feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API)

Comments

@beckernick
Member

beckernick commented Jul 8, 2021

I'd like to be able to calculate the correlation of my value columns on a per-group basis (groupby correlation). As an example, a data scientist might hypothesize that the sales or usage patterns of two products are more or less correlated on certain days of the week than others, which could be valuable information. To run that analysis, they'd like to do something like a groupby correlation where the key is the day of the week and the value columns are the sales (usage) patterns of each product.

As noted in #1267 (comment), supporting correlation (and, implicitly, covariance) in the groupby machinery may require additional design. Unlike an aggregation such as sum, which operates on a single column, correlation operates on two columns, so the aggregation takes more than one input. In Spark, the corr function takes two inputs and returns the per-group correlation of the input columns. In Pandas, corr returns the full pairwise correlation matrix using all columns in the dataframe.

Today, Spark only supports Pearson correlation, which is the default in pandas (though pandas supports additional methods).

Examples below.

Pandas:

import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10,3,4,2,-3,9,10],
    "b": [10,23,-4,2,-3,9,19],
    "c": [10,-23,-4,21,-3,19,19],
})

print(df.groupby("key").corr())
              a         b         c
key                                
0   a  1.000000  0.077471  0.185581
    b  0.077471  1.000000 -0.604482
    c  0.185581 -0.604482  1.000000
1   a  1.000000  0.920285  0.997609
    b  0.920285  1.000000  0.891042
    c  0.997609  0.891042  1.000000

Spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

sdf.groupby("key").agg(F.corr("a", "b")).show()  # could just as easily be Spark SQL
+---+------------------+
|key|        corr(a, b)|
+---+------------------+
|  0|0.0774711279979589|
|  1|0.9202846173114504|
+---+------------------+
beckernick added the feature request, libcudf, and Python labels on Jul 8, 2021
@jrhemstad
Contributor

Like @beckernick mentioned, this may require some thinking about if/how the groupby C++ interface needs to change. Currently, groupby::aggregate takes a list of aggregation_requests, each of which is just a column and a list of aggregations to perform on that column:

struct aggregation_request {
  column_view values;                                      ///< The elements to aggregate
  std::vector<std::unique_ptr<aggregation>> aggregations;  ///< Desired aggregations
};

Correlation is unique in that it doesn't operate on just a single column, but instead on two columns.

Initial ideas:

  • Make aggregation_request take multiple columns?
    • This just kind of feels wrong. If a request allows multiple columns, there's no easy way to map an aggregation to the number of columns it expects
  • Make the column you're correlating against an argument to the pearson_correlation aggregation. e.g., if we wanted to do a groupby pearson correlation of col_a with col_b it might look like:
auto agg = make_pearson_agg(col_b);
auto req = aggregation_request(col_a, {agg}); // form a request to do a pearson agg of col_a against b
groupby(keys).aggregate(req);

The only thing I don't like about this is that there's an implicit requirement that col_a and col_b have the same size/type, and especially the same row order, which isn't explicitly enforced.

I'll keep thinking on it as there may be better ideas yet.

@revans2
Contributor

revans2 commented Jul 9, 2021

Another option might be to have the correlation take a non-nullable struct column. If you were just operating on column views it would be very fast because it is just metadata pointing to the child columns.

@jrhemstad
Contributor

I like the idea of making the values column a struct column. That solves my complaint above about enforcing the columns be linked.

beckernick added this to the Time Series Analysis milestone on Jul 27, 2021
skirui-source self-assigned this on Aug 26, 2021
@karthikeyann
Contributor

karthikeyann commented Aug 27, 2021

Assuming the input view is a non-nullable struct column (depth=1), the output is also a non-nullable struct column of size num_groups*num_children. (All child columns are of double type.)

Pearson correlation needs MEAN, COUNT_VALID, STD. Struct support is not available yet for these aggregations.
Also, each final result column is interleaved.

One idea is to create a flattened iterator of this non-nullable struct view and use reduce_by_key instead of working on each child column individually.
I am working on this idea (sort groupby).
The final interleaving can be done by a scatter.
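
For what it's worth, here is a tiny CPU-side analog of that flattening idea (pandas standing in for a thrust-style reduce_by_key; this is not libcudf code, and the column names just reuse the example frame from the issue description): flattening the struct children into one sequence keyed by (group key, child) lets a single keyed reduction produce every per-child mean in one pass.

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
    "c": [10, -23, -4, 21, -3, 19, 19],
})

# Flatten the struct children (a, b, c) into one long sequence keyed by
# (group key, child name)...
flat = df.melt(id_vars="key", value_vars=["a", "b", "c"],
               var_name="child", value_name="value")

# ...so one keyed reduction covers every child column in a single pass.
# On the GPU this would be a reduce_by_key over the flattened iterator;
# a scatter would then interleave the results into the output layout.
means = flat.groupby(["key", "child"])["value"].mean()
print(means)
```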

karthikeyann self-assigned this on Aug 27, 2021
@jrhemstad
Contributor

@karthikeyann I don't understand your comment.

Are you saying there is an issue with the proposed solution of passing the aggregated values as a struct column?

@karthikeyann
Contributor

No, the struct column sounds perfect for this purpose. I am just mentioning the idea I am working on to implement correlation (and welcoming any feedback).

Input: struct{col_a(size), col_b(size), ...N}
Output: struct{col_a(group_size*N), ...N};

Because of the libcudf limitation that MEAN, COUNT_VALID, and STD don't support struct columns yet, we can't directly call MEAN on this struct view.
So the comment above is an idea to overcome this limitation (and also maximize parallelism).

@jrhemstad
Contributor

jrhemstad commented Aug 27, 2021

I am confused. Here's what I understand:

The input is a struct column with exactly two children: X and Y (the two populations to correlate).

The output should then be just a singular value per group.

We shouldn't need to be computing mean/count/std on structs. I would first compute the covariance of X,Y per group (which decomposes into computing the mean of X and Y), then the stddev per group in X and Y, then finally put it all together into the Pearson Correlation.
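
A quick sanity check of that decomposition with plain pandas (not libcudf code), reusing the example frame from the issue description: per group, corr(a, b) equals cov(a, b) / (std(a) * std(b)).

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
})
g = df.groupby("key")

# Per-group covariance and standard deviations (all with ddof=1), then
# Pearson correlation = cov(a, b) / (std(a) * std(b)).
cov_ab = g.apply(lambda grp: grp["a"].cov(grp["b"]))
corr_manual = cov_ab / (g["a"].std() * g["b"].std())
corr_direct = g.apply(lambda grp: grp["a"].corr(grp["b"]))

print(corr_manual)  # key 0: 0.077471, key 1: 0.920285
print(corr_direct)  # matches the decomposed computation
```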

@jrhemstad
Contributor

jrhemstad commented Aug 27, 2021

I see what the confusion is. Pandas in its infinite wisdom will do the cross-product of all pair-wise correlations among the aggregated columns (including redundantly having (a,b) and (b,a)). That's not something we're going to try and support directly in libcudf.

In the cuDF Python layer, that aggregation call can be translated into a set of pair-wise correlation aggregation requests (a,a), (a,b), (a,c), etc., with the necessary massaging done to get the ordering to match the Pandas result. (Personally, I'd trim out the (self,self) aggregations since those will always be 1.)
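
As an illustration of that translation (a sketch only, not the actual cuDF Python internals; pairwise_corr and groupby_corr_matrix are hypothetical helpers, with pandas standing in for the libcudf requests), the call can be expanded into upper-triangle pairs and the results reassembled into the pandas-style block matrix:

```python
import itertools
import numpy as np
import pandas as pd

def pairwise_corr(df, key, x, y):
    # Hypothetical stand-in for one libcudf groupby correlation request
    # on the struct column {x, y}.
    return df.groupby(key).apply(lambda grp: grp[x].corr(grp[y]))

def groupby_corr_matrix(df, key, value_cols):
    # Request only the upper triangle: (x, x) is always 1 and (y, x)
    # mirrors (x, y).
    pair_results = {
        (x, y): pairwise_corr(df, key, x, y)
        for x, y in itertools.combinations(value_cols, 2)
    }
    # Massage the per-pair results into the pandas-style block matrix.
    blocks = {}
    for grp_key in df[key].unique():
        mat = pd.DataFrame(np.ones((len(value_cols), len(value_cols))),
                           index=value_cols, columns=value_cols)
        for (x, y), series in pair_results.items():
            mat.loc[x, y] = mat.loc[y, x] = series.loc[grp_key]
        blocks[grp_key] = mat
    return pd.concat(blocks, names=[key, None])

df = pd.DataFrame({
    "key": [0]*4 + [1]*3,
    "a": [10, 3, 4, 2, -3, 9, 10],
    "b": [10, 23, -4, 2, -3, 9, 19],
    "c": [10, -23, -4, 21, -3, 19, 19],
})
print(groupby_corr_matrix(df, "key", ["a", "b", "c"]))
```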

@karthikeyann
Contributor

karthikeyann commented Aug 27, 2021

Thanks. That clarifies the confusion.

If the Python layer sends multiple requests with combinations of all columns (e.g. {a,b}, {b,c}, {c,a}), MEAN and COUNT_VALID are not computed through the cache for the child columns (a limitation of the cache in sort groupby).

We could directly call group_mean() and group_count_valid(), but then MEAN and COUNT_VALID are computed every time (in this example, twice); they're not cached.

@jrhemstad
Contributor

Refactoring the groupby cache to make use of #9140 and #9139 should resolve those concerns.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2021
Add sort-groupby covariance and Pearson correlation in libcudf 
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as the 2 child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`)

```
covariance = Sum((x - mean_x) * (y - mean_y)) / (group_size - ddof)
Pearson correlation = covariance / x_stddev / y_stddev
```

Both x and y values should be non-null.
mean, stddev, and count should be calculated only on the rows where both columns are non-null.

mean, stddev, and count of the child columns are cached.
One limitation: when the two nullable columns have non-identical null masks, the cached results (mean, stddev, count) over the common valid rows cannot be reused, because the bitmask_and result null mask goes out of scope and a new null mask is created for another pair of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec

URL: #9154
karthikeyann reopened this on Oct 18, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 26, 2021

Addresses part of #8691
Add min_periods and ddof parameters to libcudf groupby covariance and Pearson correlation (python needs this)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9492
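
For context, here is a small pandas analog (not libcudf code; the exact libcudf semantics are the ones added in #9492) of what the two parameters control: ddof adjusts the covariance divisor (group_size - ddof), and min_periods suppresses results for groups with too few complete (x, y) pairs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": [0, 0, 0, 0, 1, 1, 1],
    "a": [10.0, 3.0, 4.0, 2.0, -3.0, 9.0, 10.0],
    "b": [10.0, 23.0, -4.0, 2.0, -3.0, 9.0, np.nan],
})
g = df.groupby("key")

# ddof: sample covariance (ddof=1, the default) vs. population covariance (ddof=0).
print(g.cov(ddof=1))
print(g.cov(ddof=0))

# min_periods: group 1 has only two complete (a, b) pairs, so requiring at
# least three pairs turns its correlation into NaN.
print(g.corr())
print(g.corr(min_periods=3))
```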