[FEA] Groupby correlation (Pearson) #8691
Like @beckernick mentioned, this may require some thinking on if/how this fits the existing `aggregation_request` design (the referenced snippet was cudf/cpp/include/cudf/groupby.hpp, lines 58 to 61 at a5427d2).
Correlation is unique in that it's not operating on just a single column, but on two columns. Initial ideas: [...]
The only thing I don't like about this is that there's an implicit requirement [...]. I'll keep thinking on it, as there may be better ideas yet.
Another option might be to have the correlation aggregation take a non-nullable struct column. If you were just operating on column views, it would be very fast because a struct view is just metadata pointing to the child columns.
I like the idea of making the input a struct column.
Assuming the input view is a non-nullable struct column (depth = 1), the output is also a non-nullable struct column sized to the number of groups. Pearson correlation needs MEAN, COUNT_VALID, and STD. Struct support is not available yet for these aggregations. One idea is to create a flattened iterator over this non-nullable struct view and feed it to the existing aggregation code paths.
@karthikeyann I don't understand your comment. Are you saying there is an issue with the proposed solution of passing the aggregated columns as a struct column?
No. A struct column sounds perfect for this purpose. I am just mentioning the idea that I am working on to implement correlation (and am expecting feedback).
Because of the libcudf limitation that MEAN, COUNT_VALID, and STD don't support struct columns yet, we can't directly call MEAN on this struct view.
I am confused. Here's what I understand: the input is a struct column with exactly two children, X and Y (the two populations to correlate). The output should then be just a single value per group. We shouldn't need to compute mean/count/std on structs. I would first compute the covariance of X and Y per group (which decomposes into computing the means of X and Y), then the stddev per group of X and Y, then finally put it all together into the Pearson correlation.
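A minimal NumPy sketch of that decomposition for a single group (function and variable names are illustrative, not the libcudf API):

```python
import numpy as np

def pearson_corr(x: np.ndarray, y: np.ndarray, ddof: int = 1) -> float:
    """Illustrative per-group Pearson correlation via the decomposition above."""
    n = len(x)
    mean_x, mean_y = x.mean(), y.mean()
    # Covariance: Sum((x - mean_x) * (y - mean_y)) / (n - ddof)
    cov = ((x - mean_x) * (y - mean_y)).sum() / (n - ddof)
    # Per-population standard deviations with the same ddof
    std_x = x.std(ddof=ddof)
    std_y = y.std(ddof=ddof)
    return cov / (std_x * std_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 4.0])
# The ddof terms cancel between the covariance and the stddev product,
# so this matches NumPy's own Pearson correlation.
assert np.isclose(pearson_corr(x, y), np.corrcoef(x, y)[0, 1])
```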
I see what the confusion is. Pandas in its infinite wisdom will do the cross-product of all pair-wise correlations among the aggregated columns (including redundantly having (a,b) and (b,a)). That's not something we're going to try and support directly in libcudf. In the cuDF Python layer, that aggregation call can be translated into a set of pair-wise correlation aggregation requests: (a,a), (a,b), (a,c), etc. (Personally, I'd trim out the (self,self) aggregations since those will always be 1.) And do the necessary massaging to get the ordering like the Pandas result.
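A hedged sketch of how a Python layer could expand one pandas-style groupby-`corr` call into pair-wise requests (the helper name is hypothetical, not cuDF's actual implementation):

```python
from itertools import combinations_with_replacement

def expand_corr_requests(value_columns):
    """Build the pair-wise (x, y) requests a single pandas-style
    groupby-corr call implies, trimming the (self, self) pairs
    because their correlation is always 1."""
    pairs = []
    for x, y in combinations_with_replacement(value_columns, 2):
        if x != y:  # (a, a) is always 1; skip it
            pairs.append((x, y))
    return pairs

# For columns a, b, c this yields (a, b), (a, c), (b, c); the redundant
# transposed pairs (b, a), (c, a), (c, b) can be filled in afterwards
# when assembling the pandas-shaped result matrix.
print(expand_corr_requests(["a", "b", "c"]))
```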
Thanks, that clarifies the confusion. If the Python layer sends multiple requests with combinations of all columns (e.g. {a,b}, {b,c}, {c,a}), we could directly call the pair-wise correlation aggregation for each request.
Add sort-groupby covariance and Pearson correlation in libcudf

Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
Depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as the two child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`):

```
covariance = Sum((x - mean_x) * (y - mean_y)) / (group_size - ddof)
Pearson correlation = covariance / xstddev / ystddev
```

The x and y values should both be non-null; mean, stddev, and count are calculated only on the rows where both columns are valid. The mean, stddev, and count of the child columns are cached. One limitation: when the two columns have non-identical null masks, the cached results (mean, stddev, count) for the common valid rows cannot be reused, because the null mask produced by `bitmask_and` goes out of scope and a new null mask is created for the next pair of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec

URL: #9154
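A small NumPy sketch of the common-valid-row rule described above (only rows where both x and y are valid contribute; names are illustrative, not the libcudf implementation):

```python
import numpy as np

def groupwise_cov(x, y, valid_x, valid_y, ddof=1):
    """Covariance over the rows where *both* columns are valid,
    mirroring the bitmask_and semantics described above."""
    both = valid_x & valid_y          # analogue of bitmask_and on null masks
    xv, yv = x[both], y[both]
    n = both.sum()
    return ((xv - xv.mean()) * (yv - yv.mean())).sum() / (n - ddof)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])
valid_x = np.array([True, True, False, True])   # x[2] is null
valid_y = np.array([True, False, True, True])   # y[1] is null
# Only rows 0 and 3 are common-valid, so group_size == 2 here.
print(groupwise_cov(x, y, valid_x, valid_y))
```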
Addresses part of #8691

Add `min_periods` and `ddof` parameters to libcudf groupby covariance and Pearson correlation (the Python layer needs them).

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Devavret Makkar (https://github.com/devavret)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9492
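For reference, pandas exposes the same two knobs on its groupby covariance; a short example of their semantics (assuming pandas' documented defaults):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "x":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "y":   [2.0, 4.0, 5.0, 1.0, 3.0],
})

# ddof is the delta degrees of freedom in (group_size - ddof);
# ddof=1 (the default) gives the sample covariance.
print(df.groupby("key").cov(ddof=1))

# min_periods: groups with fewer than this many common-valid
# observations yield NaN instead of a low-sample estimate.
print(df.groupby("key").cov(min_periods=3))
```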
Fixes: #8691

Authors:
- Sheilah Kirui (https://github.com/skirui-source)
- Karthikeyan (https://github.com/karthikeyann)
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Michael Wang (https://github.com/isVoid)
- Mayank Anand (https://github.com/mayankanand007)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9166
I'd like to be able to calculate the correlation of my value columns on a per-group basis (groupby correlation). As an example, a data scientist might hypothesize that the sales or usage patterns of two products are more or less correlated on certain days of the week than others, which could be valuable information. To run that analysis, they'd like to do something like a groupby correlation where the key is the day of the week and the value columns are the sales (usage) patterns of each product.
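A pandas sketch of that workflow (column names and data are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Mon", "Tue", "Tue", "Tue"],
    "product_a":   [10.0, 12.0, 11.0, 30.0, 28.0, 35.0],
    "product_b":   [20.0, 23.0, 21.0, 14.0, 16.0, 12.0],
})

# Per-day-of-week pairwise correlation of the two products' sales.
per_day_corr = sales.groupby("day_of_week")[["product_a", "product_b"]].corr()
print(per_day_corr)
```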
As noted in #1267 (comment), supporting correlation (and implicitly covariance) in the groupby machinery might require additional design. Unlike something like sum, which operates on a single column, correlation operates on two columns, so the aggregation takes more than one input. In Spark, the `corr` function takes two inputs and returns the per-group correlation of the input columns. In pandas, `corr` will return the full pairwise correlation matrix using all columns in the dataframe. Today, Spark only supports Pearson correlation, which is the default in pandas (though pandas supports additional methods).
Examples below.
Pandas:
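The original example was not preserved in this thread; a minimal stand-in for the pandas behavior described above:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b"],
    "x":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y":   [2.0, 4.0, 5.0, 1.0, 3.0, 2.0],
})

# pandas returns the full pairwise correlation matrix per group,
# including the redundant (x, y) / (y, x) and trivial (x, x) entries.
print(df.groupby("key").corr())
```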
Spark:
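Likewise a stand-in for the lost Spark example, showing that `corr` takes exactly two columns and yields one value per group (illustrative data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Mon", 10.0, 20.0), ("Mon", 12.0, 23.0),
     ("Tue", 30.0, 14.0), ("Tue", 28.0, 16.0)],
    ["day_of_week", "product_a", "product_b"],
)

# corr takes two columns and returns the per-group Pearson correlation.
df.groupBy("day_of_week").agg(
    F.corr("product_a", "product_b").alias("corr_ab")
).show()
```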