Implement all methods of groupby rank aggregation in libcudf, python #9569
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##       branch-22.06    #9569    +/-   ##
==========================================
+ Coverage     86.36%   86.40%   +0.03%
==========================================
  Files           142      142
  Lines         22302    22312      +10
==========================================
+ Hits          19261    19278      +17
+ Misses         3041     3034       -7
==========================================
```
On this question, I should point out that all the SQL engines that I have tested/checked (including Hive, Impala, Spark, Presto, Drill, Oracle, and MySQL) follow the ANSI SQL conventions detailed here:
Here's an illustration from MySQL:

```sql
mysql> select *,
              rank() over (order by num_legs) as `rank`,
              percent_rank() over (order by num_legs) as `percent_rank`
       from animals;
+---------+----------+------+--------------+
| animal  | num_legs | rank | percent_rank |
+---------+----------+------+--------------+
| snake   |     NULL |    1 |            0 |
| penguin |        2 |    2 |         0.25 |
| cat     |        4 |    3 |          0.5 |
| dog     |        4 |    3 |          0.5 |
| spider  |        8 |    5 |            1 |
+---------+----------+------+--------------+
```

My guess is that DaskSQL would prefer to follow the same convention in the future.

```python
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
```

When/if …
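The MySQL `percent_rank` column above can be reproduced in pandas by applying the ANSI SQL formula to a MIN-style rank by hand. This is a sketch; the `animals` frame is reconstructed from the MySQL output shown above:

```python
import pandas as pd

# Reconstruction of the `animals` table from the MySQL output above.
animals = pd.DataFrame(
    {"animal": ["snake", "penguin", "cat", "dog", "spider"],
     "num_legs": [None, 2, 4, 4, 8]}
)

# ANSI SQL: percent_rank = (rank - 1) / (rows_in_partition - 1), using
# MIN-style tie-breaking with NULLs sorting first (na_option='top').
min_rank = animals["num_legs"].rank(method="min", na_option="top")
animals["sql_percent_rank"] = (min_rank - 1) / (len(animals) - 1)
print(animals["sql_percent_rank"].tolist())  # [0.0, 0.25, 0.5, 0.5, 1.0]
```

The result matches the MySQL `percent_rank` column exactly, confirming that the formula (not any engine-specific magic) is what produces those values.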
Apropos, I'm leaning towards the following:

#3 above warrants justification and discussion. On first glance, it would appear that ANSI SQL's `PERCENT_RANK` semantics could be addressed similarly to Pandas's `Dataframe.rank(method='MIN', pct=True)`. I suspect it can't, because:

```
pandas_min_percent_rank == (row_rank / num_rows_in_group)
sql_percent_rank        == ((row_rank - 1) / (num_rows_in_group - 1))
```

Also, there is a 1-1 correspondence between `cudf::rank_method` and Pandas's `Dataframe.rank(method)`.
I'm still a little concerned about the sorting discussion and would like to make sure that we address that in our documentation, but as long as that discussion is resolved everything else here looks great!
LGTM for the C++/Java bits. Also, tested this with Spark integration. Things look good.
```cpp
class percent_rank_aggregation final : public rolling_aggregation,
                                       public groupby_scan_aggregation,
                                       public scan_aggregation {
class ansi_sql_percent_rank_aggregation final : public rolling_aggregation,
```
Why is this a distinct aggregation type instead of an argument to the normal `rank_aggregation`?
Refer to #9569 (comment):

> Keep `PERCENT_RANK` as a separate aggregation. (Explained below.)
>
> #3 above warrants justification and discussion. On first glance, it would appear that ANSI SQL's `PERCENT_RANK` semantics could be addressed similarly to Pandas's `Dataframe.rank(method='MIN', pct=True)`. I suspect it can't, because:
>
> ```
> pandas_min_percent_rank == (row_rank / num_rows_in_group)
> sql_percent_rank        == ((row_rank - 1) / (num_rows_in_group - 1))
> ```
>
> Also, there is a 1-1 correspondence between `cudf::rank_method` and Pandas's `Dataframe.rank(method)`. It would be erroneous/nonsensical to attempt `PERCENT_RANK` aggregations with `pct=True`. My vote is currently to leave `PERCENT_RANK` separate. There might be value in renaming `PERCENT_RANK` to `ANSI_SQL_PERCENT_RANK`, to clarify that it is ANSI SQL compliant.
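The mismatch described above can be checked numerically in pandas. This is a small sketch with an arbitrary column of leg counts:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 8])

# pandas: pct=True divides the MIN rank by the group size.
pandas_pct = s.rank(method="min", pct=True)  # rank / n

# ANSI SQL: PERCENT_RANK divides (rank - 1) by (group size - 1).
min_rank = s.rank(method="min")
sql_pct = (min_rank - 1) / (len(s) - 1)      # (rank - 1) / (n - 1)

print(pandas_pct.tolist())  # [0.25, 0.5, 0.5, 1.0] -- never reaches 0
print(sql_pct.min())        # 0.0                   -- always starts at 0
```

The two results differ on every row, so neither formula can be expressed as the other via `rank_method` alone, which is the crux of the disagreement.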
@karthikeyann @mythrocks that doesn't sound like sufficient justification for a distinct aggregation when the only difference is a slight change in how the "percentage" is calculated.
For example, look at the standard deviation aggregation and how it exposes the `ddof` parameter to control a constant offset to the population size:
cudf/cpp/include/cudf/aggregation.hpp, lines 252 to 261 (at 565f474):

```cpp
/**
 * @brief Factory to create a STD aggregation
 *
 * @param ddof Delta degrees of freedom. The divisor used in calculation of
 * `std` is `N - ddof`, where `N` is the population size.
 *
 * @throw cudf::logic_error if input type is chrono or compound types.
 */
template <typename Base = aggregation>
std::unique_ptr<Base> make_std_aggregation(size_type ddof = 1);
```
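As an analogy, the `ddof` pattern can be sketched in NumPy: a single std aggregation parameterized by the divisor offset, rather than two separate aggregation types:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# One aggregation, parameterized: the divisor is N - ddof.
pop_std = np.std(x, ddof=0)     # population std, divide by N
sample_std = np.std(x, ddof=1)  # sample std, divide by N - 1

print(pop_std < sample_std)  # True: the smaller divisor inflates the estimate
```

The suggestion here is to give rank's percentage calculation the same treatment as `ddof`: one factory, one extra parameter.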
Instead of making the `percentage` argument to `make_rank_aggregation` a `bool`, make it something that exposes control over how the percentage is calculated.
Do not call them `sql` and `pandas` methods. It should be a red flag if you have to refer to a specific framework to name something. It should just be described in terms of the math.
To put the difference between the `sql` and `pandas` percentages in plain words:
- `sql` percent rank ranges from 0% to 100% (this applies only to the MIN method; all other methods are not applicable);
- `pandas` percent rank ranges from >0% to 100% (all methods are applicable).
To describe it in terms of math, how about adding a new method named `0-indexed-min-rank` and passing `percentage=true`? (`percentage=false` would not be supported.)
Note that we are dividing by the maximum rank value in the group (which is `group_size - 1`), not the group size.
@jrhemstad replaced the `ANSI_SQL_PERCENT_RANK` aggregation with a `MIN_0_INDEXED` `rank_method` in commit c988d8f.
@karthikeyann we had a longer discussion of this issue at the libcudf team meeting today to get aligned on what we want this API to look like. The current `rank_method`s, aside from the new `MIN_0_INDEXED`, are all about how ties are broken, which is very different from whether we compute a percentage or not. Our proposal is to instead use a separate enum like

```cpp
enum class rank_percentage : int32_t {
  NONE,             ///< rank
  ZERO_NORMALIZED,  ///< rank / group_size
  ONE_NORMALIZED    ///< (rank - 1) / (group_size - 1)
};
```

so that control over the percentage part of the calculation is independent of the tie-breaking. @mythrocks was tentatively supportive of this, but he mentioned a potential technical blocker: getting the `group_size` inside the aggregation was a problem. Could you identify and expand upon that problem? Figuring out where we have a problem there would help us move towards a better consensus solution.
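The proposed split can be sketched in Python. The names below are hypothetical and only mirror the proposed C++ enum; `ranked` is an illustrative helper, not a cudf API:

```python
from enum import Enum

import pandas as pd

class RankPercentage(Enum):
    NONE = 0             # raw rank
    ZERO_NORMALIZED = 1  # rank / group_size
    ONE_NORMALIZED = 2   # (rank - 1) / (group_size - 1)

def ranked(s: pd.Series, method: str, pct: RankPercentage) -> pd.Series:
    """Tie-breaking (method) is independent of normalization (pct)."""
    r = s.rank(method=method)
    n = len(s)
    if pct is RankPercentage.ZERO_NORMALIZED:
        return r / n
    if pct is RankPercentage.ONE_NORMALIZED:
        return (r - 1) / (n - 1)
    return r

s = pd.Series([2, 4, 4, 8])
print(ranked(s, "min", RankPercentage.ZERO_NORMALIZED).tolist())
```

Any of the tie-breaking methods can combine with any normalization, which is the orthogonality the enum design is after.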
This suggestion tracks closely with what @karthikeyann had suggested last week. Good idea.
I can't for the life of me remember what my reservation was. If it turns out to be legitimate, it should turn up during implementation. If it doesn't, c'est la vie.
Count me in as a 👍.
Updated the code with the solution from the discussion. Added `rank_percentage`.
@jrhemstad I think this resolves your question now.
Added documentation in …
rerun tests
@karthikeyann the changes seem to have broken the Java tests, any idea why? Once that's resolved, I think this PR is good to merge.
Fixed the bug.
Thank you @mythrocks @vyasr @jrhemstad for all the inputs and the reviews!
@gpucibot merge
Addresses part of #3591

`RANK` and `DENSE_RANK` were implemented for the Spark requirement. Pandas groupby has 3 more methods. `rank(column_view, rank_method)` already has all 5 methods implemented.

The current implementation has 2 separate aggregations, `RANK` and `DENSE_RANK`. These are merged into a single `RANK` with the parameters `rank_aggregation(rank_method method, null_policy null_handling, bool percentage)`.

Groupby.rank support for the 3 remaining methods will be added.
This PR is also a prerequisite for Spearman correlation.
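On the Spearman prerequisite: Spearman's correlation is simply the Pearson correlation of rank-transformed columns (with average-rank tie-breaking), which is why a full rank implementation is needed first. A pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]})

# Spearman == Pearson correlation computed on the ranks of each column.
via_ranks = df["x"].rank().corr(df["y"].rank())       # Pearson on ranks
direct = df["x"].corr(df["y"], method="spearman")     # built-in Spearman

print(round(via_ranks, 6), round(direct, 6))  # both print 0.8
```

With groupby rank available, the same rank-then-correlate composition becomes possible per group on the GPU.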
Additionally