Implement all methods of groupby rank aggregation in libcudf, python #9569

karthikeyann · 2021-10-29T20:45:02Z

Addresses part of #3591

move RANK (min method), DENSE_RANK (dense method) into single RANK aggregation
max method
average method
first method
percentage
order, null order
RANK, DENSE_RANK was implemented for spark requirement. Pandas groupby has 3 more methods. rank(column_view, rank_method) already has all 5 methods implemented.

Current implementation has 2 separate aggregations RANK and DENSE_RANK. This is merged to single RANK with parameters rank_aggregation(rank_method method, null_policy null_handling, bool percentage)
Groupby.rank support for 3 more methods will be added.

This PR is also pre-requisite for spearman correlation.

Additionally

Cython, Python plumbing
benchmark for groupby rank (all methods)
PERCENT_RANK aggregation is replaced with percentage argument ONE_NORMALIZED argument in RANK aggregation

codecov · 2021-10-29T23:30:34Z

Codecov Report

Merging #9569 (d20a783) into branch-22.06 (d6e3068) will increase coverage by 0.03%.
The diff coverage is 96.96%.

❗ Current head d20a783 differs from pull request most recent head 36c7a53. Consider uploading reports for the commit 36c7a53 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06    #9569      +/-   ##
================================================
+ Coverage         86.36%   86.40%   +0.03%     
================================================
  Files               142      142              
  Lines             22302    22312      +10     
================================================
+ Hits              19261    19278      +17     
+ Misses             3041     3034       -7

Impacted Files	Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/testing/dataset_generator.py	`73.25% <ø> (ø)`
python/cudf/cudf/testing/testing.py	`81.69% <ø> (ø)`
python/dask_cudf/dask_cudf/io/orc.py	`91.04% <ø> (ø)`
python/cudf/cudf/api/types.py	`89.36% <100.00%> (-0.44%)`	⬇️
python/cudf/cudf/core/_base_index.py	`85.71% <100.00%> (+0.25%)`	⬆️
python/cudf/cudf/core/column_accessor.py	`93.47% <100.00%> (ø)`
python/cudf/cudf/core/dtypes.py	`97.30% <100.00%> (ø)`
python/cudf/cudf/core/frame.py	`93.41% <100.00%> (ø)`
python/cudf/cudf/core/groupby/groupby.py	`91.79% <100.00%> (+0.34%)`	⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae7e979...36c7a53. Read the comment docs.

…fea-groupby_rank_full

github-actions · 2022-01-06T18:03:08Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

…fea-groupby_rank_full

mythrocks · 2022-02-24T05:44:23Z

On the question of making percent_rank() an option in rank_method, I think that might be possible. But we might have a challenge w.r.t conventions.

I should point out that all the SQL engines that I have tested/checked (including Hive, Impala, Spark, Presto, Drill, Oracle, and MySQL) follow the ANSI SQL conventions detailed here:

percent_rank = (rank - 1) / (rows - 1)

Here's an illustration from MySQL:

mysql> select *, rank() over ( order by num_legs ) as `rank`, percent_rank() over ( order by num_legs ) as `percent_rank` from animals;
+---------+----------+------+--------------+
| animal  | num_legs | rank | percent_rank |
+---------+----------+------+--------------+
| snake   |     NULL |    1 |            0 |
| penguin |        2 |    2 |         0.25 |
| cat     |        4 |    3 |          0.5 |
| dog     |        4 |    3 |          0.5 |
| spider  |        8 |    5 |            1 |
+---------+----------+------+--------------+

My guess is that DaskSQL would prefer to follow the same convention in the future.
Note how this differs from the Pandas conventions of using rank/num_rows:

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN

When/if percent_rank is folded into rank_method, it would be good to have a way to choose between SQL and Pandas semantics. There are bound to be takers for both.

mythrocks · 2022-02-25T00:20:45Z

When/if percent_rank is folded into rank_method, it would be good to have a way to choose between SQL and Pandas semantics. There are bound to be takers for both.

Apropos, I'm leaning towards the following:

For grouping support in cudf::rank(), we delegate to groupby::sort_scan() (and scan_result_functor), since they're aware of groupby semantics for ranking. These already work for Spark. Pandas semantics seem to be identical.
Since Pandas allows for expressing all ranking functions as fractions (via the pct=True parameter), we add support for pct. Since Spark does not use this, the feature caters only to Pandas.
Keep PERCENT_RANK as a separate aggregation. (Explained below.)

#3 above warrants justification, and discussion. On first glance, it would appear that ANSI SQL's PERCENT_RANK semantics could be addressed similarly to Pandas's Dataframe.rank(method='MIN', pct='True').
I suspect it can't, because:

pandas_min_percent_rank == (row_rank / num_rows_in_group);
sql_percent_rank == ((row_rank - 1) / (num_rows_in_group-1));

Also, there is a 1-1 correspondence between cudf::rank_method and Pandas's Dataframe.rank(method). It would be erroneous/nonsensical to attempt PERCENT_RANK aggregations with pct=True.
My vote is currently to leave PERCENT_RANK separate. There might be value in renaming PERCENT_RANK to ANSI_SQL_PERCENT_RANK, to clarify that this is ANSI SQL compliant.

vyasr

I'm still a little concerned about the sorting discussion and would like to make sure that we address that in our documentation, but as long as that discussion is resolved everything else here looks great!

mythrocks

LGTM for the C++/Java bits. Also, tested this with Spark integration. Things look good.

cpp/benchmarks/groupby/group_rank_benchmark.cu

jrhemstad · 2022-04-15T19:02:31Z

cpp/include/cudf/detail/aggregation/aggregation.hpp

-class percent_rank_aggregation final : public rolling_aggregation,
-                                       public groupby_scan_aggregation,
-                                       public scan_aggregation {
+class ansi_sql_percent_rank_aggregation final : public rolling_aggregation,


Why is this a distinct aggregation type instead of an argument to the normal rank_aggregation?

Refer #9569 (comment)

Keep PERCENT_RANK as a separate aggregation. (Explained below.)

#3 above warrants justification, and discussion. On first glance, it would appear that ANSI SQL's PERCENT_RANK semantics could be addressed similarly to Pandas's Dataframe.rank(method='MIN', pct='True'). I suspect it can't, because:

pandas_min_percent_rank == (row_rank / num_rows_in_group); sql_percent_rank == ((row_rank - 1) / (num_rows_in_group-1));

Also, there is a 1-1 correspondence between cudf::rank_method and Pandas's Dataframe.rank(method). It would be erroneous/nonsensical to attempt PERCENT_RANK aggregations with pct=True. My vote is currently to leave PERCENT_RANK separate. There might be value in renaming PERCENT_RANK to ANSI_SQL_PERCENT_RANK, to clarify that this is ANSI SQL compliant.

@karthikeyann @mythrocks that doesn't sound like sufficient justification for a distinct aggregation when the only difference is a slight change in how the "percentage" is calculated.

For example, look at the standard deviation aggregation and how it exposes the ddof parameter to control a constant offset to the population size:

cudf/cpp/include/cudf/aggregation.hpp

Lines 252 to 261 in 565f474

/**

* @brief Factory to create a STD aggregation

*

* @param ddof Delta degrees of freedom. The divisor used in calculation of

* `std` is `N - ddof`, where `N` is the population size.

*

* @throw cudf::logic_error if input type is chrono or compound types.

*/

template <typename Base = aggregation>

std::unique_ptr<Base> make_std_aggregation(size_type ddof = 1);

Instead of making the percentage argument to make_rank_aggregation be a bool, make it something where you can expose control over how the percentage is calculated.

Do not call it sql and pandas methods. It should be a red flag if you have to refer to a specific framework for naming something. It should just be described in terms of the math.

To put in plain words about the difference between sql and pandas percentage,
sql percentage rank ranges from 0% to 100% (only applies for MIN method, all other methods are not applicable).
pandas percentage rank ranges from >0% to 100%. (all methods are applicable)

To describe in terms of math, how about adding new method named 0-indexed-min-rank and pass percentage true ? (but percentage = false will not be supported.)

Note that We are dividing by maximum of group rank value (which is group_size-1 ), not the group size.

@jrhemstad replaced ANSI_SQL_PERCENT_RANK aggregation with MIN_0_INDEXED rank_method
commit c988d8f

@karthikeyann we had a longer discussion of this issue at the libcudf team meeting today to get aligned on what we want this API to look like. The current rank_methods aside from the new MIN_0_INDEXED are all about how ties are broken, which is very different from whether we compute a percentage or not. Our proposal is to instead use a separate enum like

enum class rank_percentage : int32_t { NONE, ///< rank ZERO_NORMALIZED, ///< rank / group_size ONE_NORMALIZED ///< (rank - 1) / (group_size - 1) };

so that control over the percentage or not part of the calculation is independent of the tiebreaking. @mythrocks was tentatively supportive of this, but he mentioned that there was potentially a technical blocker here because getting the group_size inside the aggregation was a problem. Could you could identify and expand upon that problem? Figuring out where we have a problem there would help us move towards a better consensus solution.

This suggestion tracks closely with what @karthikeyann had suggested last week. Good idea.

I can't for the life of me remember what my reservation was regarding. If it turns out to be legitimate, it should turn up during implementation. If it doesn't, c'est la vie.

Count me in as a 👍.

updated code with the discussion solution. Added rank_percentage.

@jrhemstad I think this resolves your question now.

…fea-groupby_rank_full

karthikeyann · 2022-04-18T14:11:44Z

I'm still a little concerned about the sorting discussion and would like to make sure that we address that in our documentation, but as long as that discussion is resolved everything else here looks great!

added documentation in make_rank_aggregation Notes.

…fea-groupby_rank_full

karthikeyann · 2022-04-27T09:44:04Z

rerun tests

vyasr · 2022-04-27T13:02:44Z

@karthikeyann the changes seem to have broken the Java tests, any idea why? Once that's resolved I think this PR is good to merge.

karthikeyann · 2022-04-27T16:30:32Z

@karthikeyann the changes seem to have broken the Java tests, any idea why? Once that's resolved I think this PR is good to merge.

Fixed the bug.

karthikeyann · 2022-04-28T16:22:32Z

Thank you @mythrocks @vyasr @jrhemstad for all the inputs and the reviews!
Learned a lot from every suggestion made. Thank you.

karthikeyann · 2022-04-28T16:22:42Z

@gpucibot merge

move RANK, DENSE_RANK into single RANK aggregation

7085ad1

karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Oct 29, 2021

github-actions bot added the Python Affects Python cuDF API. label Oct 29, 2021

karthikeyann added 2 commits November 8, 2021 18:21

Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …

24a11c9

…fea-groupby_rank_full

Merge branch 'branch-21.12' of https://github.com/rapidsai/cudf into …

007bafb

…fea-groupby_rank_full

karthikeyann changed the base branch from branch-21.12 to branch-22.02 November 15, 2021 03:07

karthikeyann added 4 commits November 15, 2021 09:11

style fix clang-format

dca492a

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

d970133

…fea-groupby_rank_full

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

406faed

…fea-groupby_rank_full

fix factories usage

e5e97fd

github-actions bot added the Java Affects Java cuDF API. label Dec 7, 2021

karthikeyann added 2 commits December 7, 2021 19:43

fix throw message

fa1cd22

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

0e67a17

…fea-groupby_rank_full

github-actions bot added the inactive-30d label Jan 6, 2022

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

b79fe0c

…fea-groupby_rank_full

karthikeyann changed the base branch from branch-22.02 to branch-22.04 February 16, 2022 13:23

karthikeyann mentioned this pull request Feb 16, 2022

Support percent_rank() aggregation #10227

Merged

github-actions bot removed the inactive-30d label Feb 24, 2022

karthikeyann requested a review from vyasr April 14, 2022 18:50

vyasr approved these changes Apr 14, 2022

View reviewed changes

mythrocks approved these changes Apr 15, 2022

View reviewed changes

jrhemstad reviewed Apr 15, 2022

View reviewed changes

cpp/benchmarks/groupby/group_rank_benchmark.cu Outdated Show resolved Hide resolved

jrhemstad requested changes Apr 15, 2022

View reviewed changes

karthikeyann added 4 commits April 18, 2022 16:06

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

b810a84

…fea-groupby_rank_full

fix merge issues

fbdb80e

move groupby rank gbench to nvbench

fef487c

documentation about orderby sorting requirement in rank aggregations

51a4fe6

karthikeyann requested a review from jrhemstad April 18, 2022 14:14

karthikeyann added 5 commits April 21, 2022 18:29

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

ee88304

…fea-groupby_rank_full

rename ANSI_SQL_PERCENT_RANK to MIN_0_INDEXED rank_method

c988d8f

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

d269d2d

…fea-groupby_rank_full

add rank_percentage, replace MIN_0_INDEXED

fc8b625

update percentage enum in java, python usages

fb8c096

fix group_size==1 as zero rank percentage

36c7a53

jrhemstad approved these changes Apr 27, 2022

View reviewed changes

rapids-bot bot merged commit 9ac2477 into rapidsai:branch-22.06 Apr 28, 2022

karthikeyann mentioned this pull request May 2, 2022

[FEA] Add support for "rank" to cudf.groupby #3591

Closed

vyasr removed the 4 - Needs cuDF (Python) Reviewer label Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement all methods of groupby rank aggregation in libcudf, python #9569

Implement all methods of groupby rank aggregation in libcudf, python #9569

karthikeyann commented Oct 29, 2021 •

edited

Loading

codecov bot commented Oct 29, 2021 •

edited

Loading

github-actions bot commented Jan 6, 2022

mythrocks commented Feb 24, 2022 •

edited

Loading

mythrocks commented Feb 25, 2022

vyasr left a comment

mythrocks left a comment

jrhemstad Apr 15, 2022

karthikeyann Apr 18, 2022 •

edited

Loading

jrhemstad Apr 19, 2022 •

edited

Loading

karthikeyann Apr 19, 2022

karthikeyann Apr 21, 2022 •

edited

Loading

karthikeyann Apr 25, 2022

vyasr Apr 25, 2022 •

edited by bdice

Loading

mythrocks Apr 26, 2022

karthikeyann Apr 26, 2022

vyasr Apr 27, 2022

karthikeyann commented Apr 18, 2022

karthikeyann commented Apr 27, 2022

vyasr commented Apr 27, 2022

karthikeyann commented Apr 27, 2022

karthikeyann commented Apr 28, 2022

karthikeyann commented Apr 28, 2022

	/**
	* @brief Factory to create a STD aggregation
	*
	* @param ddof Delta degrees of freedom. The divisor used in calculation of
	* `std` is `N - ddof`, where `N` is the population size.
	*
	* @throw cudf::logic_error if input type is chrono or compound types.
	*/
	template <typename Base = aggregation>
	std::unique_ptr<Base> make_std_aggregation(size_type ddof = 1);

Implement all methods of groupby rank aggregation in libcudf, python #9569

Implement all methods of groupby rank aggregation in libcudf, python #9569

Conversation

karthikeyann commented Oct 29, 2021 • edited Loading

codecov bot commented Oct 29, 2021 • edited Loading

Codecov Report

github-actions bot commented Jan 6, 2022

mythrocks commented Feb 24, 2022 • edited Loading

mythrocks commented Feb 25, 2022

vyasr left a comment

Choose a reason for hiding this comment

mythrocks left a comment

Choose a reason for hiding this comment

jrhemstad Apr 15, 2022

Choose a reason for hiding this comment

karthikeyann Apr 18, 2022 • edited Loading

Choose a reason for hiding this comment

jrhemstad Apr 19, 2022 • edited Loading

Choose a reason for hiding this comment

karthikeyann Apr 19, 2022

Choose a reason for hiding this comment

karthikeyann Apr 21, 2022 • edited Loading

Choose a reason for hiding this comment

karthikeyann Apr 25, 2022

Choose a reason for hiding this comment

vyasr Apr 25, 2022 • edited by bdice Loading

Choose a reason for hiding this comment

mythrocks Apr 26, 2022

Choose a reason for hiding this comment

karthikeyann Apr 26, 2022

Choose a reason for hiding this comment

vyasr Apr 27, 2022

Choose a reason for hiding this comment

karthikeyann commented Apr 18, 2022

karthikeyann commented Apr 27, 2022

vyasr commented Apr 27, 2022

karthikeyann commented Apr 27, 2022

karthikeyann commented Apr 28, 2022

karthikeyann commented Apr 28, 2022

karthikeyann commented Oct 29, 2021 •

edited

Loading

codecov bot commented Oct 29, 2021 •

edited

Loading

mythrocks commented Feb 24, 2022 •

edited

Loading

karthikeyann Apr 18, 2022 •

edited

Loading

jrhemstad Apr 19, 2022 •

edited

Loading

karthikeyann Apr 21, 2022 •

edited

Loading

vyasr Apr 25, 2022 •

edited by bdice

Loading