Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement all methods of groupby rank aggregation in libcudf, python #9569

Merged
merged 56 commits into from
Apr 28, 2022

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 29, 2021

Addresses part of #3591

  • move RANK (min method), DENSE_RANK (dense method) into single RANK aggregation
  • max method
  • average method
  • first method
  • percentage
  • order, null order
    RANK, DENSE_RANK was implemented for spark requirement. Pandas groupby has 3 more methods. rank(column_view, rank_method) already has all 5 methods implemented.

Current implementation has 2 separate aggregations RANK and DENSE_RANK. This is merged to single RANK with parameters rank_aggregation(rank_method method, null_policy null_handling, bool percentage)
Groupby.rank support for 3 more methods will be added.

This PR is also pre-requisite for spearman correlation.

Additionally

  • Cython, Python plumbing
  • benchmark for groupby rank (all methods)
  • PERCENT_RANK aggregation is replaced with percentage argument ONE_NORMALIZED argument in RANK aggregation

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Oct 29, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 29, 2021
@codecov
Copy link

codecov bot commented Oct 29, 2021

Codecov Report

Merging #9569 (d20a783) into branch-22.06 (d6e3068) will increase coverage by 0.03%.
The diff coverage is 96.96%.

❗ Current head d20a783 differs from pull request most recent head 36c7a53. Consider uploading reports for the commit 36c7a53 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06    #9569      +/-   ##
================================================
+ Coverage         86.36%   86.40%   +0.03%     
================================================
  Files               142      142              
  Lines             22302    22312      +10     
================================================
+ Hits              19261    19278      +17     
+ Misses             3041     3034       -7     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/testing/dataset_generator.py 73.25% <ø> (ø)
python/cudf/cudf/testing/testing.py 81.69% <ø> (ø)
python/dask_cudf/dask_cudf/io/orc.py 91.04% <ø> (ø)
python/cudf/cudf/api/types.py 89.36% <100.00%> (-0.44%) ⬇️
python/cudf/cudf/core/_base_index.py 85.71% <100.00%> (+0.25%) ⬆️
python/cudf/cudf/core/column_accessor.py 93.47% <100.00%> (ø)
python/cudf/cudf/core/dtypes.py 97.30% <100.00%> (ø)
python/cudf/cudf/core/frame.py 93.41% <100.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 91.79% <100.00%> (+0.34%) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae7e979...36c7a53. Read the comment docs.

@karthikeyann karthikeyann changed the base branch from branch-21.12 to branch-22.02 November 15, 2021 03:07
@github-actions github-actions bot added the Java Affects Java cuDF API. label Dec 7, 2021
@github-actions
Copy link

github-actions bot commented Jan 6, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@karthikeyann karthikeyann changed the base branch from branch-22.02 to branch-22.04 February 16, 2022 13:23
@mythrocks
Copy link
Contributor

mythrocks commented Feb 24, 2022

On the question of making percent_rank() an option in rank_method, I think that might be possible. But we might have a challenge w.r.t conventions.

I should point out that all the SQL engines that I have tested/checked (including Hive, Impala, Spark, Presto, Drill, Oracle, and MySQL) follow the ANSI SQL conventions detailed here:

percent_rank = (rank - 1) / (rows - 1)

Here's an illustration from MySQL:

mysql> select *, rank() over ( order by num_legs ) as `rank`, percent_rank() over ( order by num_legs ) as `percent_rank` from animals;
+---------+----------+------+--------------+
| animal  | num_legs | rank | percent_rank |
+---------+----------+------+--------------+
| snake   |     NULL |    1 |            0 |
| penguin |        2 |    2 |         0.25 |
| cat     |        4 |    3 |          0.5 |
| dog     |        4 |    3 |          0.5 |
| spider  |        8 |    5 |            1 |
+---------+----------+------+--------------+

My guess is that DaskSQL would prefer to follow the same convention in the future.
Note how this differs from the Pandas conventions of using rank/num_rows:

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN

When/if percent_rank is folded into rank_method, it would be good to have a way to choose between SQL and Pandas semantics. There are bound to be takers for both.

@mythrocks
Copy link
Contributor

When/if percent_rank is folded into rank_method, it would be good to have a way to choose between SQL and Pandas semantics. There are bound to be takers for both.

Apropos, I'm leaning towards the following:

  1. For grouping support in cudf::rank(), we delegate to groupby::sort_scan() (and scan_result_functor), since they're aware of groupby semantics for ranking. These already work for Spark. Pandas semantics seem to be identical.
  2. Since Pandas allows for expressing all ranking functions as fractions (via the pct=True parameter), we add support for pct. Since Spark does not use this, the feature caters only to Pandas.
  3. Keep PERCENT_RANK as a separate aggregation. (Explained below.)

#3 above warrants justification, and discussion. On first glance, it would appear that ANSI SQL's PERCENT_RANK semantics could be addressed similarly to Pandas's Dataframe.rank(method='MIN', pct='True').
I suspect it can't, because:

pandas_min_percent_rank == (row_rank / num_rows_in_group);
sql_percent_rank == ((row_rank - 1) / (num_rows_in_group-1));

Also, there is a 1-1 correspondence between cudf::rank_method and Pandas's Dataframe.rank(method). It would be erroneous/nonsensical to attempt PERCENT_RANK aggregations with pct=True.
My vote is currently to leave PERCENT_RANK separate. There might be value in renaming PERCENT_RANK to ANSI_SQL_PERCENT_RANK, to clarify that this is ANSI SQL compliant.

@karthikeyann karthikeyann requested a review from vyasr April 14, 2022 18:50
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little concerned about the sorting discussion and would like to make sure that we address that in our documentation, but as long as that discussion is resolved everything else here looks great!

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the C++/Java bits. Also, tested this with Spark integration. Things look good.

class percent_rank_aggregation final : public rolling_aggregation,
public groupby_scan_aggregation,
public scan_aggregation {
class ansi_sql_percent_rank_aggregation final : public rolling_aggregation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a distinct aggregation type instead of an argument to the normal rank_aggregation?

Copy link
Contributor Author

@karthikeyann karthikeyann Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer #9569 (comment)

  1. Keep PERCENT_RANK as a separate aggregation. (Explained below.)

#3 above warrants justification, and discussion. On first glance, it would appear that ANSI SQL's PERCENT_RANK semantics could be addressed similarly to Pandas's Dataframe.rank(method='MIN', pct='True'). I suspect it can't, because:

pandas_min_percent_rank == (row_rank / num_rows_in_group);
sql_percent_rank == ((row_rank - 1) / (num_rows_in_group-1));

Also, there is a 1-1 correspondence between cudf::rank_method and Pandas's Dataframe.rank(method). It would be erroneous/nonsensical to attempt PERCENT_RANK aggregations with pct=True. My vote is currently to leave PERCENT_RANK separate. There might be value in renaming PERCENT_RANK to ANSI_SQL_PERCENT_RANK, to clarify that this is ANSI SQL compliant.

Copy link
Contributor

@jrhemstad jrhemstad Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann @mythrocks that doesn't sound like sufficient justification for a distinct aggregation when the only difference is a slight change in how the "percentage" is calculated.

For example, look at the standard deviation aggregation and how it exposes the ddof parameter to control a constant offset to the population size:

/**
* @brief Factory to create a STD aggregation
*
* @param ddof Delta degrees of freedom. The divisor used in calculation of
* `std` is `N - ddof`, where `N` is the population size.
*
* @throw cudf::logic_error if input type is chrono or compound types.
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_std_aggregation(size_type ddof = 1);

Instead of making the percentage argument to make_rank_aggregation be a bool, make it something where you can expose control over how the percentage is calculated.

Do not call it sql and pandas methods. It should be a red flag if you have to refer to a specific framework for naming something. It should just be described in terms of the math.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put in plain words about the difference between sql and pandas percentage,
sql percentage rank ranges from 0% to 100% (only applies for MIN method, all other methods are not applicable).
pandas percentage rank ranges from >0% to 100%. (all methods are applicable)

Copy link
Contributor Author

@karthikeyann karthikeyann Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To describe in terms of math, how about adding new method named 0-indexed-min-rank and pass percentage true ? (but percentage = false will not be supported.)

Note that We are dividing by maximum of group rank value (which is group_size-1 ), not the group size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad replaced ANSI_SQL_PERCENT_RANK aggregation with MIN_0_INDEXED rank_method
commit c988d8f

Copy link
Contributor

@vyasr vyasr Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann we had a longer discussion of this issue at the libcudf team meeting today to get aligned on what we want this API to look like. The current rank_methods aside from the new MIN_0_INDEXED are all about how ties are broken, which is very different from whether we compute a percentage or not. Our proposal is to instead use a separate enum like

enum class rank_percentage : int32_t {
  NONE,             ///< rank
  ZERO_NORMALIZED,  ///< rank / group_size
  ONE_NORMALIZED    ///< (rank - 1) / (group_size - 1)
};

so that control over the percentage or not part of the calculation is independent of the tiebreaking. @mythrocks was tentatively supportive of this, but he mentioned that there was potentially a technical blocker here because getting the group_size inside the aggregation was a problem. Could you could identify and expand upon that problem? Figuring out where we have a problem there would help us move towards a better consensus solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion tracks closely with what @karthikeyann had suggested last week. Good idea.

I can't for the life of me remember what my reservation was regarding. If it turns out to be legitimate, it should turn up during implementation. If it doesn't, c'est la vie.

Count me in as a 👍.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated code with the discussion solution. Added rank_percentage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad I think this resolves your question now.

@karthikeyann
Copy link
Contributor Author

I'm still a little concerned about the sorting discussion and would like to make sure that we address that in our documentation, but as long as that discussion is resolved everything else here looks great!

added documentation in make_rank_aggregation Notes.

@karthikeyann karthikeyann requested a review from jrhemstad April 18, 2022 14:14
@karthikeyann
Copy link
Contributor Author

rerun tests

@vyasr
Copy link
Contributor

vyasr commented Apr 27, 2022

@karthikeyann the changes seem to have broken the Java tests, any idea why? Once that's resolved I think this PR is good to merge.

@karthikeyann
Copy link
Contributor Author

@karthikeyann the changes seem to have broken the Java tests, any idea why? Once that's resolved I think this PR is good to merge.

Fixed the bug.

@karthikeyann
Copy link
Contributor Author

Thank you @mythrocks @vyasr @jrhemstad for all the inputs and the reviews!
Learned a lot from every suggestion made. Thank you.

@karthikeyann
Copy link
Contributor Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants