ENH: Add support for multi-column quantiles of DataFrame #43881

Closed
charlesbluca opened this issue Oct 4, 2021 · 4 comments · Fixed by #44301
Labels
Compat, Enhancement, quantile
Milestone

Comments

@charlesbluca
Contributor

charlesbluca commented Oct 4, 2021

Is your feature request related to a problem?

For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:

>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0

It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Describe the solution you'd like

I imagine the addition of multi-column quantiles support could happen in two ways:

  1. the addition of a default kwarg to DataFrame.quantile to specify whether or not we want multi-column quantiles
  2. the addition of a new method to compute multi-column quantiles independent from the logic of quantile

In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.

API breaking implications

I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.

Describe alternatives you've considered

This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
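The sort-then-index workaround described above can be sketched roughly as follows. This is only an illustration in plain pandas, not cuDF's implementation; the helper name multi_column_quantiles and its interpolation argument are hypothetical, and only "lower", "higher" and "nearest" are accepted so that every result is an existing row.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q, interpolation="nearest"):
    """Hypothetical helper: quantile rows of df after sorting by all columns."""
    rounders = {"lower": np.floor, "higher": np.ceil, "nearest": np.rint}
    if interpolation not in rounders:
        raise ValueError("interpolation must be 'lower', 'higher' or 'nearest'")
    q = np.atleast_1d(np.asarray(q, dtype=float))
    # Lexicographic sort: first column is the primary key, then the next, etc.
    ordered = df.sort_values(list(df.columns)).reset_index(drop=True)
    # Fractional row position of each quantile, rounded to an actual row.
    # Note: np.rint rounds exact halves to the nearest even position.
    positions = rounders[interpolation](q * (len(ordered) - 1)).astype(int)
    result = ordered.iloc[positions]
    result.index = q
    return result


df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
out = multi_column_quantiles(df[["a", "b"]], [0, 0.5, 1])
```

On the five-row frame used here, out holds the rows (1, 1), (1, 3) and (1, 5), which matches the cuDF output shown earlier in this issue. The full sort is the part a dedicated implementation could presumably avoid.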

Additional context

If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
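To make the sort_values use case a little more concrete, here is a rough sketch of what a multi-column searchsorted could do with quantile rows used as partition boundaries. It is not the dask or dask-cudf algorithm; the helper name assign_partitions is hypothetical, and it relies on Python tuple comparison, which is simple but slow.

```python
import bisect

import pandas as pd


def assign_partitions(df, boundaries):
    """Hypothetical helper: lexicographic 'searchsorted' of each row of df
    against boundary rows (for example, multi-column quantiles)."""
    # boundaries must already be sorted by all columns, in df's column order.
    bounds = list(boundaries[list(df.columns)].itertuples(index=False, name=None))
    rows = df.itertuples(index=False, name=None)
    # bisect compares tuples element by element, i.e. lexicographically.
    return pd.Series(
        [bisect.bisect_right(bounds, row) for row in rows], index=df.index
    )


df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 3, 0, 5]})
boundaries = pd.DataFrame({"a": [1, 2], "b": [2, 1]})
partitions = assign_partitions(df, boundaries)
```

Each row is assigned the number of boundary rows that sort at or before it, so rows with the same partition number could be shuffled to the same output partition and sorted locally.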

@charlesbluca added the Enhancement and Needs Triage labels Oct 4, 2021
@mzeitlin11
Member

Thanks for the request @charlesbluca! At first glance, this API seems somewhat confusing because it sounds like a combination of 2 distinct ops -> computing the quantile on the first arg of the list, then an indexing operation with other included columns. For the average user, an alternative API might be one which allows returning the indices of the requested quantiles (analogous to the relationship between max and idxmax). This might also be implemented with something like a return_indices argument?

One other question here would be how to handle cases where quantiles don't evenly line up with values (so there is no corresponding index). Might have to restrict interpolation argument to lower, higher, nearest.
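The idea of returning indices, together with the restricted interpolation, could look something like the sketch below for a single Series. The name idxquantile is hypothetical (pandas has no such method); it is only meant to show why "lower", "higher" and "nearest" are the options that always map to an existing label.

```python
import numpy as np
import pandas as pd

_ROUNDERS = {"lower": np.floor, "higher": np.ceil, "nearest": np.rint}


def idxquantile(s, q, interpolation="nearest"):
    """Hypothetical analogue of idxmax: index labels of the requested quantiles."""
    if interpolation not in _ROUNDERS:
        # 'linear' or 'midpoint' can fall between two values, so no label exists.
        raise ValueError("only 'lower', 'higher' and 'nearest' map to a row")
    q = np.atleast_1d(np.asarray(q, dtype=float))
    # Positions of the values in ascending order.
    order = np.argsort(s.to_numpy(), kind="stable")
    # Fractional position of each quantile, rounded to an actual element.
    positions = _ROUNDERS[interpolation](q * (len(s) - 1)).astype(int)
    return pd.Series(s.index[order[positions]], index=q)


s = pd.Series([1, 0, 11, 12, 2], index=list("vwxyz"))
labels = idxquantile(s, [0, 0.5, 1])
```

With the labels in hand, s.loc[labels] (or df.loc[labels] for other columns) recovers the values, in the same way idxmax relates to max.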

@mzeitlin11 added the Compat and Numeric Operations labels and removed the Needs Triage label Oct 4, 2021
@mzeitlin11
Member

Another consideration would be how much faster a specific implementation would be compared to the alternative of just finding the indices with an equality check.

@rjzamora

rjzamora commented Oct 6, 2021

Sorry for taking so long to chime in here @charlesbluca. Thanks for raising this!

It would be nice if Pandas also supported "multi-column" quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would:

  1. compute the quantiles for column c
  2. using c's quantiles as an index, select the corresponding rows of column a

I may be misunderstanding, but I am fairly certain that this is not what cudf.DataFrame.quantiles does. Rather, it computes the coupled quantiles of all columns in the DataFrame (not the quantiles of the first column). There is no "indexing" workaround in pandas. The only workaround is to convert all columns to a single Series of tuples, which is very slow.

[EDIT] Ah - I guess this indexing trick sort-of works if you perform a lexicographical sort by all columns first?
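A small check of that equivalence, using the columns from the example at the top of this issue: sorting one Python tuple per row gives the same order as a lexicographic sort_values on the columns. This is only a sketch; the variable names are made up for illustration, and it does not say anything about how cuDF does it internally.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 11, 12, 2], "c": [0, 1, 5, 2, 3]})
cols = ["c", "a"]

# Slow path: materialize one Python tuple per row and sort those.
as_tuples = sorted(df[cols].itertuples(index=False, name=None))

# Columnar path: lexicographic sort on the columns themselves.
as_frame = df.sort_values(cols)[cols]

# Both approaches produce the same row order.
same_order = as_tuples == list(as_frame.itertuples(index=False, name=None))

# Middle row of the sorted frame, i.e. the coupled 0.5 quantile.
middle_row = as_frame.iloc[(len(as_frame) - 1) // 2]
```

Here the middle row is (c=2, a=12), whereas the per-column medians from df[['c', 'a']].quantile(0.5) are c=2.0 and a=2.0, which shows that the coupled result is not just the per-column quantiles stacked together.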

@charlesbluca
Contributor Author

Ah yes @rjzamora raises a good point - my original example doesn't highlight this, but cuDF's multi-column quantiles does compute the quantiles for the dataframe after it is lexicographically sorted by all columns; this example makes that more obvious:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})

In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Apologies for the mislead @mzeitlin11 - it looks like in this case, we would need more than just the indices of a single column quantile operation to compute this.

Might have to restrict interpolation argument to lower, higher, nearest.

This is the exact restriction placed on cuDF's quantiles:

https://github.com/rapidsai/cudf/blob/68c56b7013e0a4e9cf4b420a11e476112a6655c0/python/cudf/cudf/core/dataframe.py#L5993-L6005

@jbrockmendel added the quantile label and removed the Numeric Operations label Dec 16, 2021
@jreback added this to the 1.5 milestone Aug 12, 2022