ENH: Add support for multi-column quantiles of DataFrame #43881

charlesbluca · 2021-10-04T17:34:47Z

Is your feature request related to a problem?

For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:

>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0

It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Describe the solution you'd like

I imagine the addition of multi-column quantiles support could happen in two ways:

the addition of a default kwarg to DataFrame.quantile to specify whether or not we want multi-column quantiles
the addition of a new method to compute multi-column quantiles independent from the logic of quantile

In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.

API breaking implications

I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.

Describe alternatives you've considered

This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
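A minimal sketch of that sort-then-index workaround, for illustration only. The helper name `multi_column_quantiles` and its `interpolation` parameter are hypothetical and not part of pandas; the sketch assumes a non-empty dataframe and only supports the three interpolation modes that select an existing row.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q, interpolation="nearest"):
    """Hypothetical helper (not a pandas API): pick whole rows at quantile positions.

    Assumes `df` is non-empty and `q` is a list of floats in [0, 1].
    """
    # Sort lexicographically by every column, in the order the columns appear.
    ordered = df.sort_values(list(df.columns)).reset_index(drop=True)

    # Fractional row position of each requested quantile.
    pos = np.asarray(q, dtype=float) * (len(ordered) - 1)

    # Only modes that land on an existing row make sense for whole rows.
    if interpolation == "lower":
        idx = np.floor(pos)
    elif interpolation == "higher":
        idx = np.ceil(pos)
    elif interpolation == "nearest":
        idx = np.rint(pos)  # rounds halves to the even position
    else:
        raise ValueError("interpolation must be 'lower', 'higher' or 'nearest'")

    out = ordered.iloc[idx.astype(int)].copy()
    out.index = pd.Index(q, dtype="float64")
    return out
```

With the `df` defined at the top of this issue, `multi_column_quantiles(df[['c', 'a']], [0.5])` selects the existing row with c=2 and a=12, whereas the per-column `quantile` reports 2.0 for both columns.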

Additional context

If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
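To make the "multi-columnar searchsorted" idea concrete, here is a rough pure-Python sketch. The helper `lexicographic_searchsorted` is hypothetical (it is neither a pandas nor a cuDF API), and it does not reproduce the dask-cudf algorithm; it only illustrates how quantile rows could act as dividers that assign each row to an output partition.

```python
import bisect

import pandas as pd


def lexicographic_searchsorted(sorted_df, values, side="left"):
    """Hypothetical helper: insertion positions of the rows of `values` in `sorted_df`.

    Assumes `sorted_df` is already sorted lexicographically by all of its
    columns and that `values` has the same columns.
    """
    # Rows become plain tuples, which Python compares lexicographically.
    haystack = list(sorted_df.itertuples(index=False, name=None))
    needles = values[list(sorted_df.columns)].itertuples(index=False, name=None)
    search = bisect.bisect_left if side == "left" else bisect.bisect_right
    return [search(haystack, row) for row in needles]


# Dividers as they might come out of a multi-column quantile computation.
dividers = pd.DataFrame({"a": [1, 2], "b": [3, 0]})
rows = pd.DataFrame({"a": [1, 1, 2, 3], "b": [2, 3, 5, 0]})

# Each position can be read as the id of the partition the row belongs to.
partition_ids = lexicographic_searchsorted(dividers, rows, side="right")
```

Going through Python tuples is slow; the sketch is meant to show the semantics, not a performant implementation.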


mzeitlin11 · 2021-10-04T18:51:30Z

Thanks for the request @charlesbluca! At first glance, this API seems somewhat confusing because it sounds like a combination of 2 distinct ops -> computing the quantile on the first arg of the list, then an indexing operation with other included columns. For the average user, an alternative API might be one which allows returning the indices of the requested quantiles (analogous to the relationship between max and idxmax). This might also be implemented with something like a return_indices argument?

One other question here would be how to handle cases where quantiles don't evenly line up with values (so there is no corresponding index). Might have to restrict interpolation argument to lower, higher, nearest.

mzeitlin11 · 2021-10-04T18:58:13Z

Another consideration would be how much faster a dedicated implementation would be compared to the alternative of just finding the indices with an equality check.
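A small sketch of the equality-check alternative mentioned above, assuming a single column. The helper `quantile_indices` is hypothetical; pandas has no `return_indices` argument on `quantile`. It uses `interpolation="nearest"` so that every quantile is an actual value in the data.

```python
import pandas as pd


def quantile_indices(s, q):
    """Hypothetical helper: index labels whose values equal each requested quantile."""
    # "nearest" guarantees each quantile is an existing value of the Series.
    values = s.quantile(q, interpolation="nearest")
    # The equality check may match several labels when the value is repeated.
    return {level: s.index[s == value].tolist() for level, value in values.items()}


unique_values = quantile_indices(pd.Series([0, 1, 5, 2, 3]), [0, 0.5, 1])
tied_values = quantile_indices(pd.Series([1, 1, 1, 1, 1]), [0.5])
```

When the column contains repeated values, as in `tied_values`, the equality check returns every matching label, so the indices alone do not identify a single row.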

rjzamora · 2021-10-06T18:22:18Z

Sorry for taking so long to chime in here @charlesbluca. Thanks for raising this!

It would be nice if Pandas also supported "multi-column" quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would:

compute the quantiles for column c
using c's quantiles as an index, select the corresponding rows of column a

I may be misunderstanding, but I am fairly certain that this is not what cudf.DataFrame.quantiles does. Rather, it computes the coupled quantiles of all columns in the DataFrame (not the quantiles of the first column). There is no "indexing" workaround in pandas. The only workaround is to convert all columns to a single Series of tuples, which is very slow.

[EDIT] Ah - I guess this indexing trick sort-of works if you perform a lexicographical sort by all columns first?

charlesbluca · 2021-10-06T20:32:29Z

Ah yes @rjzamora raises a good point - my original example doesn't highlight this, but cuDF's multi-column quantiles does compute the quantiles for the dataframe after it is lexicographically sorted by all columns; this example makes that more obvious:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})

In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Apologies for the mislead @mzeitlin11 - it looks like in this case, we would need more than just the indices of a single column quantile operation to compute this.

Might have to restrict interpolation argument to lower, higher, nearest.

This is the exact restriction placed on cuDF's quantiles:

https://github.com/rapidsai/cudf/blob/68c56b7013e0a4e9cf4b420a11e476112a6655c0/python/cudf/cudf/core/dataframe.py#L5993-L6005

charlesbluca added the Enhancement and Needs Triage labels Oct 4, 2021

mzeitlin11 added the Compat and Numeric Operations labels and removed the Needs Triage label Oct 4, 2021

quasiben mentioned this issue Oct 5, 2021

Add fast path for multi-column sorting rapidsai/dask-sql#5

Closed

This was referenced Oct 7, 2021

ENH: Implement searchsorted for DataFrames #43907

Closed

ENH: Support DataFrame.searchsorted #42872

Open

charlesbluca mentioned this issue Nov 3, 2021

ENH: Implement multi-column DataFrame.quantiles #44301

Merged

4 tasks

jbrockmendel added the quantile label and removed the Numeric Operations label Dec 16, 2021

jreback added this to the 1.5 milestone Aug 12, 2022

jreback closed this as completed in #44301 Aug 17, 2022
