ENH: Add support for multi-column quantiles of DataFrame #43881
Comments
Thanks for the request @charlesbluca! At first glance, this API seems somewhat confusing because it sounds like a combination of 2 distinct ops -> computing the quantile on the first arg of the list, then an indexing operation with other included columns. For the average user, an alternative API might be one which allows returning the indices of the requested quantiles (analogous to the relationship between
One other question here would be how to handle cases where quantiles don't evenly line up with values (so there is no corresponding index). Might have to restrict
Another consideration would be how much faster a specific implementation would be compared to the alternative of just finding the indices with an equality check.
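For reference, the equality-check alternative mentioned here could look roughly like the sketch below. The helper name `quantile_indices` is hypothetical, and choosing `interpolation="lower"` is an assumption made so that the quantile is always an actual element of the column (which sidesteps the "no corresponding index" case raised above).

```python
import pandas as pd


def quantile_indices(s: pd.Series, q: float) -> pd.Index:
    """Hypothetical helper: index labels whose value equals the q-th quantile."""
    # Assumption: "lower" interpolation, so the quantile is an existing value of s.
    value = s.quantile(q, interpolation="lower")
    # Plain equality check; ties return every matching label.
    return s.index[s == value]
```

Note that ties return more than one label, so a single-column index lookup alone does not pick a unique row.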
Sorry for taking so long to chime in here @charlesbluca. Thanks for raising this!
I may be misunderstanding, but I am fairly certain that this is not what [EDIT] Ah - I guess this indexing trick sort of works if you perform a lexicographical sort by all columns first?
Ah yes @rjzamora raises a good point - my original example doesn't highlight this, but cuDF's multi-column

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]:
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Apologies for the mislead @mzeitlin11 - it looks like in this case, we would need more than just the indices of a single column.
This is the exact restriction placed on cuDF's
Is your feature request related to a problem?
For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a.
It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe.
Describe the solution you'd like
I imagine the addition of multi-column quantiles support could happen in two ways:
DataFrame.quantile to specify whether or not we want multi-column quantiles
quantile
In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.
API breaking implications
I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.
Describe alternatives you've considered
This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
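A minimal sketch of that sort-then-index workaround, for illustration only. The function name `multi_column_quantiles` is hypothetical, and the nearest-rank rule used to turn a quantile into a row position is an assumption (no interpolation between rows); it is not claimed to match cuDF's implementation in every case.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df: pd.DataFrame, q, columns=None) -> pd.DataFrame:
    """Hypothetical helper: rows at quantile positions of the lexicographically sorted frame."""
    columns = list(df.columns) if columns is None else list(columns)
    q = np.atleast_1d(np.asarray(q, dtype=float))
    if df.empty:
        raise ValueError("cannot take quantiles of an empty frame")
    if ((q < 0) | (q > 1)).any():
        raise ValueError("quantiles must lie in [0, 1]")

    # Lexicographic sort: first column is the primary key, later columns break ties.
    sorted_df = df[columns].sort_values(by=columns, ignore_index=True)

    # Assumption: nearest-rank positions (halves round up), so each result is a real row.
    positions = np.floor(q * (len(sorted_df) - 1) + 0.5).astype(int)

    result = sorted_df.iloc[positions]
    result.index = pd.Index(q)
    return result
```

On the frame from the cuDF example earlier in this thread, `multi_column_quantiles(df, [0, 0.5, 1])` returns the rows with `b` equal to 1, 3 and 5. The full sort costs O(n log n), which is the overhead a dedicated implementation would presumably try to avoid.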
Additional context
If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
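To make the idea concrete, here is a rough sketch of what a multi-column searchsorted might do when assigning rows to output partitions. The name `multi_column_searchsorted` is hypothetical, the row-by-row tuple comparison is chosen for clarity rather than speed, and this is not dask-cudf's actual algorithm. It assumes the boundary rows are already sorted lexicographically and contain no missing values.

```python
import bisect

import numpy as np
import pandas as pd


def multi_column_searchsorted(boundaries: pd.DataFrame, values: pd.DataFrame,
                              side: str = "left") -> np.ndarray:
    """Hypothetical helper: insertion position of each row of `values` among `boundaries`."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    # Python tuples compare lexicographically, matching a multi-column sort order.
    keys = list(boundaries.itertuples(index=False, name=None))
    search = bisect.bisect_left if side == "left" else bisect.bisect_right
    # Align the columns of `values` with the boundary columns before comparing.
    rows = values[list(boundaries.columns)].itertuples(index=False, name=None)
    return np.array([search(keys, row) for row in rows], dtype=int)
```

With multi-column quantiles as the boundary rows, the returned positions could serve as partition numbers: each row is sent to the partition given by its position, and every partition is then sorted locally.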