ENH: Implement multi-column DataFrame.quantiles
#44301
@TomAugspurger this is an attempt at trying to get multi-column sorting working in Dask which requires a multi-column quantile |
pandas/core/frame.py
Outdated
self,
q=0.5,
axis: Axis = 0,
numeric_only: bool = True,
cudf's DataFrame.quantiles doesn't support a numeric_only argument, so the effective default is numeric_only=False. Any chance we could modify the default here? Is this meant to align with quantile arguments?
Note that I can understand the argument for numeric_only=True, but it may add a bit of extra pain in Dask :)
Yeah, I did this to align with the default for quantile - happy to change to False if it makes sense to the devs
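To make the numeric_only discussion concrete, here is a minimal sketch using the existing DataFrame.quantile API. The frame and its column names are made up for illustration; the only claim is that numeric_only=True drops non-numeric columns from the result.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# With numeric_only=True the string column "b" is excluded,
# so the result only has an entry for "a".
median = df.quantile(0.5, numeric_only=True)
print(median)
```

With numeric_only=False the string column would take part in the computation, which is the behaviour cuDF's DataFrame.quantiles effectively has according to the comment above.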
I'm not sure about the name... I worry about having both a So if we do this, I'd suggest a name like |
@shwina, do you think we could alias |
Yes -- that should be OK for cuDF. I also like |
Cool! In that case I'm going to rename this method |
@charlesbluca this hasn't received any scrutiny yet. -1 on adding methods directly like this.
tests are the first thing that is needed
@jreback - Is your preference to add a new option to the existing |
i would add the argument |
pandas/core/frame.py
Outdated
interpolation: str = "nearest",
):
"""
Return values at the given quantile over requested axis for all columns.
this should all be in the quantile method in algos not here
For clarity, do you mean that this code should be in a new quantile method in algos that handles the table case, or in BlockManager.quantile, where it looks like the internal implementation of DataFrame.quantile resides?
Currently blocked on handling for sparse arrays - I am using
def test_quantile_sparse(self, df, expected):
# GH#17198
# GH#24600
> result = df.quantile(interpolation="nearest", method="table")
pandas/tests/frame/methods/test_quantile.py:591:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/frame.py:10475: in quantile
return res.iloc[0]
pandas/core/indexing.py:957: in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
pandas/core/indexing.py:1509: in _getitem_axis
return self.obj._ixs(key, axis=axis)
pandas/core/frame.py:3483: in _ixs
new_values = self._mgr.fast_xs(i)
pandas/core/internals/managers.py:974: in fast_xs
result = cls._empty((n,), dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = <class 'pandas.core.arrays.sparse.array.SparseArray'>, shape = (2,), dtype = Sparse[int64, 0]
    @classmethod
    def _empty(cls, shape: Shape, dtype: ExtensionDtype):
        """
        Create an ExtensionArray with the given shape and dtype.
        """
        obj = cls._from_sequence([], dtype=dtype)
        taker = np.broadcast_to(np.intp(-1), shape)
        result = obj.take(taker, allow_fill=True)
        if not isinstance(result, cls) or dtype != result.dtype:
>           raise NotImplementedError(
                f"Default 'empty' implementation is invalid for dtype='{dtype}'"
            )
E           NotImplementedError: Default 'empty' implementation is invalid for dtype='Sparse[int64, 0]'
pandas/core/arrays/base.py:1492: NotImplementedError
A few follow up questions here:
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
Still interested in working on this, currently blocked by the handling for sparse arrays, specifically if want to retain sparse dtypes for |
Thanks for pushing this along @mroeschke 😄 some small questions around the modified tests:
if method == "single":
    assert q["A"] == np.percentile(df["A"], 10)
Is this test asserting that the output columns are correct when method == "table"?
The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)
if method == "single":
    assert q["2000-01-17"] == np.percentile(df.loc["2000-01-17"], 90)
Is this test asserting that the output columns are correct when method == "table"?
The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)
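The difference between the two methods discussed in these test threads can be sketched as follows. This assumes a pandas version in which DataFrame.quantile accepts the method argument introduced by this PR; the frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: column "b" is not sorted the same way as column "a".
df = pd.DataFrame({"a": [1, 2, 3], "b": [30, 10, 20]})

# method="single": each column's quantile is computed independently.
single = df.quantile(0.5, interpolation="nearest", method="single")

# method="table": rows are ordered by all columns together and one whole
# row is picked, so the values in the result come from the same row.
table = df.quantile(0.5, interpolation="nearest", method="table")

print(single)
print(table)
```

Both results are Series indexed by the column labels, which is why the tests can compare the Series index in the table case even when the values differ from the per-column percentiles.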
@@ -11259,7 +11286,43 @@ def quantile(
res = self._constructor([], index=q, columns=cols, dtype=dtype)
return res.__finalize__(self, method="quantile")

res = data._mgr.quantile(qs=q, axis=1, interpolation=interpolation)
valid_method = {"single", "table"}
maybe i am not getting something, but why isn't this just np.asarray(res_df).ravel() and then feed to the existing quantile routine?
- I think that approach would not work for DataFrames with mixed dtypes
- For the limited set of interpolation methods supported (to start) in this PR, I think this approach is more performant, as only quantile indices are calculated, followed by a take.
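The "quantile indices followed by a take" idea from the second point can be sketched in plain Python. This is only an illustration, not the code in the PR: quantile_positions and take are hypothetical helpers, the rows are assumed to be sorted already, and ties in the "nearest" case follow Python's round (ties go to the even position).

```python
import math


def quantile_positions(n_rows, qs, interpolation="nearest"):
    """Map each q in [0, 1] to a row position in an already-sorted table."""
    if n_rows == 0:
        raise ValueError("cannot take a quantile of an empty table")
    positions = []
    for q in qs:
        if not 0 <= q <= 1:
            raise ValueError(f"q must be between 0 and 1, got {q}")
        virtual = q * (n_rows - 1)  # fractional position of the quantile
        if interpolation == "lower":
            positions.append(math.floor(virtual))
        elif interpolation == "higher":
            positions.append(math.ceil(virtual))
        elif interpolation == "nearest":
            positions.append(round(virtual))
        else:
            raise ValueError(f"unsupported interpolation: {interpolation}")
    return positions


def take(rows, positions):
    """Pick whole rows by position; no values are computed or cast."""
    return [rows[p] for p in positions]


# Mixed-dtype rows (int, str), already sorted by both fields.
sorted_rows = [(1, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a")]
positions = quantile_positions(len(sorted_rows), [0.0, 0.5, 1.0])
picked = take(sorted_rows, positions)
print(positions, picked)
```

Because only positions are computed and rows are picked as they are, the integer and string fields never have to share a dtype, which a flattened array would require.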
thanks @charlesbluca really nice! (and @mroeschke for pushing over the line) |
Rough attempt at implementing cuDF's DataFrame.quantiles; shares a lot of common logic with sort_values, as the indexer that sorts the dataframe by all columns is ultimately what is used to grab the desired quantiles.
cc @quasiben @rjzamora
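As a rough plain-Python illustration of that description (not the pandas or cuDF implementation), the indexer that sorts the rows by all columns can be built first and then used to look up the row at the requested quantile. The helpers all_columns_indexer and table_quantile are hypothetical names, and nearest-position selection is assumed.

```python
def all_columns_indexer(columns):
    """Return row positions that sort the table by every column, first column first."""
    n_rows = len(columns[0])
    return sorted(range(n_rows), key=lambda i: tuple(col[i] for col in columns))


def table_quantile(columns, q):
    """Return the whole row found at quantile q of the sorted table."""
    indexer = all_columns_indexer(columns)
    pos = round(q * (len(indexer) - 1))  # nearest position in sorted order
    row = indexer[pos]                   # position of that row in the original data
    return tuple(col[row] for col in columns)


# Column-oriented data, as in a DataFrame.
a = [2, 1, 2, 1, 3]
b = [5, 9, 1, 4, 7]

indexer = all_columns_indexer([a, b])
median_row = table_quantile([a, b], 0.5)
print(indexer, median_row)
```

The same indexer would reorder the whole table if applied to every row, which is the logic shared with sort_values; a quantile only needs the single entry of the indexer at the computed position.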