ENH: Implement multi-column DataFrame.quantiles
#44301
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10368,7 +10368,7 @@ def quantile( | |
interpolation: str = "linear", | ||
): | ||
""" | ||
Return values at the given quantile over requested axis. | ||
Return values at the given quantile over requested axis, per-column. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -10460,6 +10460,118 @@ def quantile( | |
result = self._constructor(res) | ||
return result | ||
|
||
def quantiles( | ||
self, | ||
q=0.5, | ||
axis: Axis = 0, | ||
numeric_only: bool = True, | ||
interpolation: str = "nearest", | ||
): | ||
""" | ||
Return values at the given quantile over requested axis for all columns. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this should all be in the quantile method in algos not here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For clarity, do you mean that this code should be in a new |
||
|
||
Parameters | ||
---------- | ||
q : float or array-like, default 0.5 (50% quantile) | ||
Value between 0 <= q <= 1, the quantile(s) to compute. | ||
axis : {0, 1, 'index', 'columns'}, default 0 | ||
Equals 0 or 'index' for row-wise, 1 or 'columns' for column-wise. | ||
numeric_only : bool, default True | ||
If False, datetime and timedelta data will be included in the | ||
quantile computation. | ||
interpolation : {'lower', 'higher', 'nearest'}, default 'nearest' | ||
This optional parameter specifies the interpolation method to use, | ||
when the desired quantile lies between two data points `i` and `j`: | ||
|
||
* lower: `i`. | ||
* higher: `j`. | ||
* nearest: `i` or `j` whichever is nearest. | ||
|
||
Returns | ||
------- | ||
Series or DataFrame | ||
|
||
If ``q`` is an array, a DataFrame will be returned where the | ||
index is ``q``, the columns are the columns of self, and the | ||
values are the quantiles. | ||
If ``q`` is a float, a Series will be returned where the | ||
index is the columns of self and the values are the quantiles. | ||
|
||
See Also | ||
-------- | ||
core.window.Rolling.quantile: Rolling quantile. | ||
numpy.percentile: Numpy function to compute the percentile. | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame(np.array([[1, 10], [1, 2], [2, 100], [2, 50]]), | ||
... columns=['a', 'b']) | ||
>>> df.quantiles(.1) | ||
a 1 | ||
b 2 | ||
Name: 0.1, dtype: int64 | ||
>>> df.quantiles([.1, .5]) | ||
a b | ||
0.1 1 2 | ||
0.5 2 50 | ||
|
||
Specifying `numeric_only=False` will also compute the quantile of | ||
datetime and timedelta data. | ||
|
||
>>> df = pd.DataFrame({'A': [1, 2], | ||
... 'B': [pd.Timestamp('2010'), | ||
... pd.Timestamp('2011')], | ||
... 'C': [pd.Timedelta('1 days'), | ||
... pd.Timedelta('2 days')]}) | ||
>>> df.quantiles(0.5, numeric_only=False) | ||
A 1 | ||
B 2010-01-01 00:00:00 | ||
C 1 days 00:00:00 | ||
Name: 0.5, dtype: object | ||
""" | ||
validate_percentile(q) | ||
|
||
return_series = False | ||
if not is_list_like(q): | ||
return_series = True | ||
q = [q] | ||
|
||
q = Index(q, dtype=np.float64) | ||
data = self._get_numeric_data() if numeric_only else self | ||
axis = self._get_axis_number(axis) | ||
|
||
if axis == 1: | ||
data = data.T | ||
|
||
if len(data.columns) == 0: | ||
# GH#23925 _get_numeric_data may have dropped all columns | ||
cols = Index([], name=self.columns.name) | ||
if is_list_like(q): | ||
return self._constructor([], index=q, columns=cols) | ||
return self._constructor_sliced([], index=cols, name=q, dtype=np.float64) | ||
|
||
q_idx = np.quantile(np.arange(len(data)), q, interpolation=interpolation) | ||
|
||
by = data.columns.tolist() | ||
if len(by) > 1: | ||
keys = [data._get_label_or_level_values(x) for x in by] | ||
indexer = lexsort_indexer(keys) | ||
else: | ||
by = by[0] | ||
k = data._get_label_or_level_values(by) | ||
indexer = nargsort(k) | ||
|
||
res = data._mgr.take(indexer[q_idx], verify=False) | ||
|
||
result = self._constructor(res) | ||
if return_series: | ||
result = result.T.iloc[:, 0] | ||
result.name = q[0] | ||
else: | ||
result.index = q | ||
|
||
return result | ||
|
||
@doc(NDFrame.asfreq, **_shared_doc_kwargs) | ||
def asfreq( | ||
self, | ||
|
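The approach taken in the diff (order the rows lexicographically by all columns, then take the rows at the quantile positions) can be sketched with public pandas and NumPy calls only. This is a minimal illustration, not the PR's implementation: `multi_column_quantiles` is a hypothetical helper name, `sort_values` stands in for the internal `lexsort_indexer`/`nargsort` calls, and the rounding rules below are an assumption about how the three interpolation modes map to row positions.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q, interpolation="nearest"):
    """Hypothetical helper: pick whole rows at quantile positions.

    Rows are ordered lexicographically by all columns, so each result
    row is an actual row of the input rather than a per-column mix.
    """
    qs = np.atleast_1d(np.asarray(q, dtype=np.float64))
    if ((qs < 0) | (qs > 1)).any():
        raise ValueError("quantiles must be in the range [0, 1]")

    # Fractional position of each quantile among the sorted rows.
    pos = qs * (len(df) - 1)
    if interpolation == "lower":
        idx = np.floor(pos)
    elif interpolation == "higher":
        idx = np.ceil(pos)
    elif interpolation == "nearest":
        idx = np.rint(pos)  # assumption: ties round to the even position
    else:
        raise ValueError(f"unsupported interpolation: {interpolation!r}")
    idx = idx.astype(np.intp)

    # Lexicographic ordering over every column, in column order.
    ordered = df.sort_values(by=list(df.columns))
    return ordered.iloc[idx].set_axis(pd.Index(qs), axis=0)


df = pd.DataFrame(
    np.array([[1, 10], [1, 2], [2, 100], [2, 50]]), columns=["a", "b"]
)
result = multi_column_quantiles(df, [0.1, 0.5])
print(result)
```

On the frame from the docstring example, the sorted rows are (1, 2), (1, 10), (2, 50), (2, 100), so the 0.1 and 0.5 quantiles select the rows (1, 2) and (2, 50), in line with the docstring output.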
cudf's `DataFrame.quantiles` doesn't support a `numeric_only` argument, so the effective default is `numeric_only=False`. Any chance we could modify the default here? Is this meant to align with `quantile` arguments?

Note that I can understand the argument for `numeric_only=True`, but it may add a bit of extra pain in Dask :)

Yeah, I did this to align with the default for `quantile` - happy to change to `False` if it makes sense to the devs
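
To make the effect of the default concrete, here is a small sketch of which columns would take part in the row ordering under each setting. It is an illustration under assumptions: `columns_for_quantiles` is a hypothetical helper, and `pd.api.types.is_numeric_dtype` stands in for the private `_get_numeric_data` call used in the diff, so the two may differ on edge cases such as boolean columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def columns_for_quantiles(frame, numeric_only):
    """Hypothetical helper: columns that would be used for the ordering."""
    if not numeric_only:
        return list(frame.columns)
    # Datetime and timedelta columns are not numeric dtypes, so they drop out.
    return [col for col in frame.columns if is_numeric_dtype(frame[col])]


df = pd.DataFrame(
    {
        "A": [1, 2],
        "B": [pd.Timestamp("2010"), pd.Timestamp("2011")],
        "C": [pd.Timedelta("1 days"), pd.Timedelta("2 days")],
    }
)

print(columns_for_quantiles(df, numeric_only=True))   # ['A']
print(columns_for_quantiles(df, numeric_only=False))  # ['A', 'B', 'C']
```

With `numeric_only=True` the datetime and timedelta columns are silently excluded from both the ordering and the result, which is the behavior a caller coming from cudf would not expect.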