ENH: Implement multi-column DataFrame.quantiles #44301

Merged
merged 45 commits into from
Aug 17, 2022
Changes from 2 commits
Commits
385cff4
First pass at multi-column quantiles
charlesbluca Nov 3, 2021
2c55c68
Update docstring
charlesbluca Nov 3, 2021
4aae08c
Add tests for table quantiles
charlesbluca Nov 16, 2021
8cc274f
Migrate quantiles code to quantile(method='table')
charlesbluca Nov 16, 2021
ee28436
Add handling for degenerate case
charlesbluca Nov 17, 2021
b45463b
Merge remote-tracking branch 'upstream/master' into multi-col-quantiles
charlesbluca Nov 17, 2021
259ff59
Fix incorrect assertion in test_quantile_multi
charlesbluca Nov 18, 2021
c074638
Improve non-numeric exclusion test
charlesbluca Nov 18, 2021
ec24040
Resolve test_quantile_box failures
charlesbluca Nov 18, 2021
3cc5d6d
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Jan 13, 2022
3b42472
Rename res to res_df
charlesbluca Jan 13, 2022
e6229c6
Resolve sparse test failures
charlesbluca Jan 13, 2022
54240eb
Remove try/except block to try and resolve new failures
charlesbluca Jan 13, 2022
cec798f
Check if tests resolve when we only use transpose to unwrap
charlesbluca Jan 14, 2022
04dbdfd
Add back in try / except block
charlesbluca Jan 14, 2022
7bf7d18
Use if / else instead of try / except
charlesbluca Jan 14, 2022
34f5c68
Merge branch 'main' into multi-col-quantiles
charlesbluca Feb 22, 2022
aded5dd
Merge branch 'main' into multi-col-quantiles
charlesbluca Mar 8, 2022
9e3c300
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 6, 2022
939c735
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 11, 2022
ae16a24
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 12, 2022
d601d4e
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 13, 2022
d058765
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 15, 2022
c495fea
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 18, 2022
1c411fa
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 18, 2022
a019f15
Merge branch 'main' into multi-col-quantiles
charlesbluca Jul 28, 2022
44d14a7
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Jul 29, 2022
c9dd92f
Use pytest fixture to parameterize TestDataFrameQuantile
mroeschke Jul 29, 2022
6fd8d49
Add tests validating arguments, remove unnecessary tolist()
mroeschke Jul 29, 2022
4ebab82
Add whatsnew note
mroeschke Jul 29, 2022
79789cd
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Jul 29, 2022
5a13fea
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 1, 2022
eae90bc
Add xfails for arraymanager
mroeschke Aug 1, 2022
b431cf0
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 9, 2022
58264ed
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 9, 2022
250222b
Add ignores
mroeschke Aug 9, 2022
85bb06a
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 10, 2022
9884d07
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 10, 2022
a5977dc
Improve assertin of test_quantile
mroeschke Aug 10, 2022
90de88e
Add xfail marker for arraymanager
mroeschke Aug 11, 2022
b9a10c8
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 12, 2022
9db6c26
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 15, 2022
0dee399
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 15, 2022
c46fcbc
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 16, 2022
016f81b
Fix typing again
mroeschke Aug 16, 2022
114 changes: 113 additions & 1 deletion pandas/core/frame.py
@@ -10368,7 +10368,7 @@ def quantile(
         interpolation: str = "linear",
     ):
         """
-        Return values at the given quantile over requested axis.
+        Return values at the given quantile over requested axis, per-column.

         Parameters
         ----------
@@ -10460,6 +10460,118 @@ def quantile(
         result = self._constructor(res)
         return result

+    def quantiles(
+        self,
+        q=0.5,
+        axis: Axis = 0,
+        numeric_only: bool = True,

cudf's DataFrame.quantiles doesn't support a numeric_only argument, so the effective default is numeric_only=False. Any chance we could modify the default here? Is this meant to align with quantile arguments?

Note that I can understand the argument for numeric_only=True, but it may add a bit of extra pain in Dask :)

Contributor Author

Yeah, I did this to align with the default for quantile. Happy to change it to False if that makes sense to the devs.
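
To make the trade-off between the two defaults concrete, here is a minimal pure-Python sketch (not pandas or cudf code). The helper `select_columns` and the sample table are hypothetical and exist only for illustration. The point is that `numeric_only=True` drops non-numeric columns before any rows are compared, while `numeric_only=False` keeps every column in the table.

```python
from datetime import date
from numbers import Number


def select_columns(table, numeric_only):
    """Return the columns that would take part in the quantile computation.

    `table` maps column names to lists of values. This is a hypothetical
    stand-in for the column-filtering step, not the pandas implementation.
    """
    if not numeric_only:
        # Keep every column, which is the behavior the reviewer describes for cudf.
        return dict(table)
    # Keep only columns whose values are all numbers.
    return {
        name: values
        for name, values in table.items()
        if all(isinstance(v, Number) for v in values)
    }


table = {
    "A": [1, 2],
    "B": [date(2010, 1, 1), date(2011, 1, 1)],
}

print(list(select_columns(table, numeric_only=True)))   # ['A']
print(list(select_columns(table, numeric_only=False)))  # ['A', 'B']
```

With a whole-row quantile, the filtering choice also changes which columns act as sort keys, so the two defaults can select different rows, not just different columns.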

+        interpolation: str = "nearest",
+    ):
+        """
+        Return values at the given quantile over requested axis for all columns.

Contributor

This should all be in the quantile method in algos, not here.

Contributor Author

For clarity, do you mean that this code should be in a new quantile method in algos that handles the table case, or in BlockManager.quantile, where it looks like the internal implementation of DataFrame.quantile resides?

+
+        Parameters
+        ----------
+        q : float or array-like, default 0.5 (50% quantile)
+            Value between 0 <= q <= 1, the quantile(s) to compute.
+        axis : {0, 1, 'index', 'columns'}, default 0
+            Equals 0 or 'index' for row-wise, 1 or 'columns' for column-wise.
+        numeric_only : bool, default True
+            If False, datetime and timedelta data will be included in the
+            quantile computation.
+        interpolation : {'lower', 'higher', 'nearest'}, default 'nearest'
+            This optional parameter specifies the interpolation method to use,
+            when the desired quantile lies between two data points `i` and `j`:
+
+            * lower: `i`.
+            * higher: `j`.
+            * nearest: `i` or `j` whichever is nearest.
+
+        Returns
+        -------
+        Series or DataFrame
+
+            If ``q`` is an array, a DataFrame will be returned where the
+              index is ``q``, the columns are the columns of self, and the
+              values are the quantiles.
+            If ``q`` is a float, a Series will be returned where the
+              index is the columns of self and the values are the quantiles.
+
+        See Also
+        --------
+        core.window.Rolling.quantile: Rolling quantile.
+        numpy.percentile: Numpy function to compute the percentile.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(np.array([[1, 10], [1, 2], [2, 100], [2, 50]]),
+        ...                   columns=['a', 'b'])
+        >>> df.quantiles(.1)
+        a    1
+        b    2
+        Name: 0.1, dtype: int64
+        >>> df.quantiles([.1, .5])
+             a   b
+        0.1  1   2
+        0.5  2  50
+
+        Specifying `numeric_only=False` will also compute the quantile of
+        datetime and timedelta data.
+
+        >>> df = pd.DataFrame({'A': [1, 2],
+        ...                    'B': [pd.Timestamp('2010'),
+        ...                          pd.Timestamp('2011')],
+        ...                    'C': [pd.Timedelta('1 days'),
+        ...                          pd.Timedelta('2 days')]})
+        >>> df.quantiles(0.5, numeric_only=False)
+        A                      1
+        B    2010-01-01 00:00:00
+        C        1 days 00:00:00
+        Name: 0.5, dtype: object
+        """
+        validate_percentile(q)
+
+        return_series = False
+        if not is_list_like(q):
+            return_series = True
+            q = [q]
+
+        q = Index(q, dtype=np.float64)
+        data = self._get_numeric_data() if numeric_only else self
+        axis = self._get_axis_number(axis)
+
+        if axis == 1:
+            data = data.T
+
+        if len(data.columns) == 0:
+            # GH#23925 _get_numeric_data may have dropped all columns
+            cols = Index([], name=self.columns.name)
+            if is_list_like(q):
+                return self._constructor([], index=q, columns=cols)
+            return self._constructor_sliced([], index=cols, name=q, dtype=np.float64)
+
+        q_idx = np.quantile(np.arange(len(data)), q, interpolation=interpolation)
+
+        by = data.columns.tolist()
+        if len(by) > 1:
+            keys = [data._get_label_or_level_values(x) for x in by]
+            indexer = lexsort_indexer(keys)
+        else:
+            by = by[0]
+            k = data._get_label_or_level_values(by)
+            indexer = nargsort(k)
+
+        res = data._mgr.take(indexer[q_idx], verify=False)
+
+        result = self._constructor(res)
+        if return_series:
+            result = result.T.iloc[:, 0]
+            result.name = q[0]
+        else:
+            result.index = q
+
+        return result
+
     @doc(NDFrame.asfreq, **_shared_doc_kwargs)
     def asfreq(
         self,
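
The row-selection logic in the body of `quantiles` can be sketched without pandas. This is a minimal illustration under stated assumptions, not the merged implementation: `table_quantile` is a hypothetical name, rows are plain tuples compared lexicographically (standing in for the lexsort over all columns), and 'nearest' is assumed to break ties by rounding half to even, which is what the docstring examples in the diff imply.

```python
import math


def table_quantile(rows, q, interpolation="nearest"):
    """Pick the whole row sitting at quantile `q` of the lexicographically sorted rows.

    Hypothetical helper for illustration only; `rows` is a non-empty list of tuples.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must satisfy 0 <= q <= 1")
    if not rows:
        raise ValueError("rows must be non-empty")

    # Sort by the first column, then the second, and so on.
    ordered = sorted(rows)

    # Fractional position of the quantile within the sorted row positions 0..n-1.
    pos = q * (len(ordered) - 1)

    if interpolation == "lower":
        idx = math.floor(pos)
    elif interpolation == "higher":
        idx = math.ceil(pos)
    elif interpolation == "nearest":
        # Python's round() rounds halves to the even integer (assumed tie rule).
        idx = round(pos)
    else:
        raise ValueError("interpolation must be 'lower', 'higher' or 'nearest'")

    return ordered[idx]


rows = [(1, 10), (1, 2), (2, 100), (2, 50)]
print(table_quantile(rows, 0.1))  # (1, 2)
print(table_quantile(rows, 0.5))  # (2, 50)
```

Because a whole row is selected, every returned tuple is an actual row of the input. That is the difference from the per-column `quantile`, where each column's value may come from a different row.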