ENH: Implement multi-column DataFrame.quantiles #44301

Merged 45 commits on Aug 17, 2022
Changes from 37 commits
Commits
385cff4
First pass at multi-column quantiles
charlesbluca Nov 3, 2021
2c55c68
Update docstring
charlesbluca Nov 3, 2021
4aae08c
Add tests for table quantiles
charlesbluca Nov 16, 2021
8cc274f
Migrate quantiles code to quantile(method='table')
charlesbluca Nov 16, 2021
ee28436
Add handling for degenerate case
charlesbluca Nov 17, 2021
b45463b
Merge remote-tracking branch 'upstream/master' into multi-col-quantiles
charlesbluca Nov 17, 2021
259ff59
Fix incorrect assertion in test_quantile_multi
charlesbluca Nov 18, 2021
c074638
Improve non-numeric exclusion test
charlesbluca Nov 18, 2021
ec24040
Resolve test_quantile_box failures
charlesbluca Nov 18, 2021
3cc5d6d
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Jan 13, 2022
3b42472
Rename res to res_df
charlesbluca Jan 13, 2022
e6229c6
Resolve sparse test failures
charlesbluca Jan 13, 2022
54240eb
Remove try/except block to try and resolve new failures
charlesbluca Jan 13, 2022
cec798f
Check if tests resolve when we only use transpose to unwrap
charlesbluca Jan 14, 2022
04dbdfd
Add back in try / except block
charlesbluca Jan 14, 2022
7bf7d18
Use if / else instead of try / except
charlesbluca Jan 14, 2022
34f5c68
Merge branch 'main' into multi-col-quantiles
charlesbluca Feb 22, 2022
aded5dd
Merge branch 'main' into multi-col-quantiles
charlesbluca Mar 8, 2022
9e3c300
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 6, 2022
939c735
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 11, 2022
ae16a24
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 12, 2022
d601d4e
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 13, 2022
d058765
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
charlesbluca Apr 15, 2022
c495fea
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 18, 2022
1c411fa
Merge branch 'main' into multi-col-quantiles
charlesbluca Apr 18, 2022
a019f15
Merge branch 'main' into multi-col-quantiles
charlesbluca Jul 28, 2022
44d14a7
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Jul 29, 2022
c9dd92f
Use pytest fixture to parameterize TestDataFrameQuantile
mroeschke Jul 29, 2022
6fd8d49
Add tests validating arguments, remove unnecessary tolist()
mroeschke Jul 29, 2022
4ebab82
Add whatsnew note
mroeschke Jul 29, 2022
79789cd
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Jul 29, 2022
5a13fea
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 1, 2022
eae90bc
Add xfails for arraymanager
mroeschke Aug 1, 2022
b431cf0
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 9, 2022
58264ed
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 9, 2022
250222b
Add ignores
mroeschke Aug 9, 2022
85bb06a
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 10, 2022
9884d07
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 10, 2022
a5977dc
Improve assertin of test_quantile
mroeschke Aug 10, 2022
90de88e
Add xfail marker for arraymanager
mroeschke Aug 11, 2022
b9a10c8
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 12, 2022
9db6c26
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 15, 2022
0dee399
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 15, 2022
c46fcbc
Merge remote-tracking branch 'upstream/main' into multi-col-quantiles
mroeschke Aug 16, 2022
016f81b
Fix typing again
mroeschke Aug 16, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -292,6 +292,7 @@ Other enhancements
- :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`)
- :meth:`RangeIndex.union` now can return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`)
- :meth:`DataFrame.compare` now accepts an argument ``result_names`` to allow the user to specify the result's names of both left and right DataFrame which are being compared. This is by default ``'self'`` and ``'other'`` (:issue:`44354`)
- :meth:`DataFrame.quantile` gained a ``method`` argument that can accept ``table`` to evaluate multi-column quantiles (:issue:`43881`)
- :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support a ``copy`` argument. If ``False``, the underlying data is not copied in the returned object (:issue:`47934`)

.. ---------------------------------------------------------------------------
71 changes: 67 additions & 4 deletions pandas/core/frame.py
@@ -80,7 +80,10 @@
npt,
)
from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
from pandas.compat.numpy import (
function as nv,
np_percentile_argname,
)
from pandas.util._decorators import (
Appender,
Substitution,
@@ -11148,6 +11151,7 @@ def quantile(
axis: Axis = 0,
numeric_only: bool | lib.NoDefault = no_default,
interpolation: str = "linear",
method: Literal["single", "table"] = "single",
):
"""
Return values at the given quantile over requested axis.
@@ -11176,6 +11180,10 @@ def quantile(
* higher: `j`.
* nearest: `i` or `j` whichever is nearest.
* midpoint: (`i` + `j`) / 2.
method : {'single', 'table'}, default 'single'
Whether to compute quantiles per-column ('single') or over all columns
('table'). When 'table', the only allowed interpolation methods are
'nearest', 'lower', and 'higher'.

Returns
-------
@@ -11205,6 +11213,17 @@ def quantile(
0.1 1.3 3.7
0.5 2.5 55.0

Specifying `method='table'` will compute the quantile over all columns.

>>> df.quantile(.1, method="table", interpolation="nearest")
a 1
b 1
Name: 0.1, dtype: int64
>>> df.quantile([.1, .5], method="table", interpolation="nearest")
a b
0.1 1 1
0.5 3 100

Specifying `numeric_only=False` will also compute the quantile of
datetime and timedelta data.

@@ -11229,9 +11248,17 @@ def quantile(
if not is_list_like(q):
# BlockManager.quantile expects listlike, so we wrap and unwrap here
res_df = self.quantile(
[q], axis=axis, numeric_only=numeric_only, interpolation=interpolation
[q],
axis=axis,
numeric_only=numeric_only,
interpolation=interpolation,
method=method,
)
res = res_df.iloc[0]
if method == "single":
res = res_df.iloc[0]
else:
# cannot directly iloc over sparse arrays
res = res_df.T.iloc[:, 0]
if axis == 1 and len(self) == 0:
# GH#41544 try to get an appropriate dtype
dtype = find_common_type(list(self.dtypes))
@@ -11259,7 +11286,43 @@ def quantile(
res = self._constructor([], index=q, columns=cols, dtype=dtype)
return res.__finalize__(self, method="quantile")

res = data._mgr.quantile(qs=q, axis=1, interpolation=interpolation)
valid_method = {"single", "table"}
Contributor


maybe I am not getting something, but why isn't this just
np.asarray(res_df).ravel() and then fed to the existing quantile routine?

Member


  1. I think that approach would not work for DataFrames with mixed dtypes
  2. For the limited set of interpolation methods supported (to start) in this PR, I think this approach is more performant as only quantile indices are calculated followed by a take.
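
The first point can be illustrated without pandas. What follows is a small analogy, not code from this PR: plain tuples stand in for DataFrame rows, and an int/str pair of columns is an assumed stand-in for mixed dtypes. Flattening pools cells of different types into one sequence that cannot be ordered, whereas ordering whole rows only ever compares values from the same column.

```python
# Hypothetical stand-in data: each tuple is one row with an int and a str column.
rows = [(2, "b"), (1, "z"), (2, "a")]

# Flattening every cell into one sequence mixes ints and strs.
flat = [cell for row in rows for cell in row]
try:
    sorted(flat)
    flatten_ok = True
except TypeError:
    # int and str cannot be compared with each other in Python 3.
    flatten_ok = False

# Ordering whole rows compares column 0 first, then column 1 on ties,
# so values are only ever compared within the same column.
ordered_rows = sorted(rows)

print(flatten_ok)
print(ordered_rows)
```

A flattened quantile would also return a single pooled scalar, while the table approach returns an entire row that exists in the data.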

if method not in valid_method:
raise ValueError(
f"Invalid method: {method}. Method must be in {valid_method}."
)
if method == "single":
res = data._mgr.quantile(qs=q, axis=1, interpolation=interpolation)
elif method == "table":
valid_interpolation = {"nearest", "lower", "higher"}
if interpolation not in valid_interpolation:
raise ValueError(
f"Invalid interpolation: {interpolation}. "
f"Interpolation must be in {valid_interpolation}"
)
# handle degenerate case
if len(data) == 0:
if data.ndim == 2:
dtype = find_common_type(list(self.dtypes))
else:
dtype = self.dtype
return self._constructor([], index=q, columns=data.columns, dtype=dtype)

q_idx = np.quantile( # type: ignore[call-overload]
np.arange(len(data)), q, **{np_percentile_argname: interpolation}
)

by = data.columns
if len(by) > 1:
keys = [data._get_label_or_level_values(x) for x in by]
indexer = lexsort_indexer(keys)
else:
by = by[0]
k = data._get_label_or_level_values(by) # type: ignore[arg-type]
indexer = nargsort(k)

res = data._mgr.take(indexer[q_idx], verify=False)
res.axes[1] = q

result = self._constructor(res)
return result.__finalize__(self, method="quantile")
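
The `method="table"` branch above can be summarized as: pick a row position from the quantile, sort the rows lexicographically, and take the row at that position. Below is a dependency-free sketch of that idea, assuming a single scalar `q`, no missing values, and rows given as tuples. The name `table_quantile` is hypothetical, and the "nearest" rule here uses Python's built-in `round` (round half to even), which is an assumption of this sketch rather than something the diff states.

```python
import math


def table_quantile(rows, q, interpolation="nearest"):
    """Return the row at quantile ``q`` after sorting rows lexicographically.

    Hypothetical helper for illustration only; it mirrors the shape of the
    'table' branch (position lookup, sort, take) but is not pandas code.
    """
    valid_interpolation = {"nearest", "lower", "higher"}
    if interpolation not in valid_interpolation:
        raise ValueError(f"Invalid interpolation: {interpolation}.")
    if not rows:
        # Degenerate case: nothing to take from.
        return None

    # Fractional position of the quantile among the row positions 0..n-1.
    pos = q * (len(rows) - 1)
    if interpolation == "lower":
        idx = math.floor(pos)
    elif interpolation == "higher":
        idx = math.ceil(pos)
    else:
        # Assumption: round half to even for 'nearest'.
        idx = round(pos)

    # Tuples sort by first element, then second, and so on.
    ordered = sorted(rows)
    return ordered[idx]


# Same values as the docstring example: columns a and b.
rows = [(1, 1), (2, 10), (3, 100), (4, 100)]
print(table_quantile(rows, 0.1))
print(table_quantile(rows, 0.5))
print(table_quantile(rows, 0.5, interpolation="lower"))
```

Because the result is always one of the input rows, averaging interpolations such as "linear" or "midpoint" have no natural meaning here, which is consistent with the restriction to 'nearest', 'lower', and 'higher' in the docstring.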