ENH: Implement multi-column `DataFrame.quantiles` #44301

charlesbluca · 2021-11-03T15:08:09Z

Rough attempt at implementing cuDF's DataFrame.quantiles; shares a lot of common logic with sort_values, as the indexer that sorts the dataframe by all columns is ultimately what is used to grab the desired quantiles.

cc @quasiben @rjzamora

closes ENH: Add support for multi-column quantiles of DataFrame #43881
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry
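
The idea in the description, reusing the sort indexer to pick whole rows at quantile positions, can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the code from this PR: `table_quantile_rows` is a hypothetical helper, and the `method` keyword of `np.quantile` assumes NumPy 1.22 or newer.

```python
import numpy as np


def table_quantile_rows(columns, q, interpolation="nearest"):
    """Hypothetical helper: return, per column, the values of the rows that
    sit at the requested quantile positions of the row-wise (lexicographic) sort."""
    n = len(columns[0])
    # np.lexsort treats the LAST key as the primary one, so reverse the
    # column list to make the first column the primary sort key.
    order = np.lexsort(tuple(columns[::-1]))
    # Quantiles of the positions 0..n-1 give row positions in sorted order.
    pos = np.quantile(np.arange(n), q, method=interpolation).astype(np.intp)
    picked = order[pos]
    return [col[picked] for col in columns]


a = np.array([2, 1, 2, 1, 3])
b = np.array([9, 5, 7, 8, 6])

# Sorted rows are (1, 5), (1, 8), (2, 7), (2, 9), (3, 6);
# q = 0, 0.5, 1 selects the first, middle and last of them.
rows = table_quantile_rows([a, b], [0.0, 0.5, 1.0])
# rows[0] -> [1, 2, 3], rows[1] -> [5, 7, 6]
```

Each returned value is an existing row of the input, which is why only interpolations that land on an actual row (such as nearest) fit this approach.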

quasiben · 2021-11-03T15:24:47Z

@TomAugspurger this is an attempt to get multi-column sorting working in Dask, which requires a multi-column quantile

rjzamora · 2021-11-03T22:01:05Z

pandas/core/frame.py

+        self,
+        q=0.5,
+        axis: Axis = 0,
+        numeric_only: bool = True,


cudf's DataFrame.quantiles doesn't support a numeric_only argument, so the effective default is numeric_only=False. Any chance we could modify the default here? Is this meant to align with quantile arguments?

Note that I can understand the argument for numeric_only=True, but it may add a bit of extra pain in Dask :)

Yeah, I did this to align with the default for quantile - happy to change to False if it makes sense to the devs

TomAugspurger · 2021-11-04T14:11:59Z

I'm not sure about the name... I worry about having both a .quantile and quantiles, and I don't think that quantiles really conveys what's different about this method. To me the plural version sounds like you're able to compute multiple quantiles at once, which is already supported by passing a list to quantile.

So if we do this, I'd suggest a name like multi_column_quantile or multi_quantile or quantile_table (the last one mirrors a proposed API for groupby.agg on a table). Or perhaps we make this a keyword for quantile, like table=True or columnar=False? Because IIUC the output type (Series or Frame) and labels will be identical to .quantile, the only difference is the values?
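
To make the "same labels, different values" point concrete, here is a small sketch. It assumes pandas 1.5 or newer, where the option is exposed as `method="table"` on `DataFrame.quantile`; the frame itself is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [50, 10, 40, 20, 30]})

# Column-wise: each column's median is computed independently.
single = df.quantile(0.5, interpolation="nearest", method="single")

# Table-wise: rows are ordered by (a, b) and the middle row is returned whole.
table = df.quantile(0.5, interpolation="nearest", method="table")

# Both results are Series named 0.5 and indexed by the column labels;
# only the values differ: single -> a=3, b=30; table -> a=3, b=40.
```

The table result is always a row that exists in the frame, while the column-wise result may combine values from different rows.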

quasiben · 2021-11-04T14:43:05Z

@shwina, do you think we could alias quantiles in cuDF if we do end up changing the name here?

shwina · 2021-11-04T16:10:26Z

Yes -- that should be OK for cuDF. I also like multi_quantile or quantile_table over quantiles

charlesbluca · 2021-11-04T19:49:44Z

Cool! In that case I'm going to rename this method multi_quantile, and we can follow up with a rename / aliasing on cuDF?

jreback · 2021-11-04T20:05:17Z

@charlesbluca this hasn't received any scrutiny yet. -1 on adding methods directly like this.

jreback

tests are the first thing that is needed

rjzamora · 2021-11-04T20:13:58Z

> -1 on adding methods directly like this.

@jreback - Is your preference to add a new option to the existing quantile method? Or are you saying that downstream libraries should be implementing logic like this themselves?

jreback · 2021-11-06T23:47:44Z

i would add the argument method='single|table' which is what we do for .rolling and friends

jreback · 2021-11-06T23:48:33Z

pandas/core/frame.py

+        interpolation: str = "nearest",
+    ):
+        """
+        Return values at the given quantile over requested axis for all columns.


this should all be in the quantile method in algos not here

For clarity, do you mean that this code should be in a new quantile method in algos that handles the table case, or in BlockManager.quantile, where it looks like the internal implementation of DataFrame.quantile resides?

charlesbluca · 2021-11-24T16:28:23Z

Currently blocked on handling for sparse arrays - I am using BlockManager.take to perform the actual quantile operation, which retains sparse dtypes where BlockManager.quantile would not. This causes issues when handling non-list-like qs as an iloc across columns is required, which does not seem to be possible with sparse dtypes:

    def test_quantile_sparse(self, df, expected):
        # GH#17198
        # GH#24600
>       result = df.quantile(interpolation="nearest", method="table")

pandas/tests/frame/methods/test_quantile.py:591: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/frame.py:10475: in quantile
    return res.iloc[0]
pandas/core/indexing.py:957: in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
pandas/core/indexing.py:1509: in _getitem_axis
    return self.obj._ixs(key, axis=axis)
pandas/core/frame.py:3483: in _ixs
    new_values = self._mgr.fast_xs(i)
pandas/core/internals/managers.py:974: in fast_xs
    result = cls._empty((n,), dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pandas.core.arrays.sparse.array.SparseArray'>, shape = (2,), dtype = Sparse[int64, 0]

    @classmethod
    def _empty(cls, shape: Shape, dtype: ExtensionDtype):
        """
        Create an ExtensionArray with the given shape and dtype.
        """
        obj = cls._from_sequence([], dtype=dtype)
    
        taker = np.broadcast_to(np.intp(-1), shape)
        result = obj.take(taker, allow_fill=True)
        if not isinstance(result, cls) or dtype != result.dtype:
>           raise NotImplementedError(
                f"Default 'empty' implementation is invalid for dtype='{dtype}'"
            )
E           NotImplementedError: Default 'empty' implementation is invalid for dtype='Sparse[int64, 0]'

pandas/core/arrays/base.py:1492: NotImplementedError

A few follow up questions here:

- Do we want quantile(method="table") to not retain sparse dtypes to match quantile(method="single")?
- If not, are there any blockers to ilocing across sparse columns that can be addressed to unblock this PR?

github-actions · 2021-12-25T00:03:33Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

charlesbluca · 2022-01-12T17:48:44Z

Still interested in working on this; currently blocked by the handling for sparse arrays, specifically whether we want to retain sparse dtypes for quantile(method="table")

charlesbluca

Thanks for pushing this along @mroeschke 😄 some small questions around the modified tests:

charlesbluca · 2022-08-10T19:46:29Z

pandas/tests/frame/methods/test_quantile.py

+        if method == "single":
+            assert q["A"] == np.percentile(df["A"], 10)


Is this test asserting that the output columns are correct when method == "table"?

The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)

charlesbluca · 2022-08-10T19:46:51Z

pandas/tests/frame/methods/test_quantile.py

+        if method == "single":
+            assert q["2000-01-17"] == np.percentile(df.loc["2000-01-17"], 90)


Is this test asserting that the output columns are correct when method == "table"?

The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)

jreback · 2022-08-12T00:48:02Z

pandas/core/frame.py

@@ -11259,7 +11286,43 @@ def quantile(
            res = self._constructor([], index=q, columns=cols, dtype=dtype)
            return res.__finalize__(self, method="quantile")

-        res = data._mgr.quantile(qs=q, axis=1, interpolation=interpolation)
+        valid_method = {"single", "table"}


maybe i am not getting something, but why isn't this just np.asarray(res_df).ravel() and then feed to the existing quantile routine?

I think that approach would not work for DataFrames with mixed dtypes

For the limited set of interpolation methods supported (to start) in this PR, I think this approach is more performant as only quantile indices are calculated followed by a take.
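
A short illustration of the mixed-dtype concern, using a made-up frame rather than anything from the PR: flattening a mixed frame collapses it into a single object array, whereas selecting a row by position keeps each column's values together. The sort-then-`iloc` step below is a stand-in for the indexer-plus-take approach, not the merged implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [2, 1, 3], "label": ["b", "a", "c"]})

# Flattening mixes ints and strings into one 1-D object array, so the
# column boundaries and the numeric dtype of "key" are lost.
flat = np.asarray(df).ravel()

# Picking the middle row of the sorted frame by position keeps the row
# intact and leaves "key" as an integer column.
middle = df.sort_values(["key", "label"]).iloc[[1]]
```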

jreback · 2022-08-17T02:01:59Z

thanks @charlesbluca really nice! (and @mroeschke for pushing over the line)

First pass at multi-column quantiles

385cff4

charlesbluca changed the title ~~Implement multi-column DataFrame.quantiles~~ ENH: Implement multi-column DataFrame.quantiles Nov 3, 2021

Update docstring

2c55c68

rjzamora reviewed Nov 3, 2021

jreback requested changes Nov 4, 2021

jreback added the Enhancement label Nov 6, 2021

jreback requested changes Nov 6, 2021

charlesbluca added 7 commits November 16, 2021 07:11

Add tests for table quantiles

4aae08c

Migrate quantiles code to quantile(method='table')

8cc274f

Add handling for degenerate case

ee28436

Merge remote-tracking branch 'upstream/master' into multi-col-quantiles

b45463b

Fix incorrect assertion in test_quantile_multi

259ff59

Improve non-numeric exclusion test

c074638

Resolve test_quantile_box failures

ec24040

github-actions bot added the Stale label Dec 25, 2021

charlesbluca added 4 commits January 13, 2022 10:29

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

3cc5d6d

Rename res to res_df

3b42472

Resolve sparse test failures

e6229c6

Remove try/except block to try and resolve new failures

54240eb

mroeschke added 4 commits July 29, 2022 12:01

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

44d14a7

Use pytest fixture to parameterize TestDataFrameQuantile

c9dd92f

Add tests validating arguments, remove unnecessary tolist()

6fd8d49

Add whatsnew note

4ebab82

mroeschke marked this pull request as ready for review July 29, 2022 23:25

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

79789cd

mroeschke added this to the 1.5 milestone Jul 29, 2022

mroeschke added 2 commits August 1, 2022 15:18

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

5a13fea

Add xfails for arraymanager

eae90bc

mroeschke removed the Stale label Aug 1, 2022

mroeschke added 4 commits August 8, 2022 21:24

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

b431cf0

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

58264ed

Add ignores

250222b

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

85bb06a

charlesbluca commented Aug 10, 2022

mroeschke added 3 commits August 10, 2022 14:30

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

9884d07

Improve assertin of test_quantile

a5977dc

Add xfail marker for arraymanager

90de88e

jreback requested changes Aug 12, 2022

mroeschke added 5 commits August 12, 2022 15:05

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

b9a10c8

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

9db6c26

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

0dee399

Merge remote-tracking branch 'upstream/main' into multi-col-quantiles

c46fcbc

Fix typing again

016f81b

jreback approved these changes Aug 17, 2022

jreback merged commit 3512e24 into pandas-dev:main Aug 17, 2022

rjzamora mentioned this pull request Aug 19, 2022

[FEA] Add method argument to DataFrame.quantile rapidsai/cudf#11572

Closed

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

ENH: Implement multi-column DataFrame.quantiles (pandas-dev#44301)

4d75d37

charlesbluca mentioned this pull request Jan 18, 2023

add support for .sort_values dask/dask#958

Closed
