Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement multi-column DataFrame.quantiles #44301

Merged
merged 45 commits into from
Aug 17, 2022

Conversation

charlesbluca
Copy link
Contributor

Rough attempt at implementing cuDF's DataFrame.quantiles; shares a lot of common logic with sort_values, as the indexer that sorts the dataframe by all columns is ultimately what is used to grab the desired quantiles.

cc @quasiben @rjzamora

@charlesbluca charlesbluca changed the title Implement multi-column DataFrame.quantiles ENH: Implement multi-column DataFrame.quantiles Nov 3, 2021
@quasiben
Copy link

quasiben commented Nov 3, 2021

@TomAugspurger this is an attempt at trying to get multi-column sorting working in Dask which requires a multi-column quantile

self,
q=0.5,
axis: Axis = 0,
numeric_only: bool = True,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudf's DataFrame.quantiles doesn't support a numeric_only argument, so the effective default is numeric_only=False. Any chance we could modify the default here? Is this meant to align with quantile arguments?

Note that I can understand the argument for numeric_only=True, but it may add a bit of extra pain in Dask :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did this to align with the default for quantile - happy to change to False if it make sense to the devs

@TomAugspurger
Copy link
Contributor

I'm not sure about the name... I worry about having both a .quantile and quantiles, and I don't think that quantiles really conveys what's different about this method. To me the plural version sounds like you're able to compute multiple quantiles at once, which is already supported by passing a list to quantile.

So if we do this, I'd suggest a name like multi_column_quantile or multi_quantile or quantile_table (the last one mirrors a proposed API for groupby.agg on a table). Or perhaps we make this a keyword for quantile, like table=True or columnar=False? Because IIUC the output type (Series or Frame) and labels will be identical .quantile, the only difference is the values?

@quasiben
Copy link

quasiben commented Nov 4, 2021

@shwina, do you think we could alias quantiles in cuDF if we do end up changing the name here ?

@shwina
Copy link
Contributor

shwina commented Nov 4, 2021

Yes -- that should be OK for cuDF. I also like multi_quantile or quantile_table over quantiles

@charlesbluca
Copy link
Contributor Author

Cool! In that case I'm going to rename this method multi_quantile, and we can follow up with a rename / aliasing on cuDF?

@jreback
Copy link
Contributor

jreback commented Nov 4, 2021

@charlesbluca this hasn't received any scruity yet. -1 on adding methods directly like this.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are the first thing that is needed

@rjzamora
Copy link

rjzamora commented Nov 4, 2021

-1 on adding methods directly like this.

@jreback - Is your preference to add a new option to the existing quantile method. Or are you saying that down-stream libraries should be implementing logic like this themselves?

@jreback
Copy link
Contributor

jreback commented Nov 6, 2021

i would add the argument method='single|table' which is what we do for .rolling and friends

interpolation: str = "nearest",
):
"""
Return values at the given quantile over requested axis for all columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should all be in the quantile method in algos not here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, do you mean that this code should be in a new quantile method in algos that handles the table case, or in BlockManager.quantile, where it looks like the internal implementation of DataFrame.quantile resides?

@charlesbluca
Copy link
Contributor Author

Currently blocked on handling for sparse arrays - I am using BlockManager.take to perform the actual quantile operation, which retains sparse dtypes where BlockManager.quantile would not. This causes issues when handling non-list-like qs as an iloc across columns is required, which does not seem to be possible with sparse dtypes:

    def test_quantile_sparse(self, df, expected):
        # GH#17198
        # GH#24600
>       result = df.quantile(interpolation="nearest", method="table")

pandas/tests/frame/methods/test_quantile.py:591: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/frame.py:10475: in quantile
    return res.iloc[0]
pandas/core/indexing.py:957: in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
pandas/core/indexing.py:1509: in _getitem_axis
    return self.obj._ixs(key, axis=axis)
pandas/core/frame.py:3483: in _ixs
    new_values = self._mgr.fast_xs(i)
pandas/core/internals/managers.py:974: in fast_xs
    result = cls._empty((n,), dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pandas.core.arrays.sparse.array.SparseArray'>, shape = (2,), dtype = Sparse[int64, 0]

    @classmethod
    def _empty(cls, shape: Shape, dtype: ExtensionDtype):
        """
        Create an ExtensionArray with the given shape and dtype.
        """
        obj = cls._from_sequence([], dtype=dtype)
    
        taker = np.broadcast_to(np.intp(-1), shape)
        result = obj.take(taker, allow_fill=True)
        if not isinstance(result, cls) or dtype != result.dtype:
>           raise NotImplementedError(
                f"Default 'empty' implementation is invalid for dtype='{dtype}'"
            )
E           NotImplementedError: Default 'empty' implementation is invalid for dtype='Sparse[int64, 0]'

pandas/core/arrays/base.py:1492: NotImplementedError

A few follow up questions here:

  • Do we want quantile(method="table") to not retain sparse dtypes to match quantile(method="single")?
  • If not, are there any blockers to ilocing across sparse columns that can be addressed to unblock this PR?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 25, 2021
@charlesbluca
Copy link
Contributor Author

Still interested in working on this, currently blocked by the handling for sparse arrays, specifically if want to retain sparse dtypes for quantile(method="table")

@mroeschke mroeschke marked this pull request as ready for review July 29, 2022 23:25
@mroeschke mroeschke added this to the 1.5 milestone Jul 29, 2022
@mroeschke mroeschke removed the Stale label Aug 1, 2022
Copy link
Contributor Author

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this along @mroeschke 😄 some small questions around the modified tests:

Comment on lines 90 to 91
if method == "single":
assert q["A"] == np.percentile(df["A"], 10)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test asserting that the output columns are correct when method == "table"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)

Comment on lines 97 to 98
if method == "single":
assert q["2000-01-17"] == np.percentile(df.loc["2000-01-17"], 90)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test asserting that the output columns are correct when method == "table"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result will be Series objects, but I improved the assertions here to compare the entire Series results if the interpolation is linear (& method = single) or compare Series name + index if interpolation is nearest (& method = table)

@@ -11259,7 +11286,43 @@ def quantile(
res = self._constructor([], index=q, columns=cols, dtype=dtype)
return res.__finalize__(self, method="quantile")

res = data._mgr.quantile(qs=q, axis=1, interpolation=interpolation)
valid_method = {"single", "table"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i am not getting something, but why isn't this just
np.asarray(res_df).ravel() and then reed to the existing quantile routine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think that approach would not work for DataFrames with mixed dtypes
  2. For the limited set of interpolation methods supported (to start) in this PR, I think this approach is more performant as only quantile indices are calculated followed by a take.

@jreback jreback merged commit 3512e24 into pandas-dev:main Aug 17, 2022
@jreback
Copy link
Contributor

jreback commented Aug 17, 2022

thanks @charlesbluca really nice! (and @mroeschke for pushing over the line)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add support for multi-column quantiles of DataFrame
7 participants