-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR/BUG: DataFrame.stack including all null rows when stacking multiple levels #53515
Comments
For the deprecation process here, I don't see any way other than introduce a new argument that defaults to being backward compatible, deprecating the default, and then removing the argument after deprecation during 3.x. |
I was initially thinking we could keep an (opt-in) option for the current behaviour with I didn't look in detail at the implementations, but just a note that if you try
(the calculation of
+1 Long term, if we don't have this behaviour of introducing NAs from the stacking, I would also say that there is no need at all for a |
Indeed it was, and I've been trying to test it more thoroughly but issues with the current implementation of stack is getting in the way. E.g. with your example
In any case, the following runs now without raising:
The implementation for |
Yes - I think we can deprecate the keyword. We could use it for this deprecation by introducing |
Yes, that FWIW, the dataframe that I used above (with MI columns using from_product) is an example where
From a quick profile, it seems there is a lot of overhead in the |
I also opened a separate issue based on a quick exploration into why this step is so surprisingly slow: #53574 |
Thanks @jorisvandenbossche - I plan to look into your suggestion of enhancing perf here. But given the bug fixes and overall better behavior, not only with dtypes but also not creating unnecessary nulls, would you agree that the perf hit as-is is worth it? |
@jorisvandenbossche - thanks for the suggestion! I've updated the implementation and timings inplace. For your example in #53515 (comment), I now get:
Edit: It also seems to be working on inputs with a MultiIndex, though this needs a lot more testing |
There are two notable behaviors differences. I think both of these are okay.
|
Agreed that both behaviour changes look OK
Yes, that indeed sounds fine
And the user can always to a final |
This PR rewrites `DataFrame.stack()`. Adding support to stacking multiple levels in the dataframe. User can now specify one or more levels from the column names to stack. Example: ```python >>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('weight', 'pounds')]) >>> df_multi_level_cols1 = cudf.DataFrame([[1, 2], [2, 4]], ... index=['cat', 'dog'], ... columns=multicol1) >>> df_multi_level_cols1.stack(0) kg pounds cat weight 1 2 dog weight 2 4 >>> df_multi_level_cols1.stack([0, 1]) cat weight kg 1 pounds 2 dog weight kg 2 pounds 4 dtype: int64 ``` The implementation heavily uses pandas index methods on the column axis. This assumes that the width of the cudf column is limited. The combination of `len(level) > 1 and dropna=False` is currently unsupported. The corresponding behavior in pandas is due to be deprecated in 3.0. See pandas-dev/pandas#53515. closes #13739 Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Matthew Roeschke (https://github.com/mroeschke) URL: #13927
@rhshadrach im way behind on this discussion. was this not handled by future_stack? |
@jbrockmendel - the option has been added, but we still need to deprecation the old version. I plan to do this as part of 2.2 - that will close this issue. |
Ref: #53094 (comment)
From the link above,
DataFrame.stack
withdropna=False
will include column combinations that did not exist in the original DataFrame; the values are necessarily all null:I do not think it would be expected for this operation to essentially invent combinations of the stacked columns that did not exist in the original DataFrame. I propose we deprecate this behavior, and only level combinations that occur in the original DataFrame should be included in the result.
The current implementation of DataFrame.stack does not appear easily modified to enforce this deprecation. The following function (which is a bit ugly and could use quite some refinement) implements stack that would be enforced upon deprecation. The implementation is only for the case of multiple levels (so the input must have a MultiIndex).
Implementation of new_stack
It is
not as quitemore performant on two/three levels with NumPy dtypesButAnd is more performant on small data (for what it's worth); this issize=1000
with the same code above:and it more efficient when working with nullable arrays; this is with
size=100000
anddtype="Int64"
:In addition, it does not coerce dtypes unnecessarily as the current implementation does (ref: #51059, #17886). This implementation would close those two issues. E.g. this is from the top example (with
dropna=True
) that results in float64 today:In particular, when operating on Int64 dtypes, the current implementation results in object dtype whereas this implementation results in Int64.
cc @jorisvandenbossche @jbrockmendel @mroeschke
The text was updated successfully, but these errors were encountered: