BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

mIgLLL · 2021-08-28T02:45:58Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here

Here is a replicate code:
df = pd.DataFrame({'B': [np.nan, 1, 2, np.nan, 4,5,6,7,8,9,10,11,12,13,14,15]})
df['A']=1
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=12) # forward 12 period
# First time is OK
df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']
# Second time would crash!
df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']

If don't have groupby, it won't crash.
If don't forward window indexer, it won't crash.
If in the first time, it won't crash.

Problem description

[this should explain why the current behaviour is a problem and why the expected output is a better solution]

When you use df.groupby.rolling.mean with a forward-looking period, it would crash if you run the df.groupby.rolling.mean again.

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit : c7f7443 python : 3.8.11.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : zh_CN LOCALE : Chinese (Simplified)_China.936

pandas : 1.3.1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
[paste the output of pd.show_versions() here leaving a blank line after the details tag]

The text was updated successfully, but these errors were encountered:

dsm054 · 2021-08-28T04:19:10Z

Yep, I see this too. Even the simpler

df = pd.DataFrame({"A": [1, 1], "B": [0, 1]})
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=1)
df.groupby('A')['B'].rolling(window=indexer).mean()
df.groupby('A')['B'].rolling(window=indexer).mean()

segfaults on #2.

If I'm understanding the problem correctly, the indexer is being mutated. Before the first groupby, indexer.window_size is 1; after the first groupby, it's not there at all.

After that, since FixedForwardWindowIndexer is a subclass of BaseIndexer, the first branch in RollingGroupby._get_window_indexer is taken, and we build a GroupbyIndexer which has window=0 and no window_size in indexer_kwargs any more, because it's been popped, and that leads to bad things.

roll_mean isn't happy not getting an end which matches start.

mzeitlin11 · 2021-08-28T17:39:10Z

Thanks for reporting this @mIgLLL! Your investigation makes sense @dsm054, would you be interested in making a pr to fix this?

mIgLLL · 2021-08-29T01:29:01Z

BTW, I find that the indexer also have another bug.

df = pd.DataFrame({'B': [np.nan, 1, 2, np.nan, 4,5,6,7,8,9,10,11,12,13,14,15,16,17]})
df.loc[:7,'A']=1
df.loc[7:,'A']=2
df['C']=range(18)

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=12) # forward 12 period

df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']

df=df.set_index('C')

reverse_mean=df[::-1].groupby('A')['B'].rolling(window=12, min_periods=1).mean()[::-1].reset_index().rename(columns={'B':'mean2'})

df=dd.merge(df,reverse_mean,on=['A','C'],how='left')

If there is a df.groupby, something went wrong! there is unnecessary missing values. And I find the number of missing value in "mean" column is related to the group diviation (df.loc[:7,'A']=1), If you change the size of group 1, the number of missing value also change! But it is not related to the size of group 2.

dsm054 · 2021-08-29T05:01:39Z

@mIgLLL: yeah. It looks to me like the end value (the right side of the bounds it uses for each window) is wrong-- it winds up being too long, which means that the offset is wrong after recombination. We wind up asking roll_mean to act on the values between positions

start: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] 
end:   [ 7  7  7  7  7  7  7  7  7  7  7  7 18 18 18 18 18 18 18 18 18 18 18 18]

and you can see why you get the five NaN values in the middle. The two arrays aren't even the same length. The logic in FixedForwardWindowIndexer.get_window_bounds is b0rked, and I'm pretty sure I know where and how to fix it. It'll work by accident in isolation, but not when combined with anything else.

@mzeitlin11: sure, I can take a run at it. It's been a while, but I think I still remember how to play. :)

(To be clear, there are two separate bugs here, although both ultimately result in incorrect end arrays. I'm proposing to address both of them.)

dsm054 · 2021-08-29T15:39:52Z

Okay, I think I have a viable fix for both, and new tests have been added to verify the behaviour is sane in this and other cases.

@mzeitlin11: One thing I noticed in passing is that indices is misspelled as "indicies" in a bunch of places, unfortunately including one user-visible location, as an argument to GroupbyIndexer. Does this mean there'd need to be a deprecation period if we change it there, or is GroupbyIndexer sufficiently internal that it can simply be fixed?

mzeitlin11 · 2021-08-29T16:40:43Z

@mzeitlin11: One thing I noticed in passing is that indices is misspelled as "indicies" in a bunch of places, unfortunately including one user-visible location, as an argument to GroupbyIndexer. Does this mean there'd need to be a deprecation period if we change it there, or is GroupbyIndexer sufficiently internal that it can simply be fixed?

Not sure, cc @mroeschke if any thoughts here

mroeschke · 2021-08-29T18:33:48Z

Only the indexers here are public, https://pandas.pydata.org/pandas-docs/stable/reference/window.html#window-indexer, so other ones, including GroupbyIndexer, can have the spelling mistakes fixed.

Thanks for looking into a fix.

…v#43267)

simonjayhawkins · 2021-08-30T10:57:37Z

Okay, I think I have a viable fix for both, and new tests have been added to verify the behaviour is sane in this and other cases.

The code sample did not segfault in pandas 1.1.5

first bad commit: [a8e2f92] REF: Simplifying creation of window indexers internally (#37177)

since not all users update on every pandas release, we could perhaps consider backporting the fix for the segfault issue if the issues could be kept independent.

…v#43267)

…43291) * BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) * Undo whitespace changes * Fix type annotation of GroupbyIndexer * Ignore typing check for window_size since BaseIndexer is public * Update objects.py * Fix typing * Update per comments Co-authored-by: Jeff Reback <[email protected]> Co-authored-by: Matthew Roeschke <[email protected]>

… and GroupbyIndexer (pandas-dev#43267)

…byIndexer (#43267) (#44061) Co-authored-by: DSM <[email protected]>

mIgLLL added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 28, 2021

mzeitlin11 added Groupby Segfault Non-Recoverable Error Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 28, 2021

mzeitlin11 added this to the Contributions Welcome milestone Aug 28, 2021

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

11bf344

…v#43267)

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

5c5a1f2

…v#43267)

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

2755500

…v#43267)

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

5900e00

…v#43267)

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 30, 2021

code sample for pandas-dev#43267

1e4e380

simonjayhawkins linked a pull request Aug 30, 2021 that will close this issue

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) #43291

Merged

1 task

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 30, 2021

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

9fc7317

…v#43267)

simonjayhawkins mentioned this issue Aug 30, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) #43291

Merged

1 task

simonjayhawkins modified the milestones: Contributions Welcome, 1.3.3 Aug 30, 2021

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

dsm054 added a commit to dsm054/pandas that referenced this issue Sep 13, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

4b9a480

…v#43267)

dsm054 added a commit to dsm054/pandas that referenced this issue Sep 13, 2021

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (pandas-de…

928ba00

…v#43267)

simonjayhawkins closed this as completed in #43291 Oct 16, 2021

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this issue Oct 16, 2021

Backport PR pandas-dev#43291: BUG: Fixes to FixedForwardWindowIndexer…

11bc77d

… and GroupbyIndexer (pandas-dev#43267)

simonjayhawkins added a commit that referenced this issue Oct 16, 2021

Backport PR #43291: BUG: Fixes to FixedForwardWindowIndexer and Group…

a326408

…byIndexer (#43267) (#44061) Co-authored-by: DSM <[email protected]>

erezinman mentioned this issue Nov 15, 2021

BUG: Segmentation-fault during a "custom" rolling operation #44470

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

mIgLLL commented Aug 28, 2021

dsm054 commented Aug 28, 2021 •

edited

Loading

mzeitlin11 commented Aug 28, 2021

mIgLLL commented Aug 29, 2021

dsm054 commented Aug 29, 2021 •

edited

Loading

dsm054 commented Aug 29, 2021

mzeitlin11 commented Aug 29, 2021

mroeschke commented Aug 29, 2021

simonjayhawkins commented Aug 30, 2021

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

Comments

mIgLLL commented Aug 28, 2021

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

dsm054 commented Aug 28, 2021 • edited Loading

mzeitlin11 commented Aug 28, 2021

mIgLLL commented Aug 29, 2021

dsm054 commented Aug 29, 2021 • edited Loading

dsm054 commented Aug 29, 2021

mzeitlin11 commented Aug 29, 2021

mroeschke commented Aug 29, 2021

simonjayhawkins commented Aug 30, 2021

Output of `pd.show_versions()`

dsm054 commented Aug 28, 2021 •

edited

Loading

dsm054 commented Aug 29, 2021 •

edited

Loading