Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

Closed
3 tasks
mIgLLL opened this issue Aug 28, 2021 · 8 comments · Fixed by #43291
Closed
3 tasks

BUG: crash in df.groupby.rolling.mean with forward window indexer #43267

mIgLLL opened this issue Aug 28, 2021 · 8 comments · Fixed by #43291
Labels
Bug Groupby Regression Functionality that used to work in a prior pandas version Segfault Non-Recoverable Error Window rolling, ewma, expanding
Milestone

Comments

@mIgLLL
Copy link

mIgLLL commented Aug 28, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here

Here is a replicate code:
df = pd.DataFrame({'B': [np.nan, 1, 2, np.nan, 4,5,6,7,8,9,10,11,12,13,14,15]})
df['A']=1
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=12) # forward 12 period
# First time is OK
df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']
# Second time would crash!
df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']

If don't have groupby, it won't crash.
If don't forward window indexer, it won't crash.
If in the first time, it won't crash.

Problem description

[this should explain why the current behaviour is a problem and why the expected output is a better solution]

When you use df.groupby.rolling.mean with a forward-looking period, it would crash if you run the df.groupby.rolling.mean again.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : c7f7443 python : 3.8.11.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : zh_CN LOCALE : Chinese (Simplified)_China.936

pandas : 1.3.1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
[paste the output of pd.show_versions() here leaving a blank line after the details tag]

image

@mIgLLL mIgLLL added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 28, 2021
@dsm054
Copy link
Contributor

dsm054 commented Aug 28, 2021

Yep, I see this too. Even the simpler

df = pd.DataFrame({"A": [1, 1], "B": [0, 1]})
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=1)
df.groupby('A')['B'].rolling(window=indexer).mean()
df.groupby('A')['B'].rolling(window=indexer).mean()

segfaults on #2.

If I'm understanding the problem correctly, the indexer is being mutated. Before the first groupby, indexer.window_size is 1; after the first groupby, it's not there at all.

After that, since FixedForwardWindowIndexer is a subclass of BaseIndexer, the first branch in RollingGroupby._get_window_indexer is taken, and we build a GroupbyIndexer which has window=0 and no window_size in indexer_kwargs any more, because it's been popped, and that leads to bad things.

roll_mean isn't happy not getting an end which matches start.

@mzeitlin11
Copy link
Member

Thanks for reporting this @mIgLLL! Your investigation makes sense @dsm054, would you be interested in making a pr to fix this?

@mzeitlin11 mzeitlin11 added Groupby Segfault Non-Recoverable Error Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 28, 2021
@mzeitlin11 mzeitlin11 added this to the Contributions Welcome milestone Aug 28, 2021
@mIgLLL
Copy link
Author

mIgLLL commented Aug 29, 2021

BTW, I find that the indexer also have another bug.

df = pd.DataFrame({'B': [np.nan, 1, 2, np.nan, 4,5,6,7,8,9,10,11,12,13,14,15,16,17]})
df.loc[:7,'A']=1
df.loc[7:,'A']=2
df['C']=range(18)

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=12) # forward 12 period

df['mean']=df.groupby('A')['B'].rolling(window=indexer, min_periods=1).mean().reset_index().loc[:,'B']

df=df.set_index('C')

reverse_mean=df[::-1].groupby('A')['B'].rolling(window=12, min_periods=1).mean()[::-1].reset_index().rename(columns={'B':'mean2'})

df=dd.merge(df,reverse_mean,on=['A','C'],how='left')

If there is a df.groupby, something went wrong! there is unnecessary missing values. And I find the number of missing value in "mean" column is related to the group diviation (df.loc[:7,'A']=1), If you change the size of group 1, the number of missing value also change! But it is not related to the size of group 2.

@dsm054
Copy link
Contributor

dsm054 commented Aug 29, 2021

@mIgLLL: yeah. It looks to me like the end value (the right side of the bounds it uses for each window) is wrong-- it winds up being too long, which means that the offset is wrong after recombination. We wind up asking roll_mean to act on the values between positions

start: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] 
end:   [ 7  7  7  7  7  7  7  7  7  7  7  7 18 18 18 18 18 18 18 18 18 18 18 18]

and you can see why you get the five NaN values in the middle. The two arrays aren't even the same length. The logic in FixedForwardWindowIndexer.get_window_bounds is b0rked, and I'm pretty sure I know where and how to fix it. It'll work by accident in isolation, but not when combined with anything else.

@mzeitlin11: sure, I can take a run at it. It's been a while, but I think I still remember how to play. :)

(To be clear, there are two separate bugs here, although both ultimately result in incorrect end arrays. I'm proposing to address both of them.)

@dsm054
Copy link
Contributor

dsm054 commented Aug 29, 2021

Okay, I think I have a viable fix for both, and new tests have been added to verify the behaviour is sane in this and other cases.

@mzeitlin11: One thing I noticed in passing is that indices is misspelled as "indicies" in a bunch of places, unfortunately including one user-visible location, as an argument to GroupbyIndexer. Does this mean there'd need to be a deprecation period if we change it there, or is GroupbyIndexer sufficiently internal that it can simply be fixed?

@mzeitlin11
Copy link
Member

@mzeitlin11: One thing I noticed in passing is that indices is misspelled as "indicies" in a bunch of places, unfortunately including one user-visible location, as an argument to GroupbyIndexer. Does this mean there'd need to be a deprecation period if we change it there, or is GroupbyIndexer sufficiently internal that it can simply be fixed?

Not sure, cc @mroeschke if any thoughts here

@mroeschke
Copy link
Member

Only the indexers here are public, https://pandas.pydata.org/pandas-docs/stable/reference/window.html#window-indexer, so other ones, including GroupbyIndexer, can have the spelling mistakes fixed.

Thanks for looking into a fix.

@simonjayhawkins
Copy link
Member

Okay, I think I have a viable fix for both, and new tests have been added to verify the behaviour is sane in this and other cases.

The code sample did not segfault in pandas 1.1.5

first bad commit: [a8e2f92] REF: Simplifying creation of window indexers internally (#37177)

since not all users update on every pandas release, we could perhaps consider backporting the fix for the segfault issue if the issues could be kept independent.

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 30, 2021
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.3.3 Aug 30, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
simonjayhawkins pushed a commit that referenced this issue Oct 16, 2021
…43291)

* BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267)

* Undo whitespace changes

* Fix type annotation of GroupbyIndexer

* Ignore typing check for window_size since BaseIndexer is public

* Update objects.py

* Fix typing

* Update per comments

Co-authored-by: Jeff Reback <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this issue Oct 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Regression Functionality that used to work in a prior pandas version Segfault Non-Recoverable Error Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants