BUG: crash in df.groupby.rolling.mean with forward window indexer #43267
Comments
Yep, I see this too. Even the simpler
segfaults on #2. If I'm understanding the problem correctly, the indexer is being mutated. Before the first groupby, indexer.window_size is 1; after the first groupby, it's not there at all. After that, since FixedForwardWindowIndexer is a subclass of BaseIndexer, the first branch in RollingGroupby._get_window_indexer is taken, and we build a GroupbyIndexer which has window=0 and no window_size in indexer_kwargs any more, because it's been popped, and that leads to bad things. roll_mean isn't happy not getting an
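The mutation described above can be sketched in plain Python, without pandas. The class and function names below (`ForwardIndexer`, `build_group_indexer_buggy`, `build_group_indexer_safe`) are hypothetical stand-ins for illustration, not pandas internals; the only assumption taken from the comment is that the indexer's attribute dict is reused as keyword arguments and `window_size` is popped from it.

```python
class ForwardIndexer:
    """Hypothetical stand-in for a user-supplied window indexer."""

    def __init__(self, window_size):
        self.window_size = window_size


def build_group_indexer_buggy(indexer):
    # Reuses the indexer's own attribute dict, so popping a key
    # also deletes the attribute from the caller's object.
    kwargs = indexer.__dict__
    window_size = kwargs.pop("window_size", 0)
    return type(indexer), window_size, kwargs


def build_group_indexer_safe(indexer):
    # Works on a shallow copy, leaving the caller's object untouched.
    kwargs = dict(indexer.__dict__)
    window_size = kwargs.pop("window_size", 0)
    return type(indexer), window_size, kwargs


buggy = ForwardIndexer(window_size=1)
first_call = build_group_indexer_buggy(buggy)
second_call = build_group_indexer_buggy(buggy)  # window_size is gone by now

safe = ForwardIndexer(window_size=1)
build_group_indexer_safe(safe)
build_group_indexer_safe(safe)
```

After the first buggy call the object no longer has a `window_size` attribute, so the second call silently falls back to 0, which matches the "window=0 and no window_size" state described in the comment. Copying the dict before popping is one way to avoid this; whether the eventual fix does exactly that is not confirmed by this thread.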
BTW, I found that the indexer also has another bug.
With df.groupby, something goes wrong: there are unnecessary missing values. The number of missing values in the "mean" column depends on how the groups are divided (df.loc[:7,'A']=1). If you change the size of group 1, the number of missing values changes too, but it does not depend on the size of group 2.
@mIgLLL: yeah. It looks to me like the
and you can see why you get the five NaN values in the middle. The two arrays aren't even the same length. The logic in FixedForwardWindowIndexer.get_window_bounds is b0rked, and I'm pretty sure I know where and how to fix it. It'll work by accident in isolation, but not when combined with anything else.
@mzeitlin11: sure, I can take a run at it. It's been a while, but I think I still remember how to play. :)
(To be clear, there are two separate bugs here, although both ultimately result in incorrect
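For reference, here is a minimal pure-Python sketch of what forward-looking window bounds are expected to look like: one start and one end per row, with each end clipped to the number of values. The helper names `forward_window_bounds` and `forward_rolling_mean` are made up for this illustration, and this is not the pandas implementation of `get_window_bounds`.

```python
def forward_window_bounds(num_values, window_size):
    """Return (starts, ends) for windows covering [i, i + window_size)."""
    starts = list(range(num_values))
    # Clip each end so no window runs past the last value.
    ends = [min(start + window_size, num_values) for start in starts]
    return starts, ends


def forward_rolling_mean(values, window_size):
    """Mean over each forward window; NaN when a window is empty."""
    starts, ends = forward_window_bounds(len(values), window_size)
    result = []
    for start, end in zip(starts, ends):
        window = values[start:end]
        result.append(sum(window) / len(window) if window else float("nan"))
    return result


# The window is larger than the data: bounds still have one entry per row.
starts, ends = forward_window_bounds(3, 5)
means = forward_rolling_mean([1.0, 2.0, 3.0], 2)
```

Because both lists are built from the same range, they always have the same length, including when `window_size` is 0 or larger than `num_values`. Those are the kinds of edge cases where separately constructed start and end arrays can drift apart.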
Okay, I think I have a viable fix for both, and new tests have been added to verify the behaviour is sane in this and other cases.
@mzeitlin11: One thing I noticed in passing is that indices is misspelled as "indicies" in a bunch of places, unfortunately including one user-visible location, as an argument to GroupbyIndexer. Does this mean there'd need to be a deprecation period if we change it there, or is GroupbyIndexer sufficiently internal that it can simply be fixed?
Not sure; cc @mroeschke in case you have any thoughts here.
Only the indexers listed at https://pandas.pydata.org/pandas-docs/stable/reference/window.html#window-indexer are public, so the others, including GroupbyIndexer, can have the spelling mistake fixed. Thanks for looking into a fix.
The code sample did not segfault in pandas 1.1.5.
First bad commit: [a8e2f92] REF: Simplifying creation of window indexers internally (#37177)
Since not all users update on every pandas release, we could perhaps consider backporting the fix for the segfault if the two issues can be kept independent.
…43291)
* BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267)
* Undo whitespace changes
* Fix type annotation of GroupbyIndexer
* Ignore typing check for window_size since BaseIndexer is public
* Update objects.py
* Fix typing
* Update per comments
Co-authored-by: Jeff Reback
Co-authored-by: Matthew Roeschke
… and GroupbyIndexer (pandas-dev#43267)
…byIndexer (#43267) (#44061)
Co-authored-by: DSM
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
When you use df.groupby.rolling.mean with a forward-looking window indexer, the first call works, but calling df.groupby.rolling.mean again with the same indexer crashes.
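The reporter's original code sample is not preserved in this copy of the issue. The following is only an illustrative sketch of the call pattern described, using a made-up five-row frame and `FixedForwardWindowIndexer` from `pandas.api.indexers`. On affected versions (the report is against 1.3.1) the second call is the one expected to misbehave; on a release containing the fix, both calls should give the same result.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Made-up data for illustration; not the reporter's sample.
df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each window covers the current row and the next one within its group.
indexer = FixedForwardWindowIndexer(window_size=2)

# Reuse the same indexer object for two consecutive calls.
first = df.groupby("A")["B"].rolling(window=indexer, min_periods=1).mean()
second = df.groupby("A")["B"].rolling(window=indexer, min_periods=1).mean()

print(first)
print(first.equals(second))
```

Reusing one indexer object across calls is the key ingredient, since the problem discussed in the comments is that the first call alters the indexer's state.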
Expected Output
Output of pd.show_versions()
pandas : 1.3.1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1