Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: centered moving windows mishandle series edges #2953

Closed
bmu opened this issue Mar 1, 2013 · 20 comments
Closed

BUG: centered moving windows mishandle series edges #2953

bmu opened this issue Mar 1, 2013 · 20 comments
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations Window rolling, ewma, expanding

Comments

@bmu
Copy link

bmu commented Mar 1, 2013

Seems like there is a bug in the centered mooving window functions, although the examples here don't show it.
Also np.convolve used like in this post on Stackoverflow shows different results for the first values:

In [109]: ser = pd.Series(randn(20), index=pd.date_range('1/1/2000', periods=20))

In [110]: df = pd.DataFrame(ser)

In [111]: df['rc'] = pd.rolling_count(ser, 10, center=True)

In [112]: df['rm'] = pd.rolling_mean(ser, 10, center=True)

In [113]: df['rm_min_1'] = pd.rolling_mean(ser, 10, center=True, min_periods=1)

In [114]: def movingaverage(interval, window_size):
   .....:         window = numpy.ones(int(window_size))/float(window_size)
   .....:         return numpy.convolve(interval, window, 'same')
   .....: 

In [116]: df['ma'] = movingaverage(ser, 10)

In [117]: df
Out[117]: 
                   0  rc        rm  rm_min_1        ma
2000-01-01  1.177692   5       NaN  0.048306  0.024153
2000-01-02 -0.068567   6       NaN -0.094287 -0.056572
2000-01-03  0.194942   7       NaN -0.225103 -0.157572
2000-01-04 -0.179778   8       NaN -0.160040 -0.128032
2000-01-05 -0.882760   9       NaN -0.262176 -0.235959
2000-01-06 -0.807252  10 -0.250828 -0.250828 -0.250828
2000-01-07 -1.009999  10 -0.365703 -0.365703 -0.365703
2000-01-08  0.295402  10 -0.431110 -0.431110 -0.431110
2000-01-09 -1.079267  10 -0.408968 -0.408968 -0.408968
2000-01-10 -0.148689  10 -0.456946 -0.456946 -0.456946
2000-01-11  0.028935  10 -0.290453 -0.290453 -0.290453
2000-01-12 -0.722632  10 -0.204166 -0.204166 -0.204166
2000-01-13  0.416364  10 -0.175837 -0.175837 -0.175837
2000-01-14 -0.659560  10 -0.266773 -0.266773 -0.266773
2000-01-15  0.782165  10 -0.184003 -0.184003 -0.184003
2000-01-16  0.055617  10 -0.338359 -0.338359 -0.338359
2000-01-17 -0.726706   0       NaN       NaN -0.341252
2000-01-18 -0.613954   0       NaN       NaN -0.268989
2000-01-19 -0.251574   0       NaN       NaN -0.310625
2000-01-20 -1.692243   0       NaN       NaN -0.244669
In [5]: import numpy

In [6]: In [20]: ser = pd.Series(range(10))
   ...:     ...: df = pd.DataFrame(ser)
   ...:     ...: N=3
   ...:     ...: df['rc'] = ser.rolling(N, center=True).count()
   ...:     ...: df['rm'] = ser.rolling(N, center=True).mean()
   ...:     ...: df['rm_min_1'] = ser.rolling(N, center=True, min_periods=1).mean()
   ...:     ...: def movingaverage(interval, window_size):
   ...:     ...:     window = numpy.ones(int(window_size))/float(window_size)
   ...:     ...:     return numpy.convolve(interval, window, 'same')
   ...:     ...:
   ...:     ...: df['ma'] = movingaverage(ser, N)

In [7]: df
Out[7]:
   0   rc   rm  rm_min_1        ma
0  0  2.0  NaN       0.5  0.333333
1  1  3.0  1.0       1.0  1.000000
2  2  3.0  2.0       2.0  2.000000
3  3  3.0  3.0       3.0  3.000000
4  4  3.0  4.0       4.0  4.000000
5  5  3.0  5.0       5.0  5.000000
6  6  3.0  6.0       6.0  6.000000
7  7  3.0  7.0       7.0  7.000000
8  8  3.0  8.0       8.0  8.000000
9  9  2.0  NaN       8.5  5.666667
@ghost ghost assigned changhiskhan Mar 6, 2013
@ghost
Copy link

ghost commented Mar 13, 2013

Let's try this with just a straight integer range as input data:

In [20]: ser = pd.Series(range(10))
    ...: df = pd.DataFrame(ser)
    ...: N=3
    ...: df['rc'] = pd.rolling_count(ser, N, center=True)
    ...: df['rm'] = pd.rolling_mean(ser, N, center=True)
    ...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
    ...: def movingaverage(interval, window_size):
    ...:     window = numpy.ones(int(window_size))/float(window_size)
    ...:     return numpy.convolve(interval, window, 'same')
    ...: 
    ...: df['ma'] = movingaverage(ser, N)
    ...: df
Out[20]: 
   0  rc  rm  rm_min_1        ma
0  0   2 NaN       0.5  0.333333
1  1   3   1       1.0  1.000000
2  2   3   2       2.0  2.000000
3  3   3   3       3.0  3.000000
4  4   3   4       4.0  4.000000
5  5   3   5       5.0  5.000000
6  6   3   6       6.0  6.000000
7  7   3   7       7.0  7.000000
8  8   3   8       8.0  8.000000
9  9   0 NaN       NaN  5.666667

In [21]: 

numpy and pandas agree on points away from the boundary,
and the numpy docstring warns you about boundary effects,
behavior there being a matter of convention rather then definition.

This looks as it should be to me, can you be more specific
about the behaviour you expected to see?

@bmu
Copy link
Author

bmu commented Mar 13, 2013

I would expect that df.ix[9,'rc'] would be 2 and that df.ix[9, 'rm_min_1'] would be 8.5.

Basically my problem is, that the window size is not computed as expected at the end of the series.
As I have nan's in my data (that's the reason for the min_period in my previous example) I include an nan here:

In [55]: ser = pd.Series(range(10), dtype='float')

In [56]: ser[8] = np.nan

In [57]: pd.rolling_count(ser, 5, center=True)
Out[57]: 
0    3
1    4
2    5
3    5
4    5
5    5
6    4
7    4
8    0
9    0

If the window would be centred around the index I would expect the same results as for ser.ix[i-2:i+2] in
the case of a window size of 5:

In [47]: for i in range(10):
    print 'Count for ser[%2d:%2d]: %d' % (i - 2, i + 2, ser.ix[i-2:i+2].count())
   ....:     
Count for ser[-2: 2]: 3
Count for ser[-1: 3]: 4
Count for ser[ 0: 4]: 5
Count for ser[ 1: 5]: 5
Count for ser[ 2: 6]: 5
Count for ser[ 3: 7]: 5
Count for ser[ 4: 8]: 4
Count for ser[ 5: 9]: 4
Count for ser[ 6:10]: 3
Count for ser[ 7:11]: 2

@ghost
Copy link

ghost commented Mar 13, 2013

yep, I see it. the start and end are inconsistent in how they treat missing datums.
marking as bug.

@ghost
Copy link

ghost commented Mar 18, 2013

@jreback , take a look at 8bd09ac, I started working on this but it's tricky
and I won't have time to finish soon.

currently, the way this works is by running the function as usual, and then shifting the result
(in moment.py:_center_window), this leavs missing values at the end, and mis-normalized
values in the beginning (at least for rolling_mean it invalidates the original calc,
see 0.5 instead of 0.333 in example).

padding a numpy array is a problem because of copying, and I was having a hell
of a time generalizing the 1d case I did to nd. plus there are a bunch of broken
tests by this fix, and need to sift to see which are wrong test vectors and which real problems.

If you care, give this a shot, I can't come back to this for a while.

@ghost ghost closed this as completed Mar 18, 2013
@ghost ghost reopened this Mar 18, 2013
@jreback
Copy link
Contributor

jreback commented Mar 18, 2013

@y-p i will take a look....

@jreback
Copy link
Contributor

jreback commented Mar 18, 2013

@y-p

ok got almost everything to work, just some minor logic issues, rolling_kurt/skew needed some mods;
added support for ndim>1 was easy

everything passes (only modded 1 test, which I think is right), except rolling_count, I am a bit unclear why its not exactly working, but have to leave you something !

db78efb (my fork, cmov branch)

@ghost
Copy link

ghost commented Mar 19, 2013

Ok, looked at your fix jeff. you took it further, but still plenty of issues.

In [1]: In [20]: ser = pd.Series(range(10))
   ...:     ...: df = pd.DataFrame(ser)
   ...:     ...: N=3
   ...:     ...: df['rc'] = pd.rolling_count(ser, N, center=True)
   ...:     ...: df['rm'] = pd.rolling_mean(ser, N, center=True)
   ...:     ...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
   ...:     ...: def movingaverage(interval, window_size):
   ...:     ...:     window = numpy.ones(int(window_size))/float(window_size)
   ...:     ...:     return numpy.convolve(interval, window, 'same')
   ...:     ...: 
   ...:     ...: df['ma'] = movingaverage(ser, N)
   ...:     ...: df
Out[1]: 
   0  rc        rm  rm_min_1        ma
0  0   0  0.333333       NaN  0.333333
1  1   0  1.000000       NaN  1.000000
2  2   1  2.000000         2  2.000000
3  3   2  3.000000         3  3.000000
4  4   3  4.000000         4  4.000000
5  5   3  5.000000         5  5.000000
6  6   3  6.000000         6  6.000000
7  7   3  7.000000         7  7.000000
8  8   3  8.000000         8  8.000000
9  9   2  5.666667       NaN  5.666667

can't explain the weird number formatting, I verified that rm_min_1 has float64 dtype like
the rest of the columns.

  • the nans should be symmetric in rm_min_1 at both ends (one missing at both ends, or 2, etc')
  • no nans should occur in rm_min_1 when min_periods=1, 1 (at each end) for (3,2), 3 for(5,4).
    and 1 for (4,3) ( ...I think. based on numpy's convention of "center" for even-length windows)
  • Fairly sure some of the passing tests validate the wrong answer.

@jreback
Copy link
Contributor

jreback commented Mar 19, 2013

cherry picked just that commit

@ghost
Copy link

ghost commented Mar 19, 2013

ok, I'll pick it up where you left off. thanks for taking care of the nd case.

@bmu
Copy link
Author

bmu commented Mar 20, 2013

at first, thanks for your work.

I just thought about how to "do it right" if we have an even window for a centered moving window: in the "engineering statisitics handbook" of the NIST a double smoothing is done (http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc422.htm).
Not sure, how other libraries (some equivalent functions like rollmean and rollapply are included in the R zoo package) handle this. Also I'm not sure, if ARMA models (statsmodels) use centered windows and how they do it.

I know that this will further complicate these functions (and make them slower, maybe cython would be a better option for computing such windows?), however even windows seems to be quite common (you would usually read "We use a 10-year moving average" instead of "We use an 11-year moving average" in publications).

I don't want to over-complicate things, what do you think about this?

@ghost
Copy link

ghost commented Mar 20, 2013

We'll definitely need to get the basics down before doing anything more fancy.
the inner loop of these functions is already written in cython, we just need to
fix the edges in a way that doesn't have a big impact. correctness first, then performance
and then enhancements.

@ghost
Copy link

ghost commented Mar 24, 2013

moving to 0.12, there are a lot of hairy details involved in the fix, it should
sit in review for a while.

@ghost
Copy link

ghost commented Mar 25, 2013

All tests pass for index nlevels =1.

It's unclear what the right thing to do is with rolling_mean at edges,
should missing data count as an existing value of 0 , or treated like other
missing values, so that mean is computed as sum/n_notna rather then sum/win

The numpy code above treats value "missing" because they are ooutside the array boundries,
and nans in the data itself differently. pandas' supports minp to control the behaviour
of nans in the data itself.

In [18]: def ma(interval, window_size):
    ...:     window = numpy.ones(int(window_size))/float(window_size)
    ...:     return numpy.convolve(interval, window, 'same')
    ...: 
    ...: ser= np.ones(9)
    ...: ser[len(ser)//2] = np.nan
    ...: print list(ma(ser,3))
    ...: print list(pd.rolling_mean(ser,3,2,center=True))
    ...: print list(pd.rolling_mean(ser,3,3,center=True))
    ...: print list(pd.rolling_mean(ser,3,1,center=True,pad_val=0))
[0.66666666666666663, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, 0.66666666666666663]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[nan, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, nan]
[0.66666666666666663, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66666666666666663]

As you can see, it's difficult to reproduce numpy's behaviour, since minp=win
inevitably produces nans at the edges as well as in the center.
Perhaps this is actually more consistent.

pad_val is an API change, and i'm not sure whether it's a good idea. for one thing,
we'll have to backport this to uncentered roll_functions (expanding_* as well?)
and the code is already pretty tangled.

Feedback welcome.

@ghost
Copy link

ghost commented Mar 25, 2013

@jreback, I cp'd your changes to roll_skew/roll_kurt as-is, I need to know if the
guard you put in against the ZeroDivision needs to have the results validated,
or did you already do that?

@jreback
Copy link
Contributor

jreback commented Mar 25, 2013

@y-p I didn't do a test if that's what you are asking, but its already tested by comparing vs kurt/skew in any event
this only comes up in the centering becuase the nobs can change and potentially go to 0, it should have been there before in any event, just probably never got called with 0 nobs

@ghost
Copy link

ghost commented Mar 25, 2013

Thanks, I'll do a spot check, but should be fine.

@ghost
Copy link

ghost commented Mar 31, 2013

@bmu, can you try out #3161 and report any issues?

@bmu
Copy link
Author

bmu commented Apr 23, 2013

I didn't see an issues for my examples and some more tests.

Some thoughts:

  • not sure about pad_val, maybe using a fixed nan (no argument at all) would be simpler

  • I can't get how the default min_periods is calculated, maybe this could be documented

  • bottlenecks moving functions seem to be faster (but there are no centered alternatives)

    In [83]: s = pd.Series(range(1000), dtype='float')
    
    In [84]: s[8] = np.nan
    
    In [85]: import bottleneck as bn
    
    In [86]: %timeit pd.rolling_mean(s, 3)
    10000 loops, best of 3: 38 us per loop
    
    In [87]: %timeit bn.move_nanmean(s, 3)
    100000 loops, best of 3: 7.1 us per loop

    As bottleneck is a recommended dependency it could be used in the cases where
    center=False.

(However I`m happy with your solution as it is)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 9, 2014
@jreback jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015
@arw2019
Copy link
Member

arw2019 commented Sep 20, 2020

On 1.2 master:

In [8]: import numpy as np 
   ...: import pandas as pd 
   ...:  
   ...: ser = pd.Series(range(10)) 
   ...: df = pd.DataFrame(ser) 
   ...: N=3 
   ...: df['rc'] = ser.rolling(N, center=True).count() 
   ...: df['rm'] = ser.rolling(N, center=True).mean() 
   ...: df['rm_min_1'] = ser.rolling(N, center=True, min_periods=1).mean() 
   ...: def movingaverage(interval, window_size): 
   ...:     window = np.ones(int(window_size))/float(window_size) 
   ...:     return np.convolve(interval, window, 'same') 
   ...:  
   ...: df['ma'] = movingaverage(ser, N) 
   ...: df                                                                                                                                                                                                        
Out[8]: 
   0   rc   rm  rm_min_1        ma
0  0  2.0  NaN       0.5  0.333333
1  1  3.0  1.0       1.0  1.000000
2  2  3.0  2.0       2.0  2.000000
3  3  3.0  3.0       3.0  3.000000
4  4  3.0  4.0       4.0  4.000000
5  5  3.0  5.0       5.0  5.000000
6  6  3.0  6.0       6.0  6.000000
7  7  3.0  7.0       7.0  7.000000
8  8  3.0  8.0       8.0  8.000000
Output of pd.show_versions()

INSTALLED VERSIONS

commit : a22cf43
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-47-generic
Version : #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+446.ga22cf439e
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.1.0.post20200704
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.19.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : 0.4.0
gcsfs : 0.6.2
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

@mroeschke
Copy link
Member

It's not entirely clear what's left to do with this issue. We can reopen if there's a focused example of the remaining buggy behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

5 participants