BUG: centered moving windows mishandle series edges #2953

bmu · 2013-03-01T14:04:06Z

Seems like there is a bug in the centered mooving window functions, although the examples here don't show it.
Also np.convolve used like in this post on Stackoverflow shows different results for the first values:

In [109]: ser = pd.Series(randn(20), index=pd.date_range('1/1/2000', periods=20))

In [110]: df = pd.DataFrame(ser)

In [111]: df['rc'] = pd.rolling_count(ser, 10, center=True)

In [112]: df['rm'] = pd.rolling_mean(ser, 10, center=True)

In [113]: df['rm_min_1'] = pd.rolling_mean(ser, 10, center=True, min_periods=1)

In [114]: def movingaverage(interval, window_size):
   .....:         window = numpy.ones(int(window_size))/float(window_size)
   .....:         return numpy.convolve(interval, window, 'same')
   .....: 

In [116]: df['ma'] = movingaverage(ser, 10)

In [117]: df
Out[117]: 
                   0  rc        rm  rm_min_1        ma
2000-01-01  1.177692   5       NaN  0.048306  0.024153
2000-01-02 -0.068567   6       NaN -0.094287 -0.056572
2000-01-03  0.194942   7       NaN -0.225103 -0.157572
2000-01-04 -0.179778   8       NaN -0.160040 -0.128032
2000-01-05 -0.882760   9       NaN -0.262176 -0.235959
2000-01-06 -0.807252  10 -0.250828 -0.250828 -0.250828
2000-01-07 -1.009999  10 -0.365703 -0.365703 -0.365703
2000-01-08  0.295402  10 -0.431110 -0.431110 -0.431110
2000-01-09 -1.079267  10 -0.408968 -0.408968 -0.408968
2000-01-10 -0.148689  10 -0.456946 -0.456946 -0.456946
2000-01-11  0.028935  10 -0.290453 -0.290453 -0.290453
2000-01-12 -0.722632  10 -0.204166 -0.204166 -0.204166
2000-01-13  0.416364  10 -0.175837 -0.175837 -0.175837
2000-01-14 -0.659560  10 -0.266773 -0.266773 -0.266773
2000-01-15  0.782165  10 -0.184003 -0.184003 -0.184003
2000-01-16  0.055617  10 -0.338359 -0.338359 -0.338359
2000-01-17 -0.726706   0       NaN       NaN -0.341252
2000-01-18 -0.613954   0       NaN       NaN -0.268989
2000-01-19 -0.251574   0       NaN       NaN -0.310625
2000-01-20 -1.692243   0       NaN       NaN -0.244669

In [5]: import numpy

In [6]: In [20]: ser = pd.Series(range(10))
   ...:     ...: df = pd.DataFrame(ser)
   ...:     ...: N=3
   ...:     ...: df['rc'] = ser.rolling(N, center=True).count()
   ...:     ...: df['rm'] = ser.rolling(N, center=True).mean()
   ...:     ...: df['rm_min_1'] = ser.rolling(N, center=True, min_periods=1).mean()
   ...:     ...: def movingaverage(interval, window_size):
   ...:     ...:     window = numpy.ones(int(window_size))/float(window_size)
   ...:     ...:     return numpy.convolve(interval, window, 'same')
   ...:     ...:
   ...:     ...: df['ma'] = movingaverage(ser, N)

In [7]: df
Out[7]:
   0   rc   rm  rm_min_1        ma
0  0  2.0  NaN       0.5  0.333333
1  1  3.0  1.0       1.0  1.000000
2  2  3.0  2.0       2.0  2.000000
3  3  3.0  3.0       3.0  3.000000
4  4  3.0  4.0       4.0  4.000000
5  5  3.0  5.0       5.0  5.000000
6  6  3.0  6.0       6.0  6.000000
7  7  3.0  7.0       7.0  7.000000
8  8  3.0  8.0       8.0  8.000000
9  9  2.0  NaN       8.5  5.666667

The text was updated successfully, but these errors were encountered:

ghost · 2013-03-13T00:48:30Z

Let's try this with just a straight integer range as input data:

In [20]: ser = pd.Series(range(10))
    ...: df = pd.DataFrame(ser)
    ...: N=3
    ...: df['rc'] = pd.rolling_count(ser, N, center=True)
    ...: df['rm'] = pd.rolling_mean(ser, N, center=True)
    ...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
    ...: def movingaverage(interval, window_size):
    ...:     window = numpy.ones(int(window_size))/float(window_size)
    ...:     return numpy.convolve(interval, window, 'same')
    ...: 
    ...: df['ma'] = movingaverage(ser, N)
    ...: df
Out[20]: 
   0  rc  rm  rm_min_1        ma
0  0   2 NaN       0.5  0.333333
1  1   3   1       1.0  1.000000
2  2   3   2       2.0  2.000000
3  3   3   3       3.0  3.000000
4  4   3   4       4.0  4.000000
5  5   3   5       5.0  5.000000
6  6   3   6       6.0  6.000000
7  7   3   7       7.0  7.000000
8  8   3   8       8.0  8.000000
9  9   0 NaN       NaN  5.666667

In [21]:

numpy and pandas agree on points away from the boundary,
and the numpy docstring warns you about boundary effects,
behavior there being a matter of convention rather then definition.

This looks as it should be to me, can you be more specific
about the behaviour you expected to see?

bmu · 2013-03-13T07:42:43Z

I would expect that df.ix[9,'rc'] would be 2 and that df.ix[9, 'rm_min_1'] would be 8.5.

Basically my problem is, that the window size is not computed as expected at the end of the series.
As I have nan's in my data (that's the reason for the min_period in my previous example) I include an nan here:

In [55]: ser = pd.Series(range(10), dtype='float')

In [56]: ser[8] = np.nan

In [57]: pd.rolling_count(ser, 5, center=True)
Out[57]: 
0    3
1    4
2    5
3    5
4    5
5    5
6    4
7    4
8    0
9    0

If the window would be centred around the index I would expect the same results as for ser.ix[i-2:i+2] in
the case of a window size of 5:

In [47]: for i in range(10):
    print 'Count for ser[%2d:%2d]: %d' % (i - 2, i + 2, ser.ix[i-2:i+2].count())
   ....:     
Count for ser[-2: 2]: 3
Count for ser[-1: 3]: 4
Count for ser[ 0: 4]: 5
Count for ser[ 1: 5]: 5
Count for ser[ 2: 6]: 5
Count for ser[ 3: 7]: 5
Count for ser[ 4: 8]: 4
Count for ser[ 5: 9]: 4
Count for ser[ 6:10]: 3
Count for ser[ 7:11]: 2

ghost · 2013-03-13T13:56:32Z

yep, I see it. the start and end are inconsistent in how they treat missing datums.
marking as bug.

ghost · 2013-03-18T15:47:37Z

@jreback , take a look at 8bd09ac, I started working on this but it's tricky
and I won't have time to finish soon.

currently, the way this works is by running the function as usual, and then shifting the result
(in moment.py:_center_window), this leavs missing values at the end, and mis-normalized
values in the beginning (at least for rolling_mean it invalidates the original calc,
see 0.5 instead of 0.333 in example).

padding a numpy array is a problem because of copying, and I was having a hell
of a time generalizing the 1d case I did to nd. plus there are a bunch of broken
tests by this fix, and need to sift to see which are wrong test vectors and which real problems.

If you care, give this a shot, I can't come back to this for a while.

jreback · 2013-03-18T19:24:17Z

@y-p i will take a look....

jreback · 2013-03-18T23:21:50Z

@y-p

ok got almost everything to work, just some minor logic issues, rolling_kurt/skew needed some mods;
added support for ndim>1 was easy

everything passes (only modded 1 test, which I think is right), except rolling_count, I am a bit unclear why its not exactly working, but have to leave you something !

db78efb (my fork, cmov branch)

ghost · 2013-03-19T21:58:48Z

Ok, looked at your fix jeff. you took it further, but still plenty of issues.

In [1]: In [20]: ser = pd.Series(range(10))
   ...:     ...: df = pd.DataFrame(ser)
   ...:     ...: N=3
   ...:     ...: df['rc'] = pd.rolling_count(ser, N, center=True)
   ...:     ...: df['rm'] = pd.rolling_mean(ser, N, center=True)
   ...:     ...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
   ...:     ...: def movingaverage(interval, window_size):
   ...:     ...:     window = numpy.ones(int(window_size))/float(window_size)
   ...:     ...:     return numpy.convolve(interval, window, 'same')
   ...:     ...: 
   ...:     ...: df['ma'] = movingaverage(ser, N)
   ...:     ...: df
Out[1]: 
   0  rc        rm  rm_min_1        ma
0  0   0  0.333333       NaN  0.333333
1  1   0  1.000000       NaN  1.000000
2  2   1  2.000000         2  2.000000
3  3   2  3.000000         3  3.000000
4  4   3  4.000000         4  4.000000
5  5   3  5.000000         5  5.000000
6  6   3  6.000000         6  6.000000
7  7   3  7.000000         7  7.000000
8  8   3  8.000000         8  8.000000
9  9   2  5.666667       NaN  5.666667

can't explain the weird number formatting, I verified that rm_min_1 has float64 dtype like
the rest of the columns.

the nans should be symmetric in rm_min_1 at both ends (one missing at both ends, or 2, etc')
no nans should occur in rm_min_1 when min_periods=1, 1 (at each end) for (3,2), 3 for(5,4).
and 1 for (4,3) ( ...I think. based on numpy's convention of "center" for even-length windows)
Fairly sure some of the passing tests validate the wrong answer.

jreback · 2013-03-19T22:12:07Z

cherry picked just that commit

ghost · 2013-03-19T22:16:27Z

ok, I'll pick it up where you left off. thanks for taking care of the nd case.

bmu · 2013-03-20T10:06:05Z

at first, thanks for your work.

I just thought about how to "do it right" if we have an even window for a centered moving window: in the "engineering statisitics handbook" of the NIST a double smoothing is done (http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc422.htm).
Not sure, how other libraries (some equivalent functions like rollmean and rollapply are included in the R zoo package) handle this. Also I'm not sure, if ARMA models (statsmodels) use centered windows and how they do it.

I know that this will further complicate these functions (and make them slower, maybe cython would be a better option for computing such windows?), however even windows seems to be quite common (you would usually read "We use a 10-year moving average" instead of "We use an 11-year moving average" in publications).

I don't want to over-complicate things, what do you think about this?

ghost · 2013-03-20T14:22:12Z

We'll definitely need to get the basics down before doing anything more fancy.
the inner loop of these functions is already written in cython, we just need to
fix the edges in a way that doesn't have a big impact. correctness first, then performance
and then enhancements.

ghost · 2013-03-24T21:16:50Z

moving to 0.12, there are a lot of hairy details involved in the fix, it should
sit in review for a while.

ghost · 2013-03-25T14:43:57Z

All tests pass for index nlevels =1.

It's unclear what the right thing to do is with rolling_mean at edges,
should missing data count as an existing value of 0 , or treated like other
missing values, so that mean is computed as sum/n_notna rather then sum/win

The numpy code above treats value "missing" because they are ooutside the array boundries,
and nans in the data itself differently. pandas' supports minp to control the behaviour
of nans in the data itself.

In [18]: def ma(interval, window_size):
    ...:     window = numpy.ones(int(window_size))/float(window_size)
    ...:     return numpy.convolve(interval, window, 'same')
    ...: 
    ...: ser= np.ones(9)
    ...: ser[len(ser)//2] = np.nan
    ...: print list(ma(ser,3))
    ...: print list(pd.rolling_mean(ser,3,2,center=True))
    ...: print list(pd.rolling_mean(ser,3,3,center=True))
    ...: print list(pd.rolling_mean(ser,3,1,center=True,pad_val=0))
[0.66666666666666663, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, 0.66666666666666663]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[nan, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, nan]
[0.66666666666666663, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66666666666666663]

As you can see, it's difficult to reproduce numpy's behaviour, since minp=win
inevitably produces nans at the edges as well as in the center.
Perhaps this is actually more consistent.

pad_val is an API change, and i'm not sure whether it's a good idea. for one thing,
we'll have to backport this to uncentered roll_functions (expanding_* as well?)
and the code is already pretty tangled.

Feedback welcome.

ghost · 2013-03-25T14:48:10Z

@jreback, I cp'd your changes to roll_skew/roll_kurt as-is, I need to know if the
guard you put in against the ZeroDivision needs to have the results validated,
or did you already do that?

jreback · 2013-03-25T15:07:53Z

@y-p I didn't do a test if that's what you are asking, but its already tested by comparing vs kurt/skew in any event
this only comes up in the centering becuase the nobs can change and potentially go to 0, it should have been there before in any event, just probably never got called with 0 nobs

ghost · 2013-03-25T15:51:54Z

Thanks, I'll do a spot check, but should be fine.

ghost · 2013-03-31T21:40:09Z

@bmu, can you try out #3161 and report any issues?

bmu · 2013-04-23T20:56:16Z

I didn't see an issues for my examples and some more tests.

Some thoughts:

not sure about pad_val, maybe using a fixed nan (no argument at all) would be simpler
I can't get how the default min_periods is calculated, maybe this could be documented

bottlenecks moving functions seem to be faster (but there are no centered alternatives)

In [83]: s = pd.Series(range(1000), dtype='float')

In [84]: s[8] = np.nan

In [85]: import bottleneck as bn

In [86]: %timeit pd.rolling_mean(s, 3)
10000 loops, best of 3: 38 us per loop

In [87]: %timeit bn.move_nanmean(s, 3)
100000 loops, best of 3: 7.1 us per loop

As bottleneck is a recommended dependency it could be used in the cases where
center=False.

(However I`m happy with your solution as it is)

arw2019 · 2020-09-20T19:57:31Z

On 1.2 master:

In [8]: import numpy as np 
   ...: import pandas as pd 
   ...:  
   ...: ser = pd.Series(range(10)) 
   ...: df = pd.DataFrame(ser) 
   ...: N=3 
   ...: df['rc'] = ser.rolling(N, center=True).count() 
   ...: df['rm'] = ser.rolling(N, center=True).mean() 
   ...: df['rm_min_1'] = ser.rolling(N, center=True, min_periods=1).mean() 
   ...: def movingaverage(interval, window_size): 
   ...:     window = np.ones(int(window_size))/float(window_size) 
   ...:     return np.convolve(interval, window, 'same') 
   ...:  
   ...: df['ma'] = movingaverage(ser, N) 
   ...: df                                                                                                                                                                                                        
Out[8]: 
   0   rc   rm  rm_min_1        ma
0  0  2.0  NaN       0.5  0.333333
1  1  3.0  1.0       1.0  1.000000
2  2  3.0  2.0       2.0  2.000000
3  3  3.0  3.0       3.0  3.000000
4  4  3.0  4.0       4.0  4.000000
5  5  3.0  5.0       5.0  5.000000
6  6  3.0  6.0       6.0  6.000000
7  7  3.0  7.0       7.0  7.000000
8  8  3.0  8.0       8.0  8.000000

Output of pd.show_versions()

INSTALLED VERSIONS

commit : a22cf43
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-47-generic
Version : #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+446.ga22cf439e
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.1.0.post20200704
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.19.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : 0.4.0
gcsfs : 0.6.2
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

mroeschke · 2020-12-28T00:07:08Z

It's not entirely clear what's left to do with this issue. We can reopen if there's a focused example of the remaining buggy behavior

ghost assigned changhiskhan Mar 6, 2013

ghost mentioned this issue Mar 18, 2013

ENH: improve performance of df.to_csv GH3054 #3059

Merged

ghost closed this as completed Mar 18, 2013

ghost reopened this Mar 18, 2013

ghost mentioned this issue Mar 19, 2013

Centered moving window functions #1270

Closed

This was referenced Mar 24, 2013

rolling_corr not in [-1,1] when using 'center' flag #3155

Closed

Bug in rolling_mean when minp == win #3157

Closed

ghost mentioned this issue Mar 24, 2013

WIP: fixes to rolling_window functions with center=True #3161

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Mar 9, 2014

jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015

jreback modified the milestones: 0.17.0, 0.16.0 Jan 26, 2015

jreback mentioned this issue Oct 14, 2016

incorrect calculation of centered moving averages for even length series #14425

Closed

mroeschke added the Window rolling, ewma, expanding label Oct 20, 2019

mroeschke closed this as completed Dec 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: centered moving windows mishandle series edges #2953

BUG: centered moving windows mishandle series edges #2953

bmu commented Mar 1, 2013 •

edited by mroeschke

Loading

ghost commented Mar 13, 2013

bmu commented Mar 13, 2013

ghost commented Mar 13, 2013

ghost commented Mar 18, 2013

jreback commented Mar 18, 2013

jreback commented Mar 18, 2013

ghost commented Mar 19, 2013

jreback commented Mar 19, 2013

ghost commented Mar 19, 2013

bmu commented Mar 20, 2013

ghost commented Mar 20, 2013

ghost commented Mar 24, 2013

ghost commented Mar 25, 2013

ghost commented Mar 25, 2013

jreback commented Mar 25, 2013

ghost commented Mar 25, 2013

ghost commented Mar 31, 2013

bmu commented Apr 23, 2013

arw2019 commented Sep 20, 2020

INSTALLED VERSIONS

mroeschke commented Dec 28, 2020

BUG: centered moving windows mishandle series edges #2953

BUG: centered moving windows mishandle series edges #2953

Comments

bmu commented Mar 1, 2013 • edited by mroeschke Loading

ghost commented Mar 13, 2013

bmu commented Mar 13, 2013

ghost commented Mar 13, 2013

ghost commented Mar 18, 2013

jreback commented Mar 18, 2013

jreback commented Mar 18, 2013

ghost commented Mar 19, 2013

jreback commented Mar 19, 2013

ghost commented Mar 19, 2013

bmu commented Mar 20, 2013

ghost commented Mar 20, 2013

ghost commented Mar 24, 2013

ghost commented Mar 25, 2013

ghost commented Mar 25, 2013

jreback commented Mar 25, 2013

ghost commented Mar 25, 2013

ghost commented Mar 31, 2013

bmu commented Apr 23, 2013

arw2019 commented Sep 20, 2020

INSTALLED VERSIONS

mroeschke commented Dec 28, 2020

bmu commented Mar 1, 2013 •

edited by mroeschke

Loading