BUG: centered moving windows mishandle series edges #2953
Let's try this with just a straight integer range as input data:
In [20]: ser = pd.Series(range(10))
...: df = pd.DataFrame(ser)
...: N=3
...: df['rc'] = pd.rolling_count(ser, N, center=True)
...: df['rm'] = pd.rolling_mean(ser, N, center=True)
...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
...: def movingaverage(interval, window_size):
...: window = numpy.ones(int(window_size))/float(window_size)
...: return numpy.convolve(interval, window, 'same')
...:
...: df['ma'] = movingaverage(ser, N)
...: df
Out[20]:
0 rc rm rm_min_1 ma
0 0 2 NaN 0.5 0.333333
1 1 3 1 1.0 1.000000
2 2 3 2 2.0 2.000000
3 3 3 3 3.0 3.000000
4 4 3 4 4.0 4.000000
5 5 3 5 5.0 5.000000
6 6 3 6 6.0 6.000000
7 7 3 7 7.0 7.000000
8 8 3 8 8.0 8.000000
9 9 0 NaN NaN 5.666667
numpy and pandas agree on points away from the boundary. This looks as it should be to me; can you be more specific? |
I would expect that
Basically, my problem is that the window size is not computed as expected at the end of the series.
In [55]: ser = pd.Series(range(10), dtype='float')
In [56]: ser[8] = np.nan
In [57]: pd.rolling_count(ser, 5, center=True)
Out[57]:
0 3
1 4
2 5
3 5
4 5
5 5
6 4
7 4
8 0
9 0
If the window were centred around the index, I would expect the same results as for:
In [47]: for i in range(10):
   ....:     print 'Count for ser[%2d:%2d]: %d' % (i - 2, i + 2, ser.ix[i-2:i+2].count())
   ....:
Count for ser[-2: 2]: 3
Count for ser[-1: 3]: 4
Count for ser[ 0: 4]: 5
Count for ser[ 1: 5]: 5
Count for ser[ 2: 6]: 5
Count for ser[ 3: 7]: 5
Count for ser[ 4: 8]: 4
Count for ser[ 5: 9]: 4
Count for ser[ 6:10]: 3
Count for ser[ 7:11]: 2 |
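The expected counts above can be cross-checked against explicit positional slices. The following is a minimal sketch using the current pandas API (`Series.rolling(...)` and `.iloc`) rather than the `pd.rolling_count` and `.ix` calls from the session. The helper names `expected_centered_count` and `rolling_centered_count` are hypothetical, and the rolling version sums a 0/1 mask instead of calling `count()`, on the assumption that this avoids any version-dependent default for `min_periods`.

```python
import numpy as np
import pandas as pd


def expected_centered_count(ser, window):
    """Count non-NaN values in [i - half, i + half], clipped to the series.

    Assumes an odd `window`, as in the example from the thread.
    """
    half = window // 2
    counts = []
    for i in range(len(ser)):
        lo = max(i - half, 0)
        hi = min(i + half + 1, len(ser))  # iloc end is exclusive
        counts.append(ser.iloc[lo:hi].count())
    return pd.Series(counts, index=ser.index, dtype="float64")


def rolling_centered_count(ser, window):
    """Same count via a centered rolling sum over a 0/1 'is present' mask."""
    present = ser.notna().astype("float64")
    return present.rolling(window, center=True, min_periods=1).sum()


ser = pd.Series(range(10), dtype="float64")
ser.iloc[8] = np.nan

expected = expected_centered_count(ser, 5)
actual = rolling_centered_count(ser, 5)
print(pd.DataFrame({"expected": expected, "rolling": actual}))
```

If the two columns agree, the centered window is being clipped symmetrically at both ends of the series, which is the behaviour the loop above describes.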
Yep, I see it. The start and end are inconsistent in how they treat missing data. |
@jreback, take a look at 8bd09ac. I started working on this, but it's tricky. Currently, the way this works is by running the function as usual and then shifting the result. Padding a numpy array is a problem because of copying, and I was having a hell If you care, give this a shot; I can't come back to this for a while. |
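The "run the function as usual, then shift the result" approach described here can be sketched as follows. This is an illustration written against the current `Series.rolling` API, not the code in commit 8bd09ac; the helper name `centered_mean_by_shift` and the offset `(window - 1) // 2` are assumptions made for the example.

```python
import pandas as pd


def centered_mean_by_shift(ser, window):
    """Trailing rolling mean, shifted left so each window is centered."""
    offset = (window - 1) // 2
    trailing = ser.rolling(window).mean()  # window ends at the current row
    return trailing.shift(-offset)         # rows shifted in at the end are NaN


ser = pd.Series(range(10), dtype="float64")
shifted = centered_mean_by_shift(ser, 3)
builtin = ser.rolling(3, center=True).mean()
print(pd.DataFrame({"shifted": shifted, "center=True": builtin}))
```

With full windows required, the two columns match. Once `min_periods` is smaller than the window, a plain shift can produce partial windows at the start of the series but only NaN at the end, which appears to be the kind of edge asymmetry reported in this issue.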
@y-p i will take a look.... |
OK, got almost everything to work, just some minor logic issues. rolling_kurt/skew needed some mods. Everything passes (only modded 1 test, which I think is right) except rolling_count; I am a bit unclear why it's not exactly working, but I have to leave you something! db78efb (my fork, cmov branch) |
OK, looked at your fix, Jeff. You took it further, but there are still plenty of issues.
In [1]: In [20]: ser = pd.Series(range(10))
...: ...: df = pd.DataFrame(ser)
...: ...: N=3
...: ...: df['rc'] = pd.rolling_count(ser, N, center=True)
...: ...: df['rm'] = pd.rolling_mean(ser, N, center=True)
...: ...: df['rm_min_1'] = pd.rolling_mean(ser, N, center=True, min_periods=1)
...: ...: def movingaverage(interval, window_size):
...: ...: window = numpy.ones(int(window_size))/float(window_size)
...: ...: return numpy.convolve(interval, window, 'same')
...: ...:
...: ...: df['ma'] = movingaverage(ser, N)
...: ...: df
Out[1]:
0 rc rm rm_min_1 ma
0 0 0 0.333333 NaN 0.333333
1 1 0 1.000000 NaN 1.000000
2 2 1 2.000000 2 2.000000
3 3 2 3.000000 3 3.000000
4 4 3 4.000000 4 4.000000
5 5 3 5.000000 5 5.000000
6 6 3 6.000000 6 6.000000
7 7 3 7.000000 7 7.000000
8 8 3 8.000000 8 8.000000
9 9 2 5.666667 NaN 5.666667
can't explain the weird number formatting, I verified that
|
cherry picked just that commit |
ok, I'll pick it up where you left off. thanks for taking care of the nd case. |
First of all, thanks for your work. I just thought about how to "do it right" when a centered moving window has an even size: in the NIST "Engineering Statistics Handbook", a double smoothing is done (http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc422.htm). I know that this will further complicate these functions (and make them slower; maybe Cython would be a better option for computing such windows?). However, even windows seem to be quite common (you would usually read "We use a 10-year moving average" instead of "We use an 11-year moving average" in publications). I don't want to over-complicate things; what do you think about this? |
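For reference, the double smoothing mentioned above can be sketched with two chained rolling means. This is only an illustration of the idea for an even window, written with the current `Series.rolling` API; the helper name `centered_even_ma` and the choice to return NaN wherever a full window is unavailable are assumptions, not something proposed in this thread.

```python
import numpy as np
import pandas as pd


def centered_even_ma(ser, window):
    """2 x `window` moving average for an even `window` (double smoothing)."""
    if window % 2 != 0:
        raise ValueError("window must be even")
    first = ser.rolling(window).mean()   # trailing MA, centered between rows
    second = first.rolling(2).mean()     # average adjacent MAs to re-center
    return second.shift(-(window // 2))  # align the result with its center row


ser = pd.Series(np.arange(10, dtype="float64") ** 2)
smoothed = centered_even_ma(ser, 4)
print(smoothed)
```

For `window=4` this is equivalent to a five-point weighted average with weights 1/8, 1/4, 1/4, 1/4, 1/8, so each output row is centered on an actual observation rather than between two of them.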
We'll definitely need to get the basics down before doing anything more fancy. |
moving to 0.12, there are a lot of hairy details involved in the fix, it should |
All tests pass for index nlevels = 1. It's unclear what the right thing to do is with
The numpy code above treats values as "missing" because they are outside the array boundaries,
In [18]: def ma(interval, window_size):
...: window = numpy.ones(int(window_size))/float(window_size)
...: return numpy.convolve(interval, window, 'same')
...:
...: ser= np.ones(9)
...: ser[len(ser)//2] = np.nan
...: print list(ma(ser,3))
...: print list(pd.rolling_mean(ser,3,2,center=True))
...: print list(pd.rolling_mean(ser,3,3,center=True))
...: print list(pd.rolling_mean(ser,3,1,center=True,pad_val=0))
[0.66666666666666663, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, 0.66666666666666663]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[nan, 1.0, 1.0, nan, nan, nan, 1.0, 1.0, nan]
[0.66666666666666663, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.66666666666666663]
As you can see, it's difficult to reproduce numpy's behaviour, since
Feedback welcome. |
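The two kinds of "missing" values in the numpy output above can be separated with a small sketch. It reuses the `ma` helper from the session and relies only on `np.convolve` and `np.pad`; the explicit zero-padding step is added here for illustration and assumes an odd window of 3.

```python
import numpy as np


def ma(interval, window_size):
    window = np.ones(int(window_size)) / float(window_size)
    return np.convolve(interval, window, "same")


ser = np.ones(9)

# 'same' mode behaves like zero-padding one sample on each side (window of 3),
# so values outside the array count as 0 rather than being skipped.
padded = np.pad(ser, 1, mode="constant", constant_values=0.0)
explicit = np.convolve(padded, np.ones(3) / 3.0, "valid")
print(np.allclose(ma(ser, 3), explicit))

# A NaN inside the array is not skipped either: every window covering it is NaN.
ser_nan = ser.copy()
ser_nan[len(ser_nan) // 2] = np.nan
print(ma(ser_nan, 3))
```

So out-of-bounds positions act as zeros that pull the edge means down, while an interior NaN propagates to each neighbouring window. A single `min_periods` setting cannot express both behaviours at once.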
@jreback, I cp'd your changes to roll_skew/roll_kurt as-is, I need to know if the |
@y-p I didn't do a test, if that's what you are asking, but it's already tested by comparing vs kurt/skew in any event |
Thanks, I'll do a spot check, but should be fine. |
I didn't see any issues with my examples and some more tests. Some thoughts:
(However, I'm happy with your solution as it is) |
On 1.2 master:
In [8]: import numpy as np
...: import pandas as pd
...:
...: ser = pd.Series(range(10))
...: df = pd.DataFrame(ser)
...: N=3
...: df['rc'] = ser.rolling(N, center=True).count()
...: df['rm'] = ser.rolling(N, center=True).mean()
...: df['rm_min_1'] = ser.rolling(N, center=True, min_periods=1).mean()
...: def movingaverage(interval, window_size):
...: window = np.ones(int(window_size))/float(window_size)
...: return np.convolve(interval, window, 'same')
...:
...: df['ma'] = movingaverage(ser, N)
...: df
Out[8]:
0 rc rm rm_min_1 ma
0 0 2.0 NaN 0.5 0.333333
1 1 3.0 1.0 1.0 1.000000
2 2 3.0 2.0 2.0 2.000000
3 3 3.0 3.0 3.0 3.000000
4 4 3.0 4.0 4.0 4.000000
5 5 3.0 5.0 5.0 5.000000
6 6 3.0 6.0 6.0 6.000000
7 7 3.0 7.0 7.0 7.000000
8 8 3.0 8.0 8.0 8.000000
Output of pd.show_versions()
INSTALLED VERSIONS
commit : a22cf43
pandas : 1.2.0.dev0+446.ga22cf439e |
It's not entirely clear what's left to do with this issue. We can reopen if there's a focused example of the remaining buggy behavior |
Seems like there is a bug in the centered moving window functions, although the examples here don't show it.
Also, np.convolve used as in this post on Stack Overflow shows different results for the first values: