
Refactored Resample API breaking change #11841

Closed
wants to merge 8 commits from the resample branch

Conversation

jreback
Contributor

@jreback jreback commented Dec 14, 2015

on top of #11603

closes #11732
closes #12072
closes #9052
closes #12140

ToDo:

  • rewrite/expand main docs
  • add aggregate section

New API

In [4]: np.random.seed(1234)

In [5]: df = pd.DataFrame(np.random.rand(10,4),
                     columns=list('ABCD'),
                     index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))

In [6]: df
Out[6]: 
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.622109  0.437728  0.785359
2010-01-01 09:00:01  0.779976  0.272593  0.276464  0.801872
2010-01-01 09:00:02  0.958139  0.875933  0.357817  0.500995
2010-01-01 09:00:03  0.683463  0.712702  0.370251  0.561196
2010-01-01 09:00:04  0.503083  0.013768  0.772827  0.882641
2010-01-01 09:00:05  0.364886  0.615396  0.075381  0.368824
2010-01-01 09:00:06  0.933140  0.651378  0.397203  0.788730
2010-01-01 09:00:07  0.316836  0.568099  0.869127  0.436173
2010-01-01 09:00:08  0.802148  0.143767  0.704261  0.704581
2010-01-01 09:00:09  0.218792  0.924868  0.442141  0.909316

In [7]: df.resample('2s')
Out[7]: DatetimeIndexResampler [freq=<2 * Seconds>,axis=0,closed=left,label=left,convention=start,base=0]

In [8]: r = df.resample('2s')

In [9]: r.sum()
Out[9]: 
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

In [10]: r[['A','B']].agg(['mean','sum'])
Out[10]: 
                            A                   B          
                         mean       sum      mean       sum
2010-01-01 09:00:00  0.485748  0.971495  0.447351  0.894701
2010-01-01 09:00:02  0.820801  1.641602  0.794317  1.588635
2010-01-01 09:00:04  0.433985  0.867969  0.314582  0.629165
2010-01-01 09:00:06  0.624988  1.249976  0.609738  1.219477
2010-01-01 09:00:08  0.510470  1.020940  0.534317  1.068634
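
A short usage sketch of the same deferred object with a custom function (reusing r from above; the lambda is only illustrative and output is omitted, assuming apply on the Resampler dispatches like a groupby aggregation):

r['A'].apply(lambda x: x.max() - x.min())   # per-bin range of column A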

Upsampling

In [11]: s = Series(np.arange(5,dtype='int64'),
   ....:               index=date_range('2010-01-01', periods=5, freq='Q'))

In [12]: s
Out[12]: 
2010-03-31    0
2010-06-30    1
2010-09-30    2
2010-12-31    3
2011-03-31    4
Freq: Q-DEC, dtype: int64

In [13]: s.resample('M').ffill()
Out[13]: 
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

In [14]: s.resample('M').asfreq()
Out[14]: 
2010-03-31     0
2010-04-30   NaN
2010-05-31   NaN
2010-06-30     1
2010-07-31   NaN
2010-08-31   NaN
2010-09-30     2
2010-10-31   NaN
2010-11-30   NaN
2010-12-31     3
2011-01-31   NaN
2011-02-28   NaN
2011-03-31     4
Freq: M, dtype: float64
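
The other fill methods hang off the deferred object in the same way; a small sketch, assuming .bfill() is exposed alongside .ffill() (output omitted):

s.resample('M').bfill()   # backward-fill the upsampled gaps instead of forward-fill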

@jreback jreback added API Design Resample resample method labels Dec 14, 2015
@jreback jreback added this to the 0.18.0 milestone Dec 14, 2015
@jorisvandenbossche
Member

Although I really like the fact of a more consistent API, I think this is so backwards incompatible that it really is problematic to just put it in a release like this.

I also kind of like the simplicity of a basic resample to explain in tutorials (although once it is not a basic resample anymore, this interface is nicer). A two-step method is more complicated than a one-step one.

Regarding the back-compat issue, there are maybe other ways to solve this? E.g. using another name, or some keyword indicating the behaviour (but this will get ugly), ...

@jreback
Contributor Author

jreback commented Dec 15, 2015

Although I really like the fact of a more consistent API, I think this is so backwards incompatible that it really is problematic to just put it in a release like this.

pls explain. The point of breaking compat is to change future behavior. Breaking changes always cause short-term pain, but shying away from fixing long-needed inconsistencies is MUCH MUCH worse.

@jreback
Contributor Author

jreback commented Dec 15, 2015

yes we could add a .resample2 method which is the new impl, and preserve the original with a deprecation warning. Then of course we'd have to maintain .resample2 for a bit.

@shoyer
Member

shoyer commented Dec 15, 2015

One thing we could do to help preserve backwards compat is to keep around how for now as a deprecated argument. If how is set, raise a deprecation warning and preserve the original behavior. This would at least keep cases like s.resample('24H', how='max') working, though it would indeed break s.resample('24H').
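
A rough sketch of what such a shim could look like (hypothetical names, not the code in this PR):

import warnings

def resample(self, rule, how=None, **kwargs):
    r = Resampler(self, rule, **kwargs)   # hypothetical constructor for the deferred object
    if how is not None:
        # preserve the old eager behavior, but warn about the new syntax
        warnings.warn("how in .resample() is deprecated, use "
                      ".resample(...).{0}() instead".format(how),
                      FutureWarning, stacklevel=2)
        return getattr(r, how)()
    return r   # new deferred object; breaks code that expected an evaluated result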

@jreback
Contributor Author

jreback commented Dec 15, 2015

I already do the deprecation warning (and just return the result) if how is actually specified.

The only case this is actually breaking is if NO how is specified at all (which is prob very common)

In [1]: s = Series(range(5))

In [2]: s = Series(range(5),index=date_range('20130101',periods=5,freq='s'))

In [3]: s.resample('D',how='min')
/Users/jreback/miniconda/bin/ipython:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).min()
  #!/bin/bash /Users/jreback/miniconda/bin/python.app
Out[3]: 
2013-01-01    0
Freq: D, dtype: int64

In [4]: s.resample('D')
Out[4]: DatetimeIndexResampler [freq-><Day>,axis->0,closed->left,label->left,convention->start,base->0]

@shoyer
Member

shoyer commented Dec 15, 2015

This might be too magical and/or tricky to pull off, but we might add fallback methods to Resampler corresponding to the Series/DataFrame API that issue a deprecation warning, and then call .mean() followed by the desired operation.

@jreback
Contributor Author

jreback commented Dec 16, 2015

hacking Resampler.__getattr__ gets almost all the way there, see here

In [1]: s = Series(range(5),index=date_range('20130101',periods=5,freq='s'))

In [2]: s.resample('D').ix[0]
pandas/tseries/resample.py:73: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
  return getattr(self._deprecate_api(),attr)
Out[2]: 2

In [2]: s.resample('D').sort_values()
pandas/tseries/resample.py:73: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
  return getattr(self._deprecate_api(),attr)
Out[2]: 
2013-01-01    2
dtype: int64
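
Roughly how such a fallback could look (a simplified sketch of a fragment of the Resampler class; the real implementation would also need to whitelist the Resampler's own attributes):

import warnings

class Resampler(object):

    def _deprecated(self):
        warnings.warn(".resample() is now a deferred operation\n"
                      "use .resample(...).mean() instead of .resample(...)",
                      FutureWarning, stacklevel=3)
        return self.mean()   # implicitly evaluate, as the old API did

    def __getattr__(self, attr):
        # anything not defined on the Resampler itself falls through to the
        # implicitly evaluated (and deprecated) Series/DataFrame result
        return getattr(self._deprecated(), attr)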

@jreback jreback force-pushed the resample branch 2 times, most recently from 86b47f2 to 008c7b4 on December 16, 2015 01:43
@shoyer
Member

shoyer commented Dec 16, 2015

@jreback nice. One thing we'll have to catch explicitly is __setitem__ and other assignment or in-place operations. We need to catch those operations so they don't silently fail.

@jreback
Contributor Author

jreback commented Dec 16, 2015

so I defined all of the arithmetic ops / comparison ops and instance checking;
it seems that we can masquerade as a Series (or DataFrame) pretty much, or at least
enough to work and show the deprecation warning.

In [1]: s = Series(range(5),index=date_range('20130101',periods=5,freq='s'))

In [2]: r = s.resample('H')

In [3]: r>2
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
  result = self._deprecated()
Out[3]: 
2013-01-01    False
Freq: H, dtype: bool

In [3]: r*2
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
  result = self._deprecated()
Out[3]: 
2013-01-01    4
Freq: H, dtype: int64
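
A sketch of how those ops could be wired up generically (illustrative only; the op list is truncated and Resampler refers to the class discussed above):

import operator

def _make_deprecated_op(name):
    op = getattr(operator, name)   # e.g. operator.__mul__
    def _op(self, other):
        # evaluate the deferred object (emitting the FutureWarning),
        # then dispatch the op to the resulting Series/DataFrame
        return op(self._deprecated(), other)
    _op.__name__ = name
    return _op

for _name in ['__add__', '__sub__', '__mul__', '__gt__', '__lt__', '__eq__']:
    setattr(Resampler, _name, _make_deprecated_op(_name))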

@jreback
Contributor Author

jreback commented Dec 16, 2015

so it's actually quite a tricky problem to catch this. Imagine you had this code originally:

r = series.resample('H')
r.iloc[0] = 5

The warning will show up, but the setting is actually on a copy. I prob should just raise.

@shoyer
Member

shoyer commented Dec 16, 2015

The warning will show up, but the setting is actually on a copy. I prob should just raise.

Indeed, this needs to raise. I would say this raising is more important than adding the Series/DataFrame facade. If we don't have a facade, things will fail loudly, which is unfortunate, but if we have a broken facade, things will fail silently!
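
A minimal way to make item assignment fail loudly (a sketch; indexer-based assignment such as .iloc[...] = would need similar treatment on the indexer objects):

def __setitem__(self, key, value):
    # setting on the deferred object would only ever modify a temporary,
    # implicitly evaluated copy, so raise instead of warning
    raise TypeError("cannot assign to a deferred .resample() object; "
                    "evaluate it first, e.g. .resample(...).mean()")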

@jreback
Contributor Author

jreback commented Dec 17, 2015

@jorisvandenbossche how are we removing deprecation warnings on older whatsnew files?
I can just change the code I guess

@jorisvandenbossche
Member

@jreback See #6856 (comment) for general discussion on this.
I am personally in favor of converting the older whatsnew files to static code-blocks (and acknowledging that these are 'historic' pages that do not necessarily reflect the current state of the package).

For now, adding :okwarning: to the ipython blocks will suppress the warnings during doc building.
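
For reference, in a whatsnew .rst file that looks roughly like this (a sketch; ts is assumed to be defined earlier in the same ipython block):

.. ipython:: python
   :okwarning:

   ts.resample('D', how='mean')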

@jorisvandenbossche
Member

Although I really like the fact of a more consistent API, I think this is so backwards incompatible that it really is problematic to just put it in a release like this.

pls explain. The point of breaking compat is to change future behavior. Breaking changes always cause short-term pain, but shying away from fixing long-needed inconsistencies is MUCH MUCH worse.

To answer this: of course we have to keep the right balance between avoiding breaking changes and improving pandas. But it is not always very clear what the right balance is, and I could also imagine other parts of pandas that could use some breaking changes, but that we are reluctant to do because of their impact (not to mention the indexing API ... :-))
And most API changes we do are not breaking changes (but deprecations), and most of the time concern corner cases (but OK, I work with time series and use resample a lot, so maybe I am a bit biased).

But, as you have now updated this to include a deprecation warning or fail with an informative error, it already sounds a lot better to me! (at least there will be no silent failure, which is probably the most important thing)

Maybe I will ping the mailing list about this, as it is a rather large change?

@jreback
Contributor Author

jreback commented Dec 18, 2015

@jorisvandenbossche your e-mail to the list is good. thanks.

I agree about most changes, but some things just need to be more consistent; this and the window functions follow the groupby pattern (which is quite ingrained in pandas). So to me this is a no-brainer, EVEN if we had to break back-compat (in this case I was able to provide a nice easy upgrade path, so that makes this pretty easy to swallow IMHO).

if you want to review, this is also ready to go.

@jreback jreback force-pushed the resample branch 4 times, most recently from cae173b to b83dae1 on December 23, 2015 18:02
@jorisvandenbossche
Member

Some more usage feedback:

  • Custom agg/apply is broken for PeriodIndex:

    s2 = pd.Series(np.random.randint(0,5,50),
               index=pd.period_range('2012-01-01', freq='H', periods=50))
    
    In [29]: s2.resample('D').agg(lambda x: x.mean())
    AttributeError: 'PeriodIndexResampler' object has no attribute 'grouper'
    

    This worked before, so was probably not tested.

  • reindex is doing something strange:

    In [37]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
    
    In [38]: ts = pd.Series(np.arange(len(rng)), index=rng)
    
    In [40]: ts.resample('15s').reindex()
    Out[40]:
    2012-01-01 00:00:00     7.0
    2012-01-01 00:00:15    22.0
    2012-01-01 00:00:30    37.0
    2012-01-01 00:00:45    52.0
    2012-01-01 00:01:00    67.0
    2012-01-01 00:01:15    82.0
    2012-01-01 00:01:30    94.5
    Freq: 15S, dtype: float64
    
    In [42]: ts.resample('15s').mean()
    Out[42]:
    2012-01-01 00:00:00     7.0
    2012-01-01 00:00:15    22.0
    2012-01-01 00:00:30    37.0
    2012-01-01 00:00:45    52.0
    2012-01-01 00:01:00    67.0
    2012-01-01 00:01:15    82.0
    2012-01-01 00:01:30    94.5
    Freq: 15S, dtype: float64
    
    In [43]: ts.resample('15s').asfreq()
    Out[43]:
    2012-01-01 00:00:00     0
    2012-01-01 00:00:15    15
    2012-01-01 00:00:30    30
    2012-01-01 00:00:45    45
    2012-01-01 00:01:00    60
    2012-01-01 00:01:15    75
    2012-01-01 00:01:30    90
    Freq: 15S, dtype: int32
    

    So it returns the same as mean, although I would rather expect the same as asfreq ?

  • Combined up and downsampling (due to an irregular time series) has changed behaviour. Using:

    rng = pd.date_range('1/1/2012', periods=100, freq='S')
    ts = pd.Series(np.arange(len(rng)), index=rng)
    ts2 = ts.iloc[[0,1,2,3,5,7,11,15,16,25,30]]
    

    With 0.17.1:

    In [32]: pd.__version__
    Out[32]: u'0.17.1'
    
    In [33]: ts2.resample('2s', how='mean', fill_method='ffill')
    Out[33]:
    2012-01-01 00:00:00     0.5
    2012-01-01 00:00:02     2.5
    2012-01-01 00:00:04     5.0
    2012-01-01 00:00:06     7.0
    2012-01-01 00:00:08     7.0
    2012-01-01 00:00:10    11.0
    2012-01-01 00:00:12    11.0
    2012-01-01 00:00:14    15.0
    2012-01-01 00:00:16    16.0
    2012-01-01 00:00:18    16.0
    2012-01-01 00:00:20    16.0
    2012-01-01 00:00:22    16.0
    2012-01-01 00:00:24    25.0
    2012-01-01 00:00:26    25.0
    2012-01-01 00:00:28    25.0
    2012-01-01 00:00:30    30.0
    Freq: 2S, dtype: float64
    
    In [34]: ts2.resample('2s', how='mean').ffill()
    Out[34]:
    2012-01-01 00:00:00     0.5
    2012-01-01 00:00:02     2.5
    2012-01-01 00:00:04     5.0
    2012-01-01 00:00:06     7.0
    2012-01-01 00:00:08     7.0
    2012-01-01 00:00:10    11.0
    2012-01-01 00:00:12    11.0
    2012-01-01 00:00:14    15.0
    2012-01-01 00:00:16    16.0
    2012-01-01 00:00:18    16.0
    2012-01-01 00:00:20    16.0
    2012-01-01 00:00:22    16.0
    2012-01-01 00:00:24    25.0
    2012-01-01 00:00:26    25.0
    2012-01-01 00:00:28    25.0
    2012-01-01 00:00:30    30.0
    Freq: 2S, dtype: float64
    

    With this branch:

    In [64]: pd.__version__
    Out[64]: '0.17.1+283.gd7c3efb'
    
    In [65]: ts2.resample('2s', how='mean', fill_method='ffill')
    C:\Anaconda\envs\devel\Scripts\ipython-script.py:1: FutureWarning: fill_method is deprecated to .resample()
    the new syntax is .resample(...).ffill()
    if __name__ == '__main__':
    Out[65]:
    2012-01-01 00:00:00     0
    2012-01-01 00:00:02     2
    2012-01-01 00:00:04     3
    2012-01-01 00:00:06     5
    2012-01-01 00:00:08     7
    2012-01-01 00:00:10     7
    2012-01-01 00:00:12    11
    2012-01-01 00:00:14    11
    2012-01-01 00:00:16    16
    2012-01-01 00:00:18    16
    2012-01-01 00:00:20    16
    2012-01-01 00:00:22    16
    2012-01-01 00:00:24    16
    2012-01-01 00:00:26    25
    2012-01-01 00:00:28    25
    2012-01-01 00:00:30    30
    Freq: 2S, dtype: int32
    
    In [66]: ts2.resample('2s').mean().ffill()
    Out[66]:
    2012-01-01 00:00:00     0.5
    2012-01-01 00:00:02     2.5
    2012-01-01 00:00:04     5.0
    2012-01-01 00:00:06     7.0
    2012-01-01 00:00:08     7.0
    2012-01-01 00:00:10    11.0
    2012-01-01 00:00:12    11.0
    2012-01-01 00:00:14    15.0
    2012-01-01 00:00:16    16.0
    2012-01-01 00:00:18    16.0
    2012-01-01 00:00:20    16.0
    2012-01-01 00:00:22    16.0
    2012-01-01 00:00:24    25.0
    2012-01-01 00:00:26    25.0
    2012-01-01 00:00:28    25.0
    2012-01-01 00:00:30    30.0
    Freq: 2S, dtype: float64
    
    In [68]: ts2.resample('2s').ffill().resample('2s').mean()  # this gives the same result as  ts2.resample('2s', how='mean', fill_method='ffill'), but is thus not the same as in 0.17.1
    Out[68]:
    2012-01-01 00:00:00     0
    2012-01-01 00:00:02     2
    2012-01-01 00:00:04     3
    2012-01-01 00:00:06     5
    2012-01-01 00:00:08     7
    2012-01-01 00:00:10     7
    2012-01-01 00:00:12    11
    2012-01-01 00:00:14    11
    2012-01-01 00:00:16    16
    2012-01-01 00:00:18    16
    2012-01-01 00:00:20    16
    2012-01-01 00:00:22    16
    2012-01-01 00:00:24    16
    2012-01-01 00:00:26    25
    2012-01-01 00:00:28    25
    2012-01-01 00:00:30    30
    Freq: 2S, dtype: int32
    
  • groups and indices properties give an AttributeError: 'DatetimeIndexResampler' object has no attribute 'grouper' if you did not yet use the Resampler object (so the grouper is not yet initialized; a possible lazy-initialization sketch follows this list):

    In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
    
    In [105]: ts = pd.Series(np.arange(len(rng)), index=rng)
    
    In [110]: rs = ts.resample('30s')
    
    In [111]: rs.groups
    AttributeError: 'DatetimeIndexResampler' object has no attribute 'grouper'
    
    In [112]: rs.mean()
    Out[112]:
    2012-01-01 00:00:00    14.5
    2012-01-01 00:00:30    44.5
    2012-01-01 00:01:00    74.5
    2012-01-01 00:01:30    94.5
    Freq: 30S, dtype: float64
    
    In [113]: rs.groups
    Out[113]:
    {Timestamp('2012-01-01 00:00:00', offset='30S'): 30,
    Timestamp('2012-01-01 00:00:30', offset='30S'): 60,
    Timestamp('2012-01-01 00:01:00', offset='30S'): 90,
    Timestamp('2012-01-01 00:01:30', offset='30S'): 100}
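
A possible fix is to initialize the grouper lazily, e.g. (a sketch; _set_binner is a hypothetical lazy-initialization hook):

@property
def groups(self):
    # make sure the binner/grouper exist before delegating, so .groups
    # works even when no aggregation has been run yet
    self._set_binner()
    return self.grouper.groups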
    

@jorisvandenbossche
Member

Ah, the reindex comment from above is just because this was passed through to the underlying deprecated ts.resample('15s', how='mean')-evaluated series, and s.reindex() without passing args just returns s.
But, you did use it somewhere in the docs, so I suppose this was a mistake there?

@jreback
Contributor Author

jreback commented Jan 26, 2016

yes, the .reindex as I noted above was a typo in the docs.
Addressing all of the others.

@jreback
Contributor Author

jreback commented Jan 26, 2016

@jorisvandenbossche ok, latest push should fix everything you mentioned here (except for a couple of issues I have marked in #12140)

@jreback
Contributor Author

jreback commented Jan 29, 2016

@jorisvandenbossche if you'd have a final look.

@jreback
Contributor Author

jreback commented Feb 1, 2016

any more comments @jorisvandenbossche ?

@jorisvandenbossche
Member

@jreback Thanks for all the edits based on my comments. I quickly looked at some of them, and all looks good! Only responded to two of them (both about the whatsnew).

I don't have the time at the moment to take a final look, but I trust my previous rounds of comments and your edits that this is good to go!

My only issue is still the behavior with dicts in agg, as we discussed above but did not really reach consensus on. But maybe this shouldn't hold up merging, as it would be good to have this in master for some time.
I will try to recapitulate my concerns later today in a somewhat clearer way (as your last comment was 'not sure what you are saying' :-)).
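
For context, the dict form in question looks roughly like this, reusing the r from the examples at the top (output omitted; the exact semantics are what is being discussed):

r.agg({'A': 'sum', 'B': 'mean'})   # a different aggregation per column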

@jreback
Contributor Author

jreback commented Feb 2, 2016

@jorisvandenbossche ok, I took out the aggregation clarification docs, they are the same as existing (my mistake), except for a slightly better error message. The behavior has not changed AFAICT from 0.17.1.

@jreback jreback closed this in 1dc49f5 Feb 2, 2016
@wesm
Member

wesm commented Feb 2, 2016

yay! let's get flake8 pandas completely clean now and flip the switch on Travis?

@jreback
Contributor Author

jreback commented Feb 2, 2016

@wesm almost done: #12208
