ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

Merged
merged 26 commits into from
May 10, 2020
Commits
3aa2ae3
ENH: add 'origin and 'offset' arguments to 'resample' and 'pd.Grouper'
hasB4K Feb 8, 2020
c16733b
ENH: change how the warning are handled for base and loffset
hasB4K Feb 15, 2020
9b000d7
TST: move deprecated tests of loffset and base into test_deprecated
hasB4K Feb 15, 2020
6c03bf4
ENH: check if origin for resample is timezone aware when needed
hasB4K Feb 16, 2020
532cdfe
DOC: add 'Use or to adjust the start of the bins' section into time…
hasB4K Feb 16, 2020
f158cdb
DOC: simplify doc for What's new and add a comment on the deprecation…
hasB4K Feb 16, 2020
78ed64c
DOC: Add example for origin and offset in resample and in pd.Grouper
hasB4K Feb 16, 2020
77f507d
CLN: review clean part one
hasB4K Mar 31, 2020
426d8c7
CLN: review clean part two
hasB4K Mar 31, 2020
bad3ed6
DOC: update documentation to be more clearer (review part 3)
hasB4K Mar 31, 2020
687429e
CLN: review fix - move warning of 'loffset' and 'base' into pd.Grouper
hasB4K Apr 4, 2020
7d4de49
CLN: fix lint issue with isort
hasB4K Apr 4, 2020
b83c5bf
Update pandas/core/generic.py
hasB4K Apr 9, 2020
3e24d53
CLN: add TimestampCompatibleTypes and TimedeltaCompatibleTypes in pan…
hasB4K Apr 11, 2020
c2ee661
CLN: fix lint issue with isort
hasB4K Apr 11, 2020
a6e94c0
ENH: support 'epoch', 'start_day' and 'start' for origin
hasB4K Apr 11, 2020
53802e5
DOC: add doc for origin that uses 'epoch', 'start' or 'start_day'
hasB4K Apr 12, 2020
3fc2bf6
TST: add test for origin that uses 'epoch', 'start' or 'start_day'
hasB4K Apr 12, 2020
4ad979a
BUG: fix a timezone bug between origin and index on df.resample
hasB4K Apr 12, 2020
343a30a
DOC: change doc after review
hasB4K May 1, 2020
efb572e
CLN: change typing for TimestampConvertibleTypes
hasB4K May 1, 2020
fcdde91
CLN: add nice message for ValueError of 'origin' and 'offset' in resa…
hasB4K May 1, 2020
1fec946
BUG: fix a bug when resampling in DST context
hasB4K May 2, 2020
5695ffb
TST: fix deprecation test
hasB4K May 2, 2020
de6b477
TST: using pytz instead of datetutil in test of test_resample_origin_…
hasB4K May 2, 2020
05ddd9b
CLN: remove unused import
hasB4K May 9, 2020
61 changes: 55 additions & 6 deletions doc/source/user_guide/timeseries.rst
Expand Up @@ -1572,19 +1572,16 @@ end of the interval is closed:

ts.resample('5Min', closed='left').mean()

Parameters like ``label`` and ``loffset`` are used to manipulate the resulting
labels. ``label`` specifies whether the result is labeled with the beginning or
the end of the interval. ``loffset`` performs a time adjustment on the output
labels.
Parameters like ``label`` are used to manipulate the resulting labels.
``label`` specifies whether the result is labeled with the beginning or
the end of the interval.

.. ipython:: python

ts.resample('5Min').mean() # by default label='left'

ts.resample('5Min', label='left').mean()

ts.resample('5Min', label='left', loffset='1s').mean()

.. warning::

The default values for ``label`` and ``closed`` are '**left**' for all
Expand Down Expand Up @@ -1789,6 +1786,58 @@ natural and functions similarly to :py:func:`itertools.groupby`:

See :ref:`groupby.iterating-label` or :class:`Resampler.__iter__` for more.

.. _timeseries.adjust-the-start-of-the-bins:

Use `origin` or `offset` to adjust the start of the bins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 1.1.0

By default, the bins of the grouping are aligned to the beginning of the day on which the time series starts. This works well with frequencies that are multiples of a day (like ``30D``) or that divide a day evenly (like ``90s`` or ``1min``). It can create inconsistencies with frequencies that do not meet this criterion. To change this behavior, you can specify a fixed Timestamp with the argument ``origin``.

For example:

.. ipython:: python

start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
middle = '2000-10-02 00:00:00'
rng = pd.date_range(start, end, freq='7min')
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
ts

Here we can see that, when using ``origin`` with its default value (``'start_day'``), the results after ``'2000-10-02 00:00:00'`` differ depending on where the time series starts:

.. ipython:: python

ts.resample('17min', origin='start_day').sum()
ts[middle:end].resample('17min', origin='start_day').sum()


Here we can see that, when setting ``origin`` to ``'epoch'``, the results after ``'2000-10-02 00:00:00'`` are identical regardless of where the time series starts:

.. ipython:: python

ts.resample('17min', origin='epoch').sum()
ts[middle:end].resample('17min', origin='epoch').sum()


If needed you can use a custom timestamp for ``origin``:

.. ipython:: python

ts.resample('17min', origin='2001-01-01').sum()
ts[middle:end].resample('17min', origin=pd.Timestamp('2001-01-01')).sum()

If needed, you can instead adjust the bins with an ``offset`` Timedelta, which is added to the default ``origin``.
These two examples are equivalent for this time series:

.. ipython:: python

ts.resample('17min', origin='start').sum()
ts.resample('17min', offset='23h30min').sum()


Note the use of ``'start'`` for ``origin`` in the last example. In that case, ``origin`` is set to the first value of the time series.
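
The bin-edge arithmetic behind these examples can be sketched as follows. This is only an illustration, assuming left-closed bins and timezone-naive timestamps; ``first_bin_edge`` is a hypothetical helper written for this note, not a pandas function.

```python
import numpy as np
import pandas as pd


def first_bin_edge(first, freq, origin):
    """Hypothetical helper: the latest edge ``origin + k * freq`` that is <= ``first``."""
    first = pd.Timestamp(first)
    origin = pd.Timestamp(origin)
    freq = pd.Timedelta(freq)
    # Floor division of two Timedeltas yields an integer number of whole bins.
    k = (first - origin) // freq
    return origin + k * freq


rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# 'start_day' corresponds to midnight of the first day in the index.
edge_start_day = first_bin_edge(ts.index[0], "17min", ts.index[0].normalize())
# 'epoch' corresponds to 1970-01-01.
edge_epoch = first_bin_edge(ts.index[0], "17min", "1970-01-01")
```

Because the edges computed from a fixed origin do not depend on the first timestamp of the data, slicing the series cannot move them, which is why the ``'epoch'`` results stay aligned.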

.. _timeseries.periods:

43 changes: 43 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
Expand Up @@ -152,6 +152,49 @@ For example:
pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')

.. _whatsnew_110.grouper_resample_origin:

Grouper and resample now support the arguments origin and offset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`Grouper` and :meth:`DataFrame.resample` now support the arguments ``origin`` and ``offset``, which let the user control the timestamp on which to adjust the grouping. (:issue:`31809`)

By default, the bins of the grouping are aligned to the beginning of the day on which the time series starts. This works well with frequencies that are multiples of a day (like ``30D``) or that divide a day evenly (like ``90s`` or ``1min``), but it can create inconsistencies with frequencies that do not meet this criterion. To change this behavior, you can now specify a fixed timestamp with the argument ``origin``.

Two arguments are now deprecated (more information in the documentation of :meth:`DataFrame.resample`):

- ``base`` should be replaced by ``offset``.
- ``loffset`` should be replaced by adding an offset directly to the index of the resampled DataFrame.
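
A minimal migration sketch for the two deprecated arguments, reusing the time series from the docstring examples in this PR. The variable names ``shifted_bins`` and ``relabeled`` are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Instead of base=2 with a 17-minute rule: shift the bin edges by 2 minutes.
shifted_bins = ts.resample("17min", offset="2min").sum()

# Instead of loffset="19min": resample first, then shift only the labels.
relabeled = ts.resample("17min").sum()
relabeled.index = relabeled.index + to_offset("19min")
```

Note that ``offset`` changes which rows fall into each bin, whereas shifting the index afterwards only renames the bins and leaves the aggregated values untouched.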

Small example of the use of ``origin``:

.. ipython:: python

start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
middle = '2000-10-02 00:00:00'
rng = pd.date_range(start, end, freq='7min')
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
ts

Resample with the default behavior ``'start_day'`` (origin is ``2000-10-01 00:00:00``):

.. ipython:: python

ts.resample('17min').sum()
ts.resample('17min', origin='start_day').sum()

Resample using a fixed origin:

.. ipython:: python

ts.resample('17min', origin='epoch').sum()
ts.resample('17min', origin='2000-01-01').sum()

If needed, you can adjust the bins with the argument ``offset`` (a Timedelta), which is added to the default ``origin``.

For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.
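
The examples above use ``resample``; the same ``origin`` argument can also be passed to :class:`Grouper`. The following is a small sketch, assuming a DataFrame with a hypothetical datetime column named ``when`` and a hypothetical ``value`` column.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
df = pd.DataFrame({"when": rng, "value": np.arange(len(rng)) * 3})

# Group on the 'when' column in 17-minute bins anchored at the Unix epoch.
by_epoch = df.groupby(pd.Grouper(key="when", freq="17min", origin="epoch"))["value"].sum()

# The equivalent resample call on the same data, for comparison.
by_resample = df.set_index("when")["value"].resample("17min", origin="epoch").sum()
```

Both calls should produce the same bins and sums, so either form can be used depending on whether the timestamps live in a column or in the index.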


.. _whatsnew_110.enhancements.other:

Other enhancements
10 changes: 10 additions & 0 deletions pandas/_typing.py
@@ -1,3 +1,4 @@
from datetime import datetime, timedelta
from pathlib import Path
from typing import (
IO,
Expand Down Expand Up @@ -43,6 +44,15 @@
PandasScalar = Union["Period", "Timestamp", "Timedelta", "Interval"]
Scalar = Union[PythonScalar, PandasScalar]

# timestamp and timedelta convertible types

TimestampConvertibleTypes = Union[
"Timestamp", datetime, np.datetime64, int, np.int64, float, str
]
TimedeltaConvertibleTypes = Union[
"Timedelta", timedelta, np.timedelta64, int, np.int64, float, str
]

# other

Dtype = Union[
115 changes: 113 additions & 2 deletions pandas/core/generic.py
Expand Up @@ -39,6 +39,8 @@
Label,
Level,
Renamer,
TimedeltaConvertibleTypes,
TimestampConvertibleTypes,
ValueKeyFunc,
)
from pandas.compat import set_function_name
Expand Down Expand Up @@ -7760,9 +7762,11 @@ def resample(
convention: str = "start",
kind: Optional[str] = None,
loffset=None,
base: int = 0,
base: Optional[int] = None,
on=None,
level=None,
origin: Union[str, TimestampConvertibleTypes] = "start_day",
offset: Optional[TimedeltaConvertibleTypes] = None,
) -> "Resampler":
"""
Resample time-series data.
Expand Down Expand Up @@ -7797,17 +7801,40 @@ def resample(
By default the input representation is retained.
loffset : timedelta, default None
Adjust the resampled time labels.

.. deprecated:: 1.1.0
You should add the loffset to the `df.index` after the resample.
See below.

base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0.

.. deprecated:: 1.1.0
The new arguments that you should use are 'offset' or 'origin'.

on : str, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.

level : str or int, optional
For a MultiIndex, level (name or number) to use for
resampling. `level` must be datetime-like.
origin : {'epoch', 'start', 'start_day'}, Timestamp or str, default 'start_day'
The timestamp on which to adjust the grouping. The timezone of origin
must match the timezone of the index.
If a timestamp is not used, these values are also supported:

- 'epoch': `origin` is 1970-01-01
- 'start': `origin` is the first value of the timeseries
- 'start_day': `origin` is midnight of the first day of the timeseries

.. versionadded:: 1.1.0

offset : Timedelta or str, default None
An offset timedelta added to the origin.

.. versionadded:: 1.1.0

Returns
-------
Expand Down Expand Up @@ -8025,6 +8052,88 @@ def resample(
2000-01-02 22 140
2000-01-03 32 150
2000-01-04 36 90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64

>>> ts.resample('17min').sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64

>>> ts.resample('17min', origin='epoch').sum()
2000-10-01 23:18:00 0
2000-10-01 23:35:00 18
2000-10-01 23:52:00 27
2000-10-02 00:09:00 39
2000-10-02 00:26:00 24
Freq: 17T, dtype: int64

>>> ts.resample('17min', origin='2000-01-01').sum()
2000-10-01 23:24:00 3
2000-10-01 23:41:00 15
2000-10-01 23:58:00 45
2000-10-02 00:15:00 45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an `offset` Timedelta, the
following two lines are equivalent:

>>> ts.resample('17min', origin='start').sum()
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17T, dtype: int64

>>> ts.resample('17min', offset='23h30min').sum()
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17T, dtype: int64

To replace the use of the deprecated `base` argument, you can now use `offset`.
In this example, it is equivalent to `base=2`:

>>> ts.resample('17min', offset='2min').sum()
2000-10-01 23:16:00 0
2000-10-01 23:33:00 9
2000-10-01 23:50:00 36
2000-10-02 00:07:00 39
2000-10-02 00:24:00 24
Freq: 17T, dtype: int64

To replace the use of the deprecated `loffset` argument:

>>> from pandas.tseries.frequencies import to_offset
>>> loffset = '19min'
>>> ts_out = ts.resample('17min').sum()
>>> ts_out.index = ts_out.index + to_offset(loffset)
>>> ts_out
2000-10-01 23:33:00 0
2000-10-01 23:50:00 9
2000-10-02 00:07:00 21
2000-10-02 00:24:00 54
2000-10-02 00:41:00 24
Freq: 17T, dtype: int64
"""
from pandas.core.resample import get_resampler

Expand All @@ -8041,6 +8150,8 @@ def resample(
base=base,
key=on,
level=level,
origin=origin,
offset=offset,
)

def first(self: FrameOrSeries, offset) -> FrameOrSeries:
9 changes: 0 additions & 9 deletions pandas/core/groupby/groupby.py
Expand Up @@ -1646,15 +1646,6 @@ def resample(self, rule, *args, **kwargs):
0 2000-01-01 00:00:00 0 1
2000-01-01 00:03:00 0 2
5 2000-01-01 00:03:00 5 1

Add an offset of twenty seconds.

>>> df.groupby('a').resample('3T', loffset='20s').sum()
a b
a
0 2000-01-01 00:00:20 0 2
2000-01-01 00:03:20 0 1
5 2000-01-01 00:00:20 5 1
"""
from pandas.core.resample import get_resampler_for_grouping
