[FEA] Time series resampling (df.resample) #8416

beckernick · 2021-06-01T13:38:59Z

In pandas, I can "resample" time series data (converting the frequency, or up/downsampling the data, while keeping track of the associated values) with the convenient `resample` API. This feature request comes courtesy of this Stack Overflow post.

In the following example, per-minute data is aggregated into three-minute bins and the values in each bin are summed.

import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
print(series, "\n")
print(series.resample('3T').sum())
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64 

2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
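
To make the binning in the example above explicit, here is a minimal pure-Python sketch of what a three-minute "sum" downsample computes. It illustrates the semantics only and is not how pandas or cuDF implement `resample`. The function name `downsample_sum` is hypothetical, bins are assumed to be left-closed and anchored at midnight of the first timestamp's day, and unlike pandas the sketch omits empty bins.

```python
from datetime import datetime, timedelta


def downsample_sum(points, bin_width):
    """Sum (timestamp, value) pairs into left-closed bins of width bin_width.

    Assumption: bins are anchored at midnight of the earliest timestamp's day.
    Empty bins are not emitted.
    """
    if not points:
        return {}
    earliest = min(ts for ts, _ in points)
    origin = datetime.combine(earliest.date(), datetime.min.time())
    bins = {}
    for ts, value in points:
        # Integer number of whole bins between the origin and this timestamp.
        offset = (ts - origin) // bin_width
        bin_start = origin + offset * bin_width
        bins[bin_start] = bins.get(bin_start, 0) + value
    return dict(sorted(bins.items()))


start = datetime(2000, 1, 1)
points = [(start + timedelta(minutes=i), i) for i in range(9)]
for bin_start, total in downsample_sum(points, timedelta(minutes=3)).items():
    print(bin_start, total)
```

For the nine per-minute values above, this yields bins starting at 00:00, 00:03, and 00:06 with sums 3, 12, and 21, matching the pandas output.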

shwina · 2021-06-01T19:02:36Z

There has been some related discussion in #6255.

shwina · 2021-06-02T20:37:43Z

Just curious: does Spark have similar resampling functionality? Google says "no", but just thought to check. cc: @revans2 @jlowe

jlowe · 2021-06-03T14:28:33Z

> does Spark have similar resampling functionality?

No, Spark does not have built-in resampling like this.

Closes #6255, #8416

This PR implements two related features:

1. Grouping by a frequency via the `freq=` argument to `cudf.Grouper`
2. Time-series resampling via the `.resample()` API

Either operation results in a `_Resampler` object that represents the data resampled into "bins" of a particular frequency. The following operations are supported on resampled data:

1. Aggregations such as `min()` and `max()`, performed bin-wise
2. `ffill()` and `bfill()` methods: forward and backward filling in the case of upsampling data
3. `asfreq()`: returns the resampled data as a Series or DataFrame

These are all best understood by example.

First, we create a time series with 1 minute intervals:

```python
>>> index = cudf.date_range(start="2001-01-01", periods=10, freq="1T")
>>> sr = cudf.Series(range(10), index=index)
>>> sr
2001-01-01 00:00:00    0
2001-01-01 00:01:00    1
2001-01-01 00:02:00    2
2001-01-01 00:03:00    3
2001-01-01 00:04:00    4
2001-01-01 00:05:00    5
2001-01-01 00:06:00    6
2001-01-01 00:07:00    7
2001-01-01 00:08:00    8
2001-01-01 00:09:00    9
dtype: int64
```

Downsampling to 3 minute intervals, followed by a "sum" aggregation:

```python
>>> sr.resample("3T").sum()  # equivalently, sr.groupby(cudf.Grouper(freq="3T")).sum()
2001-01-01 00:00:00     3
2001-01-01 00:03:00    12
2001-01-01 00:06:00    21
2001-01-01 00:09:00     9
dtype: int64
```

Upsampling to 30 second intervals:

```python
>>> sr.resample("30s").asfreq()
2001-01-01 00:00:00    0.0
2001-01-01 00:00:30    NaN
2001-01-01 00:01:00    1.0
2001-01-01 00:01:30    NaN
2001-01-01 00:02:00    2.0
2001-01-01 00:02:30    NaN
2001-01-01 00:03:00    3.0
2001-01-01 00:03:30    NaN
2001-01-01 00:04:00    4.0
2001-01-01 00:04:30    NaN
2001-01-01 00:05:00    5.0
2001-01-01 00:05:30    NaN
2001-01-01 00:06:00    6.0
2001-01-01 00:06:30    NaN
2001-01-01 00:07:00    7.0
2001-01-01 00:07:30    NaN
2001-01-01 00:08:00    8.0
2001-01-01 00:08:30    NaN
2001-01-01 00:09:00    9.0
Freq: 30S, dtype: float64
```

Upsampling to 30 second intervals, followed by a forward fill:

```python
>>> sr.resample("30s").ffill()
2001-01-01 00:00:00    0
2001-01-01 00:00:30    0
2001-01-01 00:01:00    1
2001-01-01 00:01:30    1
2001-01-01 00:02:00    2
2001-01-01 00:02:30    2
2001-01-01 00:03:00    3
2001-01-01 00:03:30    3
2001-01-01 00:04:00    4
2001-01-01 00:04:30    4
2001-01-01 00:05:00    5
2001-01-01 00:05:30    5
2001-01-01 00:06:00    6
2001-01-01 00:06:30    6
2001-01-01 00:07:00    7
2001-01-01 00:07:30    7
2001-01-01 00:08:00    8
2001-01-01 00:08:30    8
2001-01-01 00:09:00    9
Freq: 30S, dtype: int64
```

Authors:

- Ashwin Srinath (https://github.com/shwina)
- Michael Wang (https://github.com/isVoid)

Approvers:

- https://github.com/brandon-b-miller
- Vyas Ramasubramani (https://github.com/vyasr)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #9178
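
The upsample-then-forward-fill behavior described in the PR can be sketched in plain Python. This is only an illustration of the semantics, assuming the input is sorted by timestamp and the output grid runs from the first to the last input timestamp. The function name `upsample_ffill` is hypothetical and is not part of cuDF or pandas.

```python
from datetime import datetime, timedelta


def upsample_ffill(points, step):
    """Place sorted (timestamp, value) pairs on a regular grid of width step.

    Each grid slot takes the most recent value at or before its timestamp
    (a forward fill). Assumption: points is sorted by timestamp.
    """
    if not points:
        return []
    result = []
    current = points[0][0]
    last_ts = points[-1][0]
    i = 0
    last_value = None
    while current <= last_ts:
        # Advance past every input point at or before the current grid slot.
        while i < len(points) and points[i][0] <= current:
            last_value = points[i][1]
            i += 1
        result.append((current, last_value))
        current += step
    return result


start = datetime(2001, 1, 1)
points = [(start + timedelta(minutes=i), i) for i in range(10)]
for ts, value in upsample_ffill(points, timedelta(seconds=30)):
    print(ts, value)
```

For the ten per-minute values used in the PR description, this produces 19 rows with each value repeated on the following half-minute slot, which is the same shape as the `ffill()` output shown there.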

beckernick added the "feature request" (New feature or request) and "Needs Triage" (Need team to review and classify) labels Jun 1, 2021

shwina added the "Python" (Affects Python cuDF API.) label and removed the "Needs Triage" (Need team to review and classify) label Jun 1, 2021

beckernick added this to the Time Series Analysis milestone Jul 14, 2021

shwina self-assigned this Jul 15, 2021

shwina mentioned this issue Sep 3, 2021

Grouping by frequency and resampling #9178

Merged

galipremsagar linked a pull request Nov 10, 2021 that will close this issue

Grouping by frequency and resampling #9178

Merged

rapids-bot bot closed this as completed in #9178 Nov 13, 2021
