[FEA] Time series resampling (df.resample) #8416

Closed
beckernick opened this issue Jun 1, 2021 · 3 comments · Fixed by #9178
Labels: feature request (New feature or request), Python (Affects Python cuDF API.)

Comments

@beckernick
Member

beckernick commented Jun 1, 2021

In pandas, I can "resample" time series data (converting the frequency, or up/downsampling the data while keeping track of the associated values) with the convenient resample API. This feature request comes courtesy of this Stack Overflow post.

In the following example, per-minute data is aggregated into three-minute bins and the values in each bin are summed.

import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
print(series, "\n")
print(series.resample('3T').sum())
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64 

2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
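
For readers unfamiliar with what the downsampling step does, the following is a minimal standard-library sketch of the binning logic, not the pandas or cuDF implementation. The helper name `downsample_sum` is hypothetical, and anchoring the bins at midnight of the first day is an assumption chosen to mirror pandas' default behavior. Unlike pandas, this sketch does not emit empty bins.

```python
from datetime import datetime, timedelta


def downsample_sum(points, bin_width):
    """Sum (timestamp, value) pairs into left-closed bins of width bin_width.

    Hypothetical helper for illustration only.
    """
    if not points:
        return {}
    # Assumption: bins are anchored at midnight of the earliest timestamp's day.
    first = min(ts for ts, _ in points)
    origin = first.replace(hour=0, minute=0, second=0, microsecond=0)
    sums = {}
    for ts, value in sorted(points):
        # Floor the timestamp to the start of the bin that contains it.
        bin_start = origin + ((ts - origin) // bin_width) * bin_width
        sums[bin_start] = sums.get(bin_start, 0) + value
    return sums


start = datetime(2000, 1, 1)
points = [(start + timedelta(minutes=i), i) for i in range(9)]
result = downsample_sum(points, timedelta(minutes=3))
for bin_start, total in result.items():
    print(bin_start, total)
```

With the nine per-minute points above, the three bins sum to 3, 12, and 21, matching the pandas output.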
beckernick added the "feature request" (New feature or request) and "Needs Triage" (Need team to review and classify) labels Jun 1, 2021
shwina added the "Python" (Affects Python cuDF API.) label and removed the "Needs Triage" label Jun 1, 2021
@shwina
Contributor

shwina commented Jun 1, 2021

There has been some related discussion in #6255.

@shwina
Contributor

shwina commented Jun 2, 2021

Just curious: does Spark have similar resampling functionality? Google says "no", but just thought to check. cc: @revans2 @jlowe

@jlowe
Contributor

jlowe commented Jun 3, 2021

> does Spark have similar resampling functionality?

No, Spark does not have built-in resampling like this.

beckernick added this to the Time Series Analysis milestone Jul 14, 2021
shwina self-assigned this Jul 15, 2021
galipremsagar linked a pull request Nov 10, 2021 that will close this issue
rapids-bot bot pushed a commit that referenced this issue Nov 13, 2021
Closes #6255, #8416 

This PR implements two related features:

1. Grouping by a frequency via the `freq=` argument to `cudf.Grouper`
2. Time-series resampling via the `.resample()` API

Either operation results in a `_Resampler` object that represents the data resampled into "bins" of a particular frequency. The following operations are supported on resampled data:

1. Aggregations such as `min()` and `max()`, performed bin-wise
2. `ffill()` and `bfill()` methods: forward and backward filling in the case of upsampling data
3. `asfreq()`: returns the resampled data as a Series or DataFrame

These are all best understood by example:

First, we create a time series with 1-minute intervals:

```python
>>> index = cudf.date_range(start="2001-01-01", periods=10, freq="1T")
>>> sr = cudf.Series(range(10), index=index)
>>> sr
2001-01-01 00:00:00    0
2001-01-01 00:01:00    1
2001-01-01 00:02:00    2
2001-01-01 00:03:00    3
2001-01-01 00:04:00    4
2001-01-01 00:05:00    5
2001-01-01 00:06:00    6
2001-01-01 00:07:00    7
2001-01-01 00:08:00    8
2001-01-01 00:09:00    9
dtype: int64
```

Downsampling to 3-minute intervals, followed by a "sum" aggregation:

```python
>>> sr.resample("3T").sum()  # equivalently, sr.groupby(cudf.Grouper(freq="3T")).sum()
2001-01-01 00:00:00     3
2001-01-01 00:03:00    12
2001-01-01 00:06:00    21
2001-01-01 00:09:00     9
dtype: int64
```

Upsampling to 30-second intervals:

```python
>>> sr.resample("30s").asfreq()
2001-01-01 00:00:00    0.0
2001-01-01 00:00:30    NaN
2001-01-01 00:01:00    1.0
2001-01-01 00:01:30    NaN
2001-01-01 00:02:00    2.0
2001-01-01 00:02:30    NaN
2001-01-01 00:03:00    3.0
2001-01-01 00:03:30    NaN
2001-01-01 00:04:00    4.0
2001-01-01 00:04:30    NaN
2001-01-01 00:05:00    5.0
2001-01-01 00:05:30    NaN
2001-01-01 00:06:00    6.0
2001-01-01 00:06:30    NaN
2001-01-01 00:07:00    7.0
2001-01-01 00:07:30    NaN
2001-01-01 00:08:00    8.0
2001-01-01 00:08:30    NaN
2001-01-01 00:09:00    9.0
Freq: 30S, dtype: float64
```
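
As a rough illustration of what `asfreq()` does when upsampling, here is a standard-library sketch that reindexes the observations onto a regular 30-second grid. It is not the cuDF implementation; the helper name `upsample_asfreq` is hypothetical, and `None` stands in for the `NaN` values shown above (which is also why the real output is promoted to `float64`).

```python
from datetime import datetime, timedelta


def upsample_asfreq(points, step):
    """Place (timestamp, value) pairs on a regular grid; gaps become None.

    Hypothetical helper for illustration only.
    """
    lookup = dict(points)
    start, end = min(lookup), max(lookup)
    grid = []
    ts = start
    while ts <= end:
        # Slots with no observation at exactly this timestamp are left empty.
        grid.append((ts, lookup.get(ts)))
        ts += step
    return grid


start = datetime(2001, 1, 1)
points = [(start + timedelta(minutes=i), i) for i in range(10)]
upsampled = upsample_asfreq(points, timedelta(seconds=30))
for ts, value in upsampled[:4]:
    print(ts, value)
```

The grid spans the first to the last observation inclusive, so ten per-minute points yield 19 half-minute slots, every other one empty.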

Upsampling to 30-second intervals, followed by a forward fill:

```python
>>> sr.resample("30s").ffill()
2001-01-01 00:00:00    0
2001-01-01 00:00:30    0
2001-01-01 00:01:00    1
2001-01-01 00:01:30    1
2001-01-01 00:02:00    2
2001-01-01 00:02:30    2
2001-01-01 00:03:00    3
2001-01-01 00:03:30    3
2001-01-01 00:04:00    4
2001-01-01 00:04:30    4
2001-01-01 00:05:00    5
2001-01-01 00:05:30    5
2001-01-01 00:06:00    6
2001-01-01 00:06:30    6
2001-01-01 00:07:00    7
2001-01-01 00:07:30    7
2001-01-01 00:08:00    8
2001-01-01 00:08:30    8
2001-01-01 00:09:00    9
Freq: 30S, dtype: int64
```
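
The forward fill can be sketched in plain Python in the same spirit: walk the 30-second grid and carry the most recent observation into each empty slot. This is an illustrative sketch only, with the hypothetical helper name `upsample_ffill`; it makes no claim about how cuDF performs the fill on the GPU.

```python
from datetime import datetime, timedelta


def upsample_ffill(points, step):
    """Upsample onto a regular grid, carrying the last seen value forward.

    Hypothetical helper for illustration only.
    """
    lookup = dict(points)
    start, end = min(lookup), max(lookup)
    filled = []
    last_seen = None
    ts = start
    while ts <= end:
        if ts in lookup:
            # A real observation exists at this slot; remember it.
            last_seen = lookup[ts]
        filled.append((ts, last_seen))
        ts += step
    return filled


start = datetime(2001, 1, 1)
points = [(start + timedelta(minutes=i), i) for i in range(10)]
filled = upsample_ffill(points, timedelta(seconds=30))
values = [value for _, value in filled]
print(values)
```

Because no slot is left empty, the values can stay integers, which is consistent with the `int64` dtype in the output above.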

Authors:
  - Ashwin Srinath (https://github.com/shwina)
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - https://github.com/brandon-b-miller
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #9178