[FEA] Linear interpolation of missing values #8685

Closed
beckernick opened this issue Jul 7, 2021 · 0 comments · Fixed by #8767
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

beckernick commented Jul 7, 2021

In time series analyses, I may have missing values throughout my data that I would like to fill. In the general case, I can use Series.fillna to fill missing values with things like a scalar, the preceding valid value (forward fill), or the next valid value (backward fill).
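For reference, the three fill strategies mentioned above can be sketched in pandas roughly as follows. The sample values are illustrative only and reuse the series from the example further down in this issue.

```python
import pandas as pd

s = pd.Series([0, 2, None, None, None, 8])

# Replace every missing value with a fixed scalar.
filled_scalar = s.fillna(0)

# Forward fill: propagate the preceding valid value.
filled_forward = s.ffill()

# Backward fill: propagate the next valid value.
filled_backward = s.bfill()
```

None of these follow the trend between 2 and 8, which is the gap interpolation is meant to close.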

When my data has a temporal trend, the effectiveness of these techniques can break down. In such situations, I often want to interpolate between the valid values to replace my missing values with a statistical approach rather than fill missing values with a single scalar, forward, or backward fill.

Pandas provides such functionality via an interpolate API that delegates to numpy.interp and the scipy.interpolate sub-package to support a variety of interpolation techniques (including the standard forward fill described above).

I'd like to be able to conduct linear interpolation like I can in pandas. For linear interpolation, pandas delegates to numpy.interp, which has a corresponding implementation in CuPy (meaning this may be possible in the Python layer).

import pandas as pd
s = pd.Series([0, 2, None, None, None, 8])
s.interpolate(method='linear')
0    0.0
1    2.0
2    3.5
3    5.0
4    6.5
5    8.0
dtype: float64
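As a rough sketch of how the same result could be obtained from `numpy.interp` directly, the snippet below interpolates over integer positions using only the valid points. The helper name `interpolate_linear` is hypothetical, and the assumption that `cupy.interp` can be substituted with the same arguments is not verified here.

```python
import numpy as np

def interpolate_linear(values):
    """Fill NaNs by linear interpolation over integer positions.

    Hypothetical helper for illustration; assumes at least one valid value.
    """
    arr = np.asarray(values, dtype="float64")
    positions = np.arange(arr.size)
    valid = ~np.isnan(arr)
    # np.interp(x, xp, fp): evaluate at every position, using only the
    # positions and values of the non-missing entries as known points.
    return np.interp(positions, positions[valid], arr[valid])

result = interpolate_linear([0, 2, np.nan, np.nan, np.nan, 8])
# result is [0.0, 2.0, 3.5, 5.0, 6.5, 8.0], matching the pandas output above
```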
beckernick added the feature request and Python labels Jul 7, 2021
brandon-b-miller self-assigned this Jul 12, 2021
beckernick added this to the Time Series Analysis milestone Jul 14, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 10, 2021
Adds Series and DataFrame level functions for linear interpolation of missing values, built around CuPy's `interp` method. 

The pandas `interpolate` API supports somewhat varied functionality for filling `NaN`s. It currently does not work for actual `<NA>` values; see the pandas issue [here](pandas-dev/pandas#40252). That said, one might expect both kinds of missing data to be treated equally for the purposes of interpolation, and this PR does that.

While `cp.interp` is great for getting us off the ground, it only supports linear interpolation, and its results aren't exactly what pandas produces. In particular, pandas will not fill `NaN`s at the start of the series, because the default value of `limit_direction` is `forward` and the default `limit` is `None`, which from my experimentation means 'unlimited'. Under those same defaults, the `NaN`s at the end WILL get filled. This means we need to actually figure out where the first valid value is and mask out everything before it with `NaN`s.
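The masking step described above might look roughly like this. It is a minimal sketch written against NumPy rather than CuPy, with a hypothetical helper name; it is not the code from the PR. `np.interp` clamps positions outside the known range to the first and last valid values, so the leading positions are reset to `NaN` afterwards while the trailing ones keep the last valid value.

```python
import numpy as np

def interpolate_like_pandas(values):
    """Linear interpolation that leaves leading NaNs unfilled.

    Hypothetical helper; a sketch of the masking idea, not the PR's code.
    """
    arr = np.asarray(values, dtype="float64")
    valid = ~np.isnan(arr)
    if not valid.any():
        # Nothing to interpolate from; return the input unchanged.
        return arr.copy()
    positions = np.arange(arr.size)
    filled = np.interp(positions, positions[valid], arr[valid])
    # Index of the first non-missing entry; everything before it is
    # reset to NaN to mimic limit_direction='forward'.
    first_valid = int(np.argmax(valid))
    filled[:first_valid] = np.nan
    return filled

out = interpolate_like_pandas([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan])
# out is [nan, nan, 1.0, 2.0, 3.0, 3.0]
```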

Closes #8685.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8767
2 participants