[FEA] Linear interpolation of missing values #8685

beckernick · 2021-07-07T22:03:07Z

In time series analyses, I may have missing values throughout my data that I would like to fill. In the general case, I can use Series.fillna to fill missing values with things like a scalar, the preceding valid value (forward fill), or the next valid value (backward fill).
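For reference, here is a minimal sketch of the three fill strategies mentioned above. It uses pandas because that is the API the issue refers to; the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data: two missing values between two valid ones.
s = pd.Series([1.0, None, None, 4.0])

filled_scalar = s.fillna(0.0)   # replace each missing value with a constant
filled_forward = s.ffill()      # carry the preceding valid value forward
filled_backward = s.bfill()     # pull the next valid value backward

print(filled_scalar.tolist())    # [1.0, 0.0, 0.0, 4.0]
print(filled_forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(filled_backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```

None of the three results follows the upward trend between 1.0 and 4.0, which is the gap that interpolation is meant to close.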

When my data has a temporal trend, the effectiveness of these techniques can break down. In such situations, I often want to interpolate between the valid values to replace my missing values with a statistical approach rather than fill missing values with a single scalar, forward, or backward fill.

Pandas provides such functionality via an interpolate API that delegates to numpy.interp and the scipy.interpolate sub-package to support a variety of interpolation techniques (including the standard forward fill described above).

I'd like to be able to conduct linear interpolation like I can in pandas. For linear interpolation, pandas delegates to numpy.interp, which has a corresponding implementation in cupy (meaning this may be possible in the Python layer).

import pandas as pd

s = pd.Series([0, 2, None, None, None, 8])
s.interpolate(method='linear')
0    0.0
1    2.0
2    3.5
3    5.0
4    6.5
5    8.0
dtype: float64
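As a rough sketch of how the same result could be obtained directly from `numpy.interp`: the helper name `linear_fill` is hypothetical, integer positions are assumed as the x-axis, and the input is assumed to contain at least one valid value. NumPy is used here so the example runs without a GPU; `cupy.interp` is intended to mirror the same call.

```python
import numpy as np

def linear_fill(values):
    """Hypothetical helper: fill NaNs by linear interpolation over positions."""
    values = np.asarray(values, dtype=np.float64)
    positions = np.arange(values.size)
    valid = ~np.isnan(values)
    # Evaluate the piecewise-linear function through the valid points
    # at every position, including the missing ones.
    return np.interp(positions, positions[valid], values[valid])

result = linear_fill([0, 2, np.nan, np.nan, np.nan, 8])
print(result.tolist())  # [0.0, 2.0, 3.5, 5.0, 6.5, 8.0]
```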


Adds Series- and DataFrame-level functions for linear interpolation of missing values, built around CuPy's `interp` function. The pandas `interpolate` API supports fairly varied functionality for filling `NaN`s, but it currently does not work for actual `<NA>` values (see the pandas issue [here](pandas-dev/pandas#40252)). That said, one might expect both kinds of missing data to be treated equally for the purposes of interpolation, and this PR does that.

While `cp.interp` is great for getting us off the ground, it only supports linear interpolation, and its results aren't exactly what pandas produces. In particular, pandas will not fill `NaN`s at the start of the series, because the default value of `limit_direction` is `forward` and the default `limit` is `None`, which from my experimentation means 'unlimited'. For the same reason, the `NaN`s at the end WILL get filled. This means we need to figure out where the first valid (non-`NaN`) value is and mask out the part of the series before it with `NaN`s.

Closes #8685.

Authors:
- https://github.com/brandon-b-miller

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Ashwin Srinath (https://github.com/shwina)

URL: #8767
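The masking step described in the PR summary can be sketched as follows. This is not the code from the PR: the function name `interpolate_like_pandas` is hypothetical, NumPy stands in for CuPy so the example runs on a CPU, and integer positions are assumed as the x-axis.

```python
import numpy as np

def interpolate_like_pandas(values):
    """Hypothetical sketch: linear fill that leaves leading NaNs untouched."""
    values = np.asarray(values, dtype=np.float64)
    valid = ~np.isnan(values)
    if not valid.any():
        # Nothing to interpolate from; return the input unchanged.
        return values.copy()
    positions = np.arange(values.size)
    # interp fills both ends with the nearest valid value by default.
    filled = np.interp(positions, positions[valid], values[valid])
    # Restore NaNs before the first valid value, since pandas' default
    # forward direction does not fill them.
    first_valid = int(np.argmax(valid))
    filled[:first_valid] = np.nan
    return filled

result = interpolate_like_pandas([np.nan, 1.0, np.nan, 3.0, np.nan])
print(result.tolist())  # [nan, 1.0, 2.0, 3.0, 3.0]
```

The trailing `NaN` is filled with the last valid value, while the leading one stays missing, which is the asymmetry the summary describes.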

beckernick added feature request New feature or request Python Affects Python cuDF API. labels Jul 7, 2021

brandon-b-miller self-assigned this Jul 12, 2021

beckernick added this to the Time Series Analysis milestone Jul 14, 2021

brandon-b-miller mentioned this issue Jul 19, 2021

Linear Interpolation of nans via cupy #8767

Merged

rapids-bot bot closed this as completed in #8767 Aug 10, 2021
