ENH: new interpolate kwarg to skip gaps #16457

Open
naifrec opened this issue May 23, 2017 · 0 comments
Labels
Datetime Datetime data dtype Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Resample resample method


naifrec commented May 23, 2017

Hey guys! I have the feeling the limit kwarg does not behave as you would expect when working with time series. To cite @rhkarls in issue #1892:

Say limit=2, if there is a NaN gap of 2 it would be completely filled with interpolated values. If there is a NaN gap of 4 nothing is filled, which is different from the fillna limit where the two first entries would be filled when using forward filling. This is very applicable for time series where it is often valid to interpolate between small gaps, while larger gaps should not be filled.
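To make the quoted behavior concrete, here is a minimal sketch on a toy Series (the values are made up for illustration and are not from the original report). It shows what `interpolate(limit=2)` currently does with a gap of two and a gap of four NaNs:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): one NaN gap of length 2,
# then one NaN gap of length 4.
s = pd.Series([0.0, np.nan, np.nan, 3.0,
               np.nan, np.nan, np.nan, np.nan, 8.0])

# Default linear interpolation, filling at most 2 consecutive NaNs
# in the forward direction.
filled = s.interpolate(limit=2)
print(filled)
```

The short gap is filled completely, while the long gap has its first two entries filled and the remaining two left as NaN. The behavior requested in the quote would leave all four entries of the long gap as NaN.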

So lemme write an example:

import numpy as np
import pandas as pd


df = pd.DataFrame(
    index=pd.date_range(
        start='02-01-2017 06:00:00',
        end='02-07-2017 06:00:00'),
    data={'A': range(7)})
df = df.drop(pd.to_datetime('2017-02-02 06:00:00'), axis=0)

df.head()

                     A
2017-02-01 06:00:00  0
2017-02-03 06:00:00  2
2017-02-04 06:00:00  3
2017-02-05 06:00:00  4
2017-02-06 06:00:00  5

Now what I want is to resample and interpolate the time series every 12 hours, but only between consecutive days, so as not to make overly strong assumptions about the behavior of the time series over larger time deltas. That is currently not directly possible because of how limit works. See below, where I pass limit=2 (i.e. a limit of one day at 12-hour steps), hoping it would mean "if more than two consecutive values are NaN, do not fill any of them":

df.resample(rule='12H', base=6).interpolate('time', limit=2)

                       A
2017-02-01 06:00:00  0.0
2017-02-01 18:00:00  0.5  # I would expect this to be NaN
2017-02-02 06:00:00  1.0  # I would expect this to be NaN
2017-02-02 18:00:00  NaN
2017-02-03 06:00:00  2.0
2017-02-03 18:00:00  2.5
2017-02-04 06:00:00  3.0
2017-02-04 18:00:00  3.5
2017-02-05 06:00:00  4.0
2017-02-05 18:00:00  4.5
2017-02-06 06:00:00  5.0
2017-02-06 18:00:00  5.5
2017-02-07 06:00:00  6.0

To achieve what I want now, I have to use these functions I made:

def interpolate_consecutive(df, frequency):
    """
    Resample and interpolate at the requested frequency, keeping
    interpolated values only between samples at most a day apart.

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe with Time series index
    frequency : str
        Frequency to use to resample then interpolate.
        Only expects 'H' or 'T' based rules, but that's
        because I only need to support these in my case.

    Returns
    -------
    df : pd.DataFrame
        Resampled and interpolated dataframe.

    """
    base = 6 if 'H' in frequency else 0
    start_indices, end_indices = get_non_consecutive(
        df, pd.Timedelta(days=1))
    df = df.resample(rule=frequency, base=base).interpolate('time')

    indices_to_drop = []
    for start_date, end_date in zip(start_indices, end_indices):
        indices_to_drop.extend(list(df.index[
            np.logical_and(start_date < df.index,
                           df.index < end_date)]))
    df.drop(indices_to_drop, axis=0, inplace=True)
    return df


def get_non_consecutive(df, timedelta):
    """
    Get the tuple (start_indices, end_indices) of all
    non-consecutive periods in the dataframe index.
    Two timestamps separated by more than timedelta
    are considered non-consecutive.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe with Time series index
    timedelta : pd.Timedelta
        Time delta.
    
    Returns
    -------
    start_dates : array-like
        List of start dates of non consecutive periods
    end_dates : array-like
        List of end dates of non consecutive periods

    """
    where = np.where(
        df.index[1:] - df.index[:-1] > timedelta)[0]
    return df.index[where], df.index[where + 1]

Using these functions, I now get my desired output:

interpolate_consecutive(df, '12H')

                       A
2017-02-01 06:00:00  0.0
2017-02-03 06:00:00  2.0
2017-02-03 18:00:00  2.5
2017-02-04 06:00:00  3.0
2017-02-04 18:00:00  3.5
2017-02-05 06:00:00  4.0
2017-02-05 18:00:00  4.5
2017-02-06 06:00:00  5.0
2017-02-06 18:00:00  5.5
2017-02-07 06:00:00  6.0
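For reference, the gap detection done by get_non_consecutive can also be expressed with `diff()` on the index. This is only a sketch on the example frame from above; the names `is_gap_end`, `gap_starts` and `gap_ends` are made up for illustration:

```python
import pandas as pd

# Rebuild the example frame: daily samples at 06:00 with one day dropped.
index = pd.date_range('2017-02-01 06:00', periods=7,
                      freq=pd.Timedelta(days=1))
df = pd.DataFrame({'A': range(7)}, index=index)
df = df.drop(pd.Timestamp('2017-02-02 06:00'))

stamps = df.index.to_series()
# True at each timestamp whose distance to the previous one exceeds a day.
is_gap_end = stamps.diff() > pd.Timedelta(days=1)
gap_ends = df.index[is_gap_end.to_numpy()]
# The gap start is the timestamp just before each gap end.
gap_starts = df.index[is_gap_end.shift(-1, fill_value=False).to_numpy()]

print(gap_starts, gap_ends)
```

On this frame the only gap found runs from 2017-02-01 06:00 to 2017-02-03 06:00, which is the interval whose interpolated rows are dropped in the output above.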

tl;dr: limit should not always forward-fill. There should be a way to check the length of each NaN gap and fill nothing at all when the gap is longer than the limit.
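As a rough sketch of the requested semantics (not an existing pandas option), one can interpolate everything and then mask back the NaN runs that were too long. The helper name `interpolate_small_gaps` and its `max_gap` parameter are hypothetical, and the 12-hour grid is built with `reindex` on an explicit `date_range` instead of `resample` to keep the example short:

```python
import pandas as pd


def interpolate_small_gaps(s, max_gap, **kwargs):
    """Interpolate only NaN runs of at most `max_gap` values (hypothetical helper)."""
    is_nan = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run each element belongs to.
    run_len = is_nan.groupby(run_id).transform('size')
    too_long = is_nan & (run_len > max_gap)
    # Interpolate everything, then restore NaN inside the long gaps.
    return s.interpolate(**kwargs).mask(too_long)


# Same data as in the example above, as a float Series.
daily = pd.Series(
    range(7), dtype=float,
    index=pd.date_range('2017-02-01 06:00', periods=7,
                        freq=pd.Timedelta(days=1)))
daily = daily.drop(pd.Timestamp('2017-02-02 06:00'))

# Put the data on a 12-hour grid; new timestamps become NaN.
grid = pd.date_range('2017-02-01 06:00', '2017-02-07 06:00',
                     freq=pd.Timedelta(hours=12))
s = daily.reindex(grid)

result = interpolate_small_gaps(s, max_gap=2, method='time')
print(result)
```

Unlike interpolate_consecutive, this keeps the rows of the long gap as NaN instead of dropping them; calling `.dropna()` on the result would mirror the output above. Leading and trailing NaNs still follow the usual `interpolate` rules, so passing `limit_area='inside'` may be needed if only interior gaps should be touched.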

Thank you for taking the time to read this, hope I made myself clear.

@naifrec naifrec changed the title from "Enhancement: new interpolate kwarg to skip gaps" to "ENH: new interpolate kwarg to skip gaps" May 23, 2017
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Resample resample method Datetime Datetime data dtype Difficulty Intermediate Enhancement labels May 25, 2017
@jreback jreback added this to the Next Major Release milestone May 25, 2017
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

4 participants