ENH: Added max_gap keyword for series.interpolate #25141

cchwala · 2019-02-04T14:41:41Z

closes #12187 (Pandas interpolation enhancement request: specifying the maximum gap to interpolate) and #26796 (limit_area and limit_direction do not have an effect when interpolation method is 'pad')
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR introduces the new keyword max_gap for interpolate. For every NaN gap wider than max_gap, no interpolation is carried out and all NaNs of the gap are preserved. In contrast, the limit kwarg only caps how many consecutive NaNs are filled, so it does not prevent values from being interpolated inside a longer NaN gap.

I added a numpy-based implementation that searches for NaN gaps wider than max_gap. In line with the current implementations for NaN handling in series.interpolate, a set of NaN indices that have to be preserved is generated. This set is used at the end, after a full interpolation of all NaNs is done, to restore the NaN gaps that shall not be interpolated.
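The gap search described above can be sketched as follows. This is only an illustration, not the code from this PR: the helper name `nan_gaps_wider_than` is hypothetical, and it treats leading and trailing NaN runs like any other gap, which may differ from what the PR does.

```python
import numpy as np


def nan_gaps_wider_than(values, max_gap):
    """Return a boolean mask marking NaNs in runs longer than max_gap.

    Hypothetical helper for illustration; not part of pandas.
    """
    isnan = np.isnan(np.asarray(values, dtype=float))
    # Pad with False so every NaN run has a well-defined start and end.
    padded = np.concatenate(([False], isnan, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first NaN of each run
    stops = np.flatnonzero(edges == -1)   # one past the last NaN of each run
    preserve = np.zeros(isnan.shape, dtype=bool)
    for start, stop in zip(starts, stops):
        if stop - start > max_gap:
            preserve[start:stop] = True
    return preserve


values = [np.nan, 1., np.nan, 2., np.nan, np.nan,
          5., np.nan, np.nan, np.nan, -1., np.nan, np.nan]
mask = nan_gaps_wider_than(values, max_gap=2)
print(np.flatnonzero(mask).tolist())  # [7, 8, 9]
```

After a full interpolation, setting the masked positions back to NaN restores the gaps that should stay empty.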

Tests and documentation were also added.

It will need some small PEP8 cleanup and maybe tests using interpolation methods other than linear (edit: Done). But before I continue, I would like to get feedback on whether my approach is okay in general.

This PR might also be extended to close #16457 which is on interpolation directly after resampling.

Example usage:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([np.nan, 1., np.nan, 2., np.nan, np.nan,
   ...:                5., np.nan, np.nan, np.nan, -1., np.nan, np.nan])

# Using the new `max_gap` kwarg
In [4]: s.interpolate(max_gap=2, limit_area='inside')
Out[4]:
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     NaN
8     NaN
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64

# Compare to the result when using the existing `limit` kwarg
In [5]: s.interpolate(limit=2, limit_area='inside')
Out[5]:
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     3.5
8     2.0
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64
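For readers on a pandas version without `max_gap`, the behavior shown in Out[4] can be approximated with existing methods. The sketch below is an assumption-laden workaround, not the implementation from this PR; the function name `interpolate_max_gap` is hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_max_gap(s, max_gap, **kwargs):
    """Interpolate, then restore NaN runs longer than max_gap.

    Hypothetical workaround for illustration; not part of pandas.
    """
    isna = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = isna.astype(int).diff().ne(0).cumsum()
    # Length of the run each element belongs to.
    run_len = run_id.map(run_id.value_counts())
    too_wide = isna & (run_len > max_gap)
    # Interpolate everything first, then put the wide gaps back.
    return s.interpolate(**kwargs).mask(too_wide)


s = pd.Series([np.nan, 1., np.nan, 2., np.nan, np.nan,
               5., np.nan, np.nan, np.nan, -1., np.nan, np.nan])
result = interpolate_max_gap(s, 2, limit_area='inside')
```

For the example series this yields the same values as Out[4]: the three-wide gap at positions 7 to 9 stays NaN, while the narrower inner gaps are filled.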

Timing:

In [6]: %timeit s.interpolate(max_gap=2, limit_area='inside')
708 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit s.interpolate(limit=2, limit_area='inside')
631 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The relative speed difference is similar for larger Series.

Added numpy-based implementation that searches for NaN-gaps wider than `maxgap`. In line with the current implementations for NaN handling in `series.interpolate`, a set of NaN-indices that have to be preserved is generated. Tests and documentation were also added.

pep8speaks · 2019-02-04T14:41:47Z

Hello @cchwala! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-20 14:00:08 UTC

codecov · 2019-02-04T15:17:38Z

Codecov Report

Merging #25141 into master will decrease coverage by 49.52%.
The diff coverage is 3.57%.

@@             Coverage Diff             @@
##           master   #25141       +/-   ##
===========================================
- Coverage   92.37%   42.84%   -49.53%     
===========================================
  Files         166      166               
  Lines       52408    52430       +22     
===========================================
- Hits        48412    22464    -25948     
- Misses       3996    29966    +25970

Flag	Coverage Δ
#multiple	`?`
#single	`42.84% <3.57%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/generic.py	`39.89% <ø> (-56.74%)`	⬇️
pandas/core/missing.py	`15.65% <3.57%> (-76.92%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e38d55...b752602. Read the comment docs.

codecov · 2019-02-04T15:17:41Z

Codecov Report

❗ No coverage uploaded for pull request base (master@3c55e1e). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #25141   +/-   ##
=========================================
  Coverage          ?   41.19%           
=========================================
  Files             ?      178           
  Lines             ?    50799           
  Branches          ?        0           
=========================================
  Hits              ?    20928           
  Misses            ?    29871           
  Partials          ?        0

Flag	Coverage Δ
#single	`41.19% <0%> (?)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c55e1e...28b442c. Read the comment docs.

jreback

can you add a couple of examples to the doc-string so it's easy to see what this is doing, both with and without limit.

cchwala · 2019-02-20T19:45:48Z

@jreback Just a quick comment to let you know that updating this PR is still on my TODO list, but I have not had time to work on it for the last two weeks. Next week looks better, at least judging from now...

jreback · 2019-03-20T02:02:49Z

can you merge master

cchwala · 2019-03-26T22:38:27Z

Resolved the tiny merge conflict. Still no other progress with the PR. Sorry. I am too overloaded till early April.

jreback · 2019-05-07T01:19:05Z

@cchwala if you have time; this is a nice patch. pls update to comments.

jreback · 2019-06-08T20:30:10Z

@pandas-dev/pandas-core if someone wants to take this over the line, pls merge master and update to comments.

cchwala · 2019-06-11T12:27:42Z

@jreback Sorry for the long silence. I will do it today.

cchwala · 2019-06-11T20:57:42Z

There is a problem. For s.interpolate(method='pad', max_gap=2) the max_gap keyword does not seem to have an effect. I added a test which fails in this case.

From a first quick search the cause might lie here:

pandas/pandas/core/internals/blocks.py

Lines 1096 to 1116 in b8ad9da

        if m is not None:
            r = check_int_bool(self, inplace)
            if r is not None:
                return r
            return self._interpolate_with_fill(method=m, axis=axis,
                                               inplace=inplace, limit=limit,
                                               fill_value=fill_value,
                                               coerce=coerce,
                                               downcast=downcast)
        # validate the interp method
        m = missing.clean_interp_method(method, **kwargs)

        r = check_int_bool(self, inplace)
        if r is not None:
            return r
        return self._interpolate(method=m, index=index, values=values,
                                 axis=axis, limit=limit,
                                 limit_direction=limit_direction,
                                 limit_area=limit_area,
                                 fill_value=fill_value, inplace=inplace,
                                 downcast=downcast, **kwargs)

There are two different pathways for interpolate depending on the selected method. For pad, ffill and bfill, missing.interpolate_2d is used, which does not yet support the max_gap option. It also does not seem to recognize the keywords limit_direction and limit_area; they are silently ignored.
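Until the fill pathway honors limit_area, the effect of limit_area='inside' with a forward fill can be emulated by hand. The following is a minimal sketch under that assumption; the helper name `ffill_inside` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def ffill_inside(s):
    """Forward-fill only NaNs surrounded by valid values on both sides.

    Hypothetical helper for illustration; not part of pandas.
    """
    # A NaN is "inside" if a valid value exists both before and after it.
    # After a backward fill, only trailing NaNs are still NaN, so this
    # condition excludes them; leading NaNs are never touched by ffill.
    has_valid_after = s.bfill().notna()
    return s.ffill().where(has_valid_after)


s = pd.Series([np.nan, 1., np.nan, 2., np.nan, np.nan,
               5., np.nan, np.nan, np.nan, -1., np.nan, np.nan])
padded = ffill_inside(s)
```

Here the trailing NaNs at positions 11 and 12 stay NaN, whereas a plain `s.ffill()` would fill them with -1.0.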

WillAyd · 2019-08-28T16:37:44Z

@cchwala is this still active? Can you merge master

cchwala · 2019-08-31T20:49:23Z

I resolved the merge conflict. However, I still have not found the time to work further on this PR, in particular because it somehow is blocked by #26796. It remains somewhere in the middle of my TODO list...

cchwala · 2019-09-18T20:22:13Z

FYI, I am offline till 7th of October but plan to continue with this PR afterwards

WillAyd · 2019-11-07T21:04:36Z

@cchwala is this still active?

cchwala · 2019-11-13T12:47:52Z

@WillAyd Not really active... But I resolved the merge conflicts on my machine.

Unfortunately I am struggling with ~~#29330~~ a conda problem on my machine, even in a fresh miniconda3 install, so I cannot run the tests. I will push, when I have verified that my merge makes sense.

cchwala · 2019-11-20T15:07:07Z

@TomAugspurger @WillAyd This is now finally ready for another review.

TomAugspurger

There's a lot going on here, sorry only got through part of it at a glance.

Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

TomAugspurger · 2019-11-20T17:01:33Z

pandas/core/generic.py

+        if (method == "pad") or (method == "ffill"):
+            if (limit_direction == "backward") or (limit_direction == "both"):
+                raise ValueError(
+                    "`limit_direction` must not be `%s` for method `%s`"


You can use f-strings for this and the one on L 7140
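As a rough illustration of this suggestion, the check could be written with an f-string as below. The wrapper function `validate_limit_direction` is hypothetical and exists only to make the snippet self-contained; the message text is taken from the diff above.

```python
def validate_limit_direction(method, limit_direction):
    """Reject backward filling for forward-only methods.

    Hypothetical standalone wrapper; in the PR this check lives inline
    in pandas/core/generic.py.
    """
    if method in ("pad", "ffill") and limit_direction in ("backward", "both"):
        raise ValueError(
            f"`limit_direction` must not be `{limit_direction}` "
            f"for method `{method}`"
        )


validate_limit_direction("pad", "forward")  # passes silently
```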

TomAugspurger · 2019-11-20T17:05:10Z

pandas/core/missing.py

+
+    if max_gap is not None:
+
+        def bfill_nan(arr):


What's the benefit to making this a separate closure?

There is no real reason. Maybe at some point I thought the function definition would make things clearer.

Should I just put the content of the function in-line starting at L350?

TomAugspurger · 2019-11-20T17:05:47Z

pandas/core/missing.py

+        # convert float back to datetime64
+        values = values.astype(orig_values.dtype)
+
+    # if np.issubdtype(values.dtype, np.datetime64):


Any reason this is commented out?

I just forgot to remove it. This stems from when I tried to manually choose the correct fill_value. But getting the correct fill_value is handled here.

cchwala · 2019-11-20T19:49:00Z

> Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

To make max_gap work with pad I had to introduce missing.interpolate_1d_fill, because the existing missing.interpolate_2d, which pad has always used, is not (easily) extendable. The wrong behavior of limit_area when using pad had the same cause, so it was solved more or less as a byproduct.

If we wanted to split this PR up, most of its changes would probably move to the one that solves the issue with pad and limit_area. So yes, this one would be much smaller, but the other one would be equally large.

WillAyd · 2020-01-03T02:49:52Z

@cchwala can you merge master and fix up the CI error for code checks?

cchwala · 2020-01-15T17:11:18Z

@WillAyd I separated large parts of the changes into #31048 and will continue here afterwards.

WillAyd · 2020-07-29T20:45:01Z

Closing to clean queue - I think can revisit after #31048

minor pep8 fixes

b752602

cchwala changed the title ~~Added maxgap keyword for series.interpolate~~ ENH: Added maxgap keyword for series.interpolate Feb 4, 2019

fixed parameter order

839b11a

TomAugspurger reviewed Feb 4, 2019

pandas/core/generic.py (outdated, resolved)

pandas/core/missing.py (outdated, resolved)

jreback requested changes Feb 6, 2019

pandas/core/generic.py (outdated, resolved)

pandas/core/generic.py (outdated, resolved)

pandas/core/missing.py (outdated, resolved)

pandas/tests/series/test_missing.py (outdated, resolved)

jreback added Missing-data API Design labels Feb 6, 2019

Merge remote-tracking branch 'upstream/master' into interpolate_maxgap

fcdc4e4

Merge remote-tracking branch 'upstream/master' into interpolate_maxgap

20b70b7

cchwala added 5 commits June 11, 2019 17:16

Changed parameter name from maxgap to max_gap

3cb371e

Moved code to derive indices of "NaNs to preserve" in separate function

8c6ff7a

Tests for errors extended and moved to own function

4aaf8dc

added blank lines in docstring as requested

1f0406f

Added test which fails for method='pad'

eaacefd

For method='pad' the `max_gap` keyword does not seem to have an effect.

cchwala mentioned this pull request Jun 11, 2019

limit_area and limit_direction do not have an effect when interpolation method is 'pad' #26796

Closed

cchwala added 2 commits August 30, 2019 15:54

Merge remote-tracking branch 'upstream/master' into interpolate_maxgap

f274d16

manually add black code formating

c72acdb

shoyer changed the title ~~ENH: Added maxgap keyword for series.interpolate~~ ENH: Added max_gap keyword for series.interpolate Sep 19, 2019

Merge remote-tracking branch 'upstream/master' into interpolate_maxgap

5128b9d

# Conflicts: # pandas/core/internals/blocks.py # pandas/core/missing.py

cchwala added 14 commits November 19, 2019 20:23

Additional required adjustments after merge with upstream/master

3c55e1e

Merge remote-tracking branch 'upstream/master' into interpolate_maxgap

f9e4044

Removed test for bug with pad which should be solved in a separate PR

d1bbcd6

removed trailing whitespaces

21b3091

fixed formating for black and flake8

c96c604

updated docstring for interpolat with max_gap

bd84fc9

* added example * optimized existing text

added max_gap info and example to documentation

908ffe5

added info to whatsnew file

380ef7c

flake8

5a1718a

update docs with info on limit_direction and method pad

16755bd

better test for pandas-dev#26796

b58d721

typo, black, flake8

aa58ffa

update to doc

ae16124

fix wrong behavior when combining max_gap and limit_direction

28b442c

* limit_direction was not considered before when max_gap was provided * test have been adjusted for the new correct behavior and additional ones have been added

TomAugspurger reviewed Nov 20, 2019

cchwala mentioned this pull request Jan 15, 2020

FIX: fix interpolate with kwarg limit area and limit direction using pad or bfill #31048

Closed

5 tasks

simonjayhawkins mentioned this pull request Jun 14, 2020

PERF: remove use of Python sets for interpolate #34727

Closed

WillAyd closed this Jul 29, 2020

rhkarls mentioned this pull request Sep 15, 2020

BUG: interpolate with limit keyword partially fills gaps larger than limit #36352

Open

3 tasks

joAschauer mentioned this pull request Jul 2, 2021

ENH: DataFrame.interpolate limit to support all-or-none filling #42291

Open
