Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added max_gap keyword for series.interpolate #25141

Closed
wants to merge 36 commits into from

Conversation

cchwala
Copy link
Contributor

@cchwala cchwala commented Feb 4, 2019

This PR introduces the new keyword max_gap for interpolate. For all NaN-gaps which are wider than max_gap no interpolation is carried out and all NaNs of the gap are preserved. This is in contrast to using the limit kwarg which does not prevent interpolating values in a longer NaN gap.

I added numpy-based implementation that searches for NaN-gaps wider than max_gap. In line with the current implementations for NaN handling in series.interpolate, a set of NaN-indices that has
to be preserved is generated. This is used in the end, after a full interpolation of all NaN is done, to restore the NaNs gaps that shall not be interpolated.

Test and documentation were also added.

It will need some small PEP8-cleanup and maybe tests using other interpolation methods then linear (edit: Done). But before I continue, I would like to get feedback if my approach is in general okay.

This PR might also be extended to close #16457 which is on interpolation directly after resampling.

Example usage:

In [1]: import numpy as np                                                                                                                                                                                                         

In [2]: import pandas as pd                                                                                                                                                                                                        

In [3]: s = pd.Series([np.nan, 1., np.nan, 2., np.nan, np.nan,  
   ...:                5., np.nan, np.nan, np.nan, -1., np.nan, np.nan])                                                                                                                                                           

# Using the new `max_gap` kwarg
In [4]: s.interpolate(max_gap=2, limit_area='inside')                                                                                                                                                                               
Out[4]: 
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     NaN
8     NaN
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64

# Compare to the result when using the existing `limit` kwarg
In [5]: s.interpolate(limit=2, limit_area='inside')                                                                                                                                                                                
Out[5]: 
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     3.5
8     2.0
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64

Timing:

In [6]: %timeit s.interpolate(max_gap=2, limit_area='inside')                                                                                                                                                                       
708 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit s.interpolate(limit=2, limit_area='inside')                                                                                                                                                                        
631 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The relative speed difference is similar for larger Series.

Added numpy-based implementation that searchs for NaN-gaps wider
than `maxgap`. In line with the current implementations for
NaN handling in `series.interpolate`, a set of NaN-indices that has
to be preserved is generated.

Test and documentation were also added.
@pep8speaks
Copy link

pep8speaks commented Feb 4, 2019

Hello @cchwala! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-20 14:00:08 UTC

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

Merging #25141 into master will decrease coverage by 49.52%.
The diff coverage is 3.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25141       +/-   ##
===========================================
- Coverage   92.37%   42.84%   -49.53%     
===========================================
  Files         166      166               
  Lines       52408    52430       +22     
===========================================
- Hits        48412    22464    -25948     
- Misses       3996    29966    +25970
Flag Coverage Δ
#multiple ?
#single 42.84% <3.57%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 39.89% <ø> (-56.74%) ⬇️
pandas/core/missing.py 15.65% <3.57%> (-76.92%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e38d55...b752602. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@3c55e1e). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #25141   +/-   ##
=========================================
  Coverage          ?   41.19%           
=========================================
  Files             ?      178           
  Lines             ?    50799           
  Branches          ?        0           
=========================================
  Hits              ?    20928           
  Misses            ?    29871           
  Partials          ?        0
Flag Coverage Δ
#single 41.19% <0%> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c55e1e...28b442c. Read the comment docs.

@cchwala cchwala changed the title Added maxgap keyword for series.interpolate ENH: Added maxgap keyword for series.interpolate Feb 4, 2019
pandas/core/generic.py Outdated Show resolved Hide resolved
pandas/core/missing.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a couple of examples to the doc-string so its easy to see what this is doing, include using limit and not.

pandas/core/generic.py Outdated Show resolved Hide resolved
pandas/core/generic.py Outdated Show resolved Hide resolved
pandas/core/missing.py Outdated Show resolved Hide resolved
pandas/tests/series/test_missing.py Outdated Show resolved Hide resolved
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Feb 6, 2019
@cchwala
Copy link
Contributor Author

cchwala commented Feb 20, 2019

@jreback Just a quick comment to let you know that updating this PR is still on my TODO list, but I have not had time to work on it for the last two weeks. Next week looks better, at least judging from now...

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

can you merge master

@cchwala
Copy link
Contributor Author

cchwala commented Mar 26, 2019

Resolved the tiny merge conflict. Still no other progress with the PR. Sorry. I am too overloaded till early April.

@jreback
Copy link
Contributor

jreback commented May 7, 2019

@cchwala if you have time; this is a nice patch. pls update to comments.

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

@pandas-dev/pandas-core if someone wants to take this over the line, pls merge master and update to comments.

@cchwala
Copy link
Contributor Author

cchwala commented Jun 11, 2019

@jreback Sorry for the long silence. I will do it today.

@cchwala
Copy link
Contributor Author

cchwala commented Jun 11, 2019

There is a problem. For s.interpolate(method='pad', max_gap=2) the max_gap keyword does not seem to have an effect. I added a test which fails in this case.

From a first quick search the cause might lie here:

if m is not None:
r = check_int_bool(self, inplace)
if r is not None:
return r
return self._interpolate_with_fill(method=m, axis=axis,
inplace=inplace, limit=limit,
fill_value=fill_value,
coerce=coerce,
downcast=downcast)
# validate the interp method
m = missing.clean_interp_method(method, **kwargs)
r = check_int_bool(self, inplace)
if r is not None:
return r
return self._interpolate(method=m, index=index, values=values,
axis=axis, limit=limit,
limit_direction=limit_direction,
limit_area=limit_area,
fill_value=fill_value, inplace=inplace,
downcast=downcast, **kwargs)

There are two different pathways for interpolate depending on the selected method. For pad, ffill and bfill missing.interpolate_2d is used which does not yet support the max_gap option. It also does not seem to recognize the keywords limit_direction and limit_area, they are silently ignored.

@WillAyd
Copy link
Member

WillAyd commented Aug 28, 2019

@cchwala is this still active? Can you merge master

@cchwala
Copy link
Contributor Author

cchwala commented Aug 31, 2019

I resolved the merge conflict. However, I still have not found the time to work further on this PR, in particular because it somehow is blocked by #26796. It remains somewhere in the middle of my TODO list...

@cchwala
Copy link
Contributor Author

cchwala commented Sep 18, 2019

FYI, I am offline till 7th of October but plan to continue with this PR afterwards

@shoyer shoyer changed the title ENH: Added maxgap keyword for series.interpolate ENH: Added max_gap keyword for series.interpolate Sep 19, 2019
@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@cchwala is this still active?

# Conflicts:
#	pandas/core/internals/blocks.py
#	pandas/core/missing.py
@cchwala
Copy link
Contributor Author

cchwala commented Nov 13, 2019

@WillAyd Not really active... But I resolved the merge conflicts on my machine.

Unfortunately I am struggling with #29330 a conda problem on my machine, even in a fresh miniconda3 install, so I cannot run the tests. I will push, when I have verified that my merge makes sense.

@cchwala
Copy link
Contributor Author

cchwala commented Nov 20, 2019

@TomAugspurger @WillAyd This is now finally ready for another review.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot going on here, sorry only got through part of it at a glance.

Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

if (method == "pad") or (method == "ffill"):
if (limit_direction == "backward") or (limit_direction == "both"):
raise ValueError(
"`limit_direction` must not be `%s` for method `%s`"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use f-strings for this and the one on L 7140

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay


if max_gap is not None:

def bfill_nan(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit to making this a separate closure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real reason. Maybe at some point I thought the function definition would make things clearer.

Should I just put the content of the function in-line starting at L350?

# convert float back to datetime64
values = values.astype(orig_values.dtype)

# if np.issubdtype(values.dtype, np.datetime64):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just forgot to remove it. This stems from when I tried to manually chose the correct fill_value. But getting the correct fill_value is handled here

@cchwala
Copy link
Contributor Author

cchwala commented Nov 20, 2019

Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

For making max_gap work with pad I had to introduce missing.interpolate_1d_fill because the existingmissing.interpolate_2d, which pad did always use, is not (easily) extendable. The cause for the wrong behavior of limit_area when using pad was the same, so this was solved somehow as a byproduct.

If we would want to split this PR up, probably most changes of this PR would move to the one to solve the issue with pad and limit_area. So yes, this one would be much smaller, but the other one would be equally large.

@WillAyd
Copy link
Member

WillAyd commented Jan 3, 2020

@cchwala can you merge master and fix up the CI error for code checks?

@cchwala
Copy link
Contributor Author

cchwala commented Jan 15, 2020

@WillAyd I separated large parts of the changes into #31048 and will continue here afterwards.

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

Closing to clean queue - I think can revisit after #31048

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
6 participants