ENH: Added max_gap keyword for series.interpolate #25141
Added a numpy-based implementation that searches for NaN-gaps wider than `max_gap`. In line with the current implementations for NaN handling in `series.interpolate`, a set of NaN-indices that has to be preserved is generated. Tests and documentation were also added.
Codecov Report
@@ Coverage Diff @@
## master #25141 +/- ##
===========================================
- Coverage 92.37% 42.84% -49.53%
===========================================
Files 166 166
Lines 52408 52430 +22
===========================================
- Hits 48412 22464 -25948
- Misses 3996 29966 +25970
|
Codecov Report
@@ Coverage Diff @@
## master #25141 +/- ##
=========================================
Coverage ? 41.19%
=========================================
Files ? 178
Lines ? 50799
Branches ? 0
=========================================
Hits ? 20928
Misses ? 29871
Partials ? 0
|
Can you add a couple of examples to the doc-string so it's easy to see what this is doing, including with and without `limit`?
@jreback Just a quick comment to let you know that updating this PR is still on my TODO list, but I have not had time to work on it for the last two weeks. Next week looks better, at least judging from now... |
can you merge master |
Resolved the tiny merge conflict. Still no other progress with the PR. Sorry. I am too overloaded till early April. |
@cchwala if you have time; this is a nice patch. pls update to comments. |
@pandas-dev/pandas-core if someone wants to take this over the line, pls merge master and update to comments. |
@jreback Sorry for the long silence. I will do it today. |
For method='pad' the `max_gap` keyword does not seem to have an effect.
There is a problem. From a first quick search, the cause might lie here: pandas/pandas/core/internals/blocks.py, lines 1096 to 1116 in b8ad9da
There are two different pathways for `interpolate` depending on the selected `method`. For `pad`, `ffill` and `bfill`, `missing.interpolate_2d` is used, which does not yet support the `max_gap` option. It also does not seem to recognize the keywords `limit_direction` and `limit_area`; they are silently ignored.
|
@cchwala is this still active? Can you merge master |
I resolved the merge conflict. However, I still have not found the time to work further on this PR, in particular because it somehow is blocked by #26796. It remains somewhere in the middle of my TODO list... |
FYI, I am offline till 7th of October but plan to continue with this PR afterwards |
@cchwala is this still active? |
# Conflicts: # pandas/core/internals/blocks.py # pandas/core/missing.py
* added example * optimized existing text
* limit_direction was not considered before when max_gap was provided * test have been adjusted for the new correct behavior and additional ones have been added
@TomAugspurger @WillAyd This is now finally ready for another review. |
There's a lot going on here, sorry only got through part of it at a glance.
Is there any chance that the changes to allow `limit_area` and `limit_direction` in `Series.interpolate` with `pad` can be split into its own PR? Would that make this one much smaller?
if (method == "pad") or (method == "ffill"):
    if (limit_direction == "backward") or (limit_direction == "both"):
        raise ValueError(
            "`limit_direction` must not be `%s` for method `%s`"
You can use f-strings for this and the one on L 7140
okay
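A minimal sketch of the f-string form suggested above. The wrapper function `check_limit_direction` is a hypothetical name introduced only for illustration; the condition and the message text are taken from the diff snippet, and nothing here claims to be the code as it appeared in the PR.

```python
def check_limit_direction(method: str, limit_direction: str) -> None:
    """Hypothetical helper: reject backward filling for forward-only methods."""
    if method in ("pad", "ffill") and limit_direction in ("backward", "both"):
        # f-string replaces the %-style formatting shown in the diff
        raise ValueError(
            f"`limit_direction` must not be `{limit_direction}` "
            f"for method `{method}`"
        )
```

Calling `check_limit_direction("pad", "both")` raises, while `check_limit_direction("pad", "forward")` returns silently.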
if max_gap is not None:

    def bfill_nan(arr):
What's the benefit to making this a separate closure?
There is no real reason. Maybe at some point I thought the function definition would make things clearer.
Should I just put the content of the function in-line starting at L350?
# convert float back to datetime64
values = values.astype(orig_values.dtype)

# if np.issubdtype(values.dtype, np.datetime64):
Any reason this is commented out?
I just forgot to remove it. This stems from when I tried to manually choose the correct `fill_value`, but getting the correct `fill_value` is handled here
For making If we would want to split this PR up, probably most changes of this PR would move to the one to solve the issue with |
@cchwala can you merge master and fix up the CI error for code checks? |
Closing to clean queue - I think can revisit after #31048 |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR introduces the new keyword `max_gap` for `interpolate`. For all NaN-gaps which are wider than `max_gap`, no interpolation is carried out and all NaNs of the gap are preserved. This is in contrast to using the `limit` kwarg, which does not prevent interpolating values in a longer NaN gap.
I added a numpy-based implementation that searches for NaN-gaps wider than `max_gap`. In line with the current implementations for NaN handling in `series.interpolate`, a set of NaN-indices that has to be preserved is generated. This is used in the end, after a full interpolation of all NaNs is done, to restore the NaN gaps that shall not be interpolated.
Tests and documentation were also added.
It will need some small PEP8-cleanup and maybe tests using other interpolation methods than `linear` (edit: Done). But before I continue, I would like to get feedback on whether my approach is in general okay.
This PR might also be extended to close #16457, which is on interpolation directly after resampling.
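The approach described above (interpolate everything, then restore the NaNs of overly wide gaps) can be sketched with existing pandas and numpy calls. This is an illustration only, not the PR's implementation: `interpolate_with_max_gap` is a hypothetical helper, and it assumes that gap width is measured as the number of consecutive NaNs.

```python
import numpy as np
import pandas as pd


def interpolate_with_max_gap(s: pd.Series, max_gap: int, **kwargs) -> pd.Series:
    """Hypothetical helper (not a pandas API): interpolate `s`, but keep
    every NaN that belongs to a run of more than `max_gap` NaNs."""
    mask = s.isna().to_numpy()
    if mask.size == 0:
        return s.copy()

    # Give each run of consecutive equal mask values its own id (1, 2, ...).
    run_id = np.cumsum(np.r_[True, mask[1:] != mask[:-1]])
    # Length of the run that each element belongs to.
    run_len = np.bincount(run_id)[run_id]
    # NaNs sitting in a gap wider than max_gap must be preserved.
    preserve = mask & (run_len > max_gap)

    # Full interpolation first, then put the preserved NaNs back.
    return s.interpolate(**kwargs).mask(preserve)


s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])

# The single-NaN gap is filled; the three-NaN gap is left untouched.
filled = interpolate_with_max_gap(s, max_gap=2)

# By contrast, `limit=2` still fills the first two NaNs of the long gap.
limited = s.interpolate(limit=2)
```

Here `filled` keeps all three NaNs at positions 3 to 5, whereas `limited` fills positions 3 and 4 and leaves only position 5 as NaN, which is the difference between the two keywords that the description points out.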
Example usage:
Timing:
The relative speed difference is similar for larger Series.