Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: remove use of Python sets for interpolate #34727

Conversation

simonjayhawkins
Copy link
Member

have been investigating avoiding internals for interpolation. xref #34628

This PR address the issue of using Python sets, which is responsible for the bulk of the time in our current asv.

There are other improvements (will raise other PRs), but would need new benchmarks to show a benefit such as different index types and unsorted indexes. so this PR is targeting our current benchmark first.

prelim results (will post asv results if tests pass)

N = 10000
# this is the worst case, where every column has NaNs.
df = pd.DataFrame(np.random.randn(N, 100))
df.values[::2] = np.nan
df
%timeit df.interpolate()
# 189 ms ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 65.1 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR

@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jun 12, 2020
@simonjayhawkins
Copy link
Member Author

some environments are failing with TypeError: diff() got an unexpected keyword argument 'prepend'

else:
# both directions... just use _interp_limit
preserve_nans = set(_interp_limit(invalid, limit, limit))
nans_to_interpolate = _interp_limit(invalid, limit, limit, first, last)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could move the above logic into _interp_limit and fastpath if limit_area.

although the algorithm could probably be extended to 2d, in which case the limit_area logic below would also need to move.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although the algorithm could probably be extended to 2d, in which case the limit_area logic below would also need to move.

i don't think a 2d version will be needed as interpolate_2d will also be applied along axis when max_gap is added.

we could move the above logic into _interp_limit and fastpath if limit_area.

probably best as a follow-on to keep the diff here smaller

strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def inner(arr, limit):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will move this into a top level function called ffill_mask_with_limit (in this PR)

and plan to take the max_gap logic from #25141 and have a analogous ffill_mask_with_max_gap (separate PR)

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

looks fine.

@jreback
Copy link
Contributor

jreback commented Jun 16, 2020

do we have sufficient asv's for this?

@simonjayhawkins
Copy link
Member Author

will get back to this after 1.0.5 we don't have many asvs so could probably add more The current asv is a single float block with a sorted RangeIndex and no limit (which can be fastpathed) so not representative of potentially much slower cases

@simonjayhawkins
Copy link
Member Author

I'll close this for now. still don't have access to desktop and running asvs on laptop is painful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants