PERF: remove use of Python sets for interpolate #34727

simonjayhawkins · 2020-06-12T07:46:31Z

I have been investigating avoiding the internals for interpolation. xref #34628

This PR addresses the use of Python sets, which is responsible for the bulk of the time in our current asv.
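To illustrate the idea, here is a minimal sketch, not the pandas implementation (both function names are hypothetical): a set difference over integer positions can be replaced by the equivalent boolean-mask operation, which stays inside NumPy and avoids building Python sets.

```python
import numpy as np


def positions_via_sets(invalid, keep):
    """Hypothetical set-based variant: NaN positions minus positions to preserve."""
    all_nans = set(np.flatnonzero(invalid))
    preserve = set(np.flatnonzero(keep))
    return sorted(all_nans - preserve)


def positions_via_mask(invalid, keep):
    """Hypothetical mask-based variant: same positions, boolean arrays only."""
    return np.flatnonzero(invalid & ~keep)


# invalid marks NaN positions; keep marks the NaNs that must not be filled
invalid = np.array([True, True, False, True, True, False])
keep = np.array([True, False, False, False, True, False])

from_sets = positions_via_sets(invalid, keep)
from_mask = positions_via_mask(invalid, keep)
```

Both variants return the same positions; the mask version does the work in a single vectorised expression.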

There are other improvements (I will raise separate PRs for them), but they would need new benchmarks to show a benefit, e.g. for different index types and unsorted indexes. So this PR targets our current benchmark first.

Preliminary results (will post asv results if tests pass):

import numpy as np
import pandas as pd

N = 10000
# this is the worst case, where every column has NaNs.
df = pd.DataFrame(np.random.randn(N, 100))
df.values[::2] = np.nan  # set every other row to NaN
%timeit df.interpolate()
# 189 ms ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 65.1 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR

simonjayhawkins · 2020-06-12T08:23:06Z

some environments are failing with TypeError: diff() got an unexpected keyword argument 'prepend'
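For context, `np.diff` gained the `prepend` and `append` keywords in NumPy 1.16, so older versions raise this error. One way to stay compatible, sketched here as an assumption and not necessarily the change made in this PR, is to concatenate the value manually before differencing (the helper name is hypothetical):

```python
import numpy as np


def diff_with_prepend(arr, value):
    """Same result as np.diff(arr, prepend=value) for 1-D input, without `prepend`."""
    arr = np.asarray(arr)
    # Put `value` in front of the data, then take first differences as usual.
    return np.diff(np.concatenate(([value], arr)))


a = np.array([1, 3, 6])
result = diff_with_prepend(a, 0)
```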

pandas/core/missing.py

simonjayhawkins · 2020-06-12T15:18:21Z

pandas/core/missing.py

    else:
        # both directions... just use _interp_limit
-        preserve_nans = set(_interp_limit(invalid, limit, limit))
+        nans_to_interpolate = _interp_limit(invalid, limit, limit, first, last)


we could move the above logic into _interp_limit and fastpath if limit_area.

although the algorithm could probably be extended to 2d, in which case the limit_area logic below would also need to move.

> although the algorithm could probably be extended to 2d, in which case the limit_area logic below would also need to move.

I don't think a 2d version will be needed, as interpolate_2d will also be applied along axis when max_gap is added.

> we could move the above logic into _interp_limit and fastpath if limit_area.

probably best as a follow-on to keep the diff here smaller


simonjayhawkins · 2020-06-14T18:41:33Z

pandas/core/missing.py

-    strides = a.strides + (a.strides[-1],)
-    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
+
+    def inner(arr, limit):


will move this into a top-level function called ffill_mask_with_limit (in this PR)

and plan to take the max_gap logic from #25141 and have an analogous ffill_mask_with_max_gap (separate PR)
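The body of that function is not shown in this thread. As one plausible reading (an assumption, not the code from this PR), it could return a boolean mask of the invalid positions that lie within `limit` steps after the most recent valid value:

```python
import numpy as np


def ffill_mask_with_limit(invalid, limit):
    """Sketch only: mask of invalid positions at most `limit` steps after a valid one.

    Leading invalid positions, which have no preceding valid value, are excluded.
    """
    invalid = np.asarray(invalid, dtype=bool)
    idx = np.arange(invalid.size)
    # Index of the most recent valid position at or before each element (-1 if none).
    last_valid = np.maximum.accumulate(np.where(invalid, -1, idx))
    distance = idx - last_valid
    return invalid & (last_valid >= 0) & (distance <= limit)


invalid = np.array([True, False, True, True, True, False, True])
mask = ffill_mask_with_limit(invalid, limit=2)
```

The running maximum replaces any per-element Python loop or set membership test, which is the kind of vectorisation this PR is after.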

jreback · 2020-06-14T20:45:50Z

looks fine.

jreback · 2020-06-16T13:08:23Z

do we have sufficient asv's for this?

simonjayhawkins · 2020-06-16T13:44:21Z

Will get back to this after 1.0.5. We don't have many asvs, so we could probably add more. The current asv is a single float block with a sorted RangeIndex and no limit (which can be fastpathed), so it is not representative of potentially much slower cases.
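A possible additional benchmark, as a rough sketch only: the class name and the chosen `limit` values are assumptions, while `params`, `param_names`, `setup` and `time_*` follow the usual asv layout. It parametrises over `limit` so that the non-fastpath cases are timed as well.

```python
import numpy as np
import pandas as pd


class InterpolateWithLimit:
    # asv runs each time_* method once per entry in `params`.
    params = [None, 1, 10]
    param_names = ["limit"]

    def setup(self, limit):
        N = 10000
        values = np.random.randn(N, 100)
        values[::2] = np.nan  # every column has NaNs, as in the existing benchmark
        self.df = pd.DataFrame(values)

    def time_interpolate(self, limit):
        self.df.interpolate(limit=limit)
```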

simonjayhawkins · 2020-07-24T16:22:42Z

I'll close this for now. I still don't have access to a desktop, and running asvs on a laptop is painful.

PERF: remove use of Python sets for interpolate

f0485a1

simonjayhawkins added the Missing-data and Algos labels Jun 12, 2020

simonjayhawkins mentioned this pull request Jun 12, 2020

PERF: interpolate_1d returns function to apply columnwise #34728

Closed

add implementation notes

e707c3d

simonjayhawkins added 3 commits June 12, 2020 17:01

avoid passing first and last to _interp_limit

1702ef5

Merge remote-tracking branch 'upstream/master' into interpolate---no-…

0614598

…sets-perf

update for older numpy

0bcccb7

older numpy

04fc8cb

simonjayhawkins mentioned this pull request Jun 23, 2020

BUG: pandas.DataFrame.interpolate fails with high value of limit argument #34936

Open

3 tasks

simonjayhawkins closed this Jul 24, 2020
