PERF: optimize algos.take for repeated calls #39692

Merged · 24 commits · Mar 5, 2021

Conversation

jorisvandenbossche (Member) commented Feb 9, 2021

This PR optimizes our internal take algorithm, with the following changes:

  • Cache the expensive parts that are independent of the actual array values and depend only on the dtype / dimension / axis / fill_value (which is often the same for repeated calls), i.e. maybe_promote and _get_take_nd_function.
  • The above is done in the generic take_nd function, but in addition I also added a specialized take_1d_array version that assumes the input is already an array and only deals with 1D. This gives a small further speed-up. (A rough sketch of the caching idea is shown right below.)
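For illustration only, a minimal sketch of that caching idea (not the pandas implementation; _take_setup and simple_take_1d are hypothetical names). The dtype/fill_value-dependent setup is computed once per unique signature and then reused across repeated calls:

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def _take_setup(arr_dtype: np.dtype, fill_value) -> np.dtype:
    # Stand-in for the maybe_promote / _get_take_nd_function step: the result
    # depends only on the dtype and fill_value, so it can be cached and reused.
    return np.promote_types(arr_dtype, np.asarray(fill_value).dtype)

def simple_take_1d(arr: np.ndarray, indexer: np.ndarray, fill_value=np.nan) -> np.ndarray:
    # Repeated calls with the same dtype/fill_value hit the cache above and skip
    # the dtype-promotion work; -1 in the indexer marks positions to fill.
    out_dtype = _take_setup(arr.dtype, fill_value)
    out = arr.astype(out_dtype, copy=False).take(indexer)
    out[indexer == -1] = fill_value
    return out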

An example use case where the effect of this optimization shows up nicely is unstack with the ArrayManager. That is typically a case where we do multiple take calls for a single column (to split it into multiple columns), and thus call take many times with the same dtype/fill_value.

Using the example from the reshape.py::SimpleReshape ASV benchmark (a homogeneous-dtype, single-block case, so it is fast with the BlockManager thanks to the _can_fast_transpose fastpath in _unstack_frame):

import numpy as np
import pandas as pd

arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(10000, 4), index=index)
df_am = df._as_manager("array")

Master:

In [2]: %timeit df.unstack()
1.79 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit df_am.unstack()
15.1 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

PR:

In [2]: %timeit df.unstack()
1.81 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_am.unstack()
3.44 ms ± 69.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for the BlockManager this doesn't matter (it does only a single call to take_nd for the column values), but for the ArrayManager version this gives a nice speedup and brings it within 2x of the BlockManager version (for this benchmark).

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Feb 9, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 9, 2021
else:
# check for promotion based on types only (do this first because
# it's faster than computing a mask)
dtype, fill_value = maybe_promote_cached(arr.dtype, fill_value)
jorisvandenbossche (Member Author), Feb 9, 2021:
This function is also copied verbatim out of take_nd (to be reused); the only actual code change is to use _maybe_promote_cached here instead of maybe_promote.

jorisvandenbossche (Member Author):

As an example benchmark of unstack that doesn't use the 1-block fastpath (from reshape.py::Unstack::time_full_product[category]):

import string

import numpy as np
import pandas as pd

m = 100
n = 50

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
indices = np.random.randint(0, 52, size=(m * m, n))
values = np.take(list(string.ascii_letters), indices)
values = [pd.Categorical(v) for v in values.T]
df = pd.DataFrame(values, index, columns)
df_am = df._as_manager("array")

Master:

In [2]: %timeit df.unstack()
374 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df_am.unstack()
306 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR:

In [2]: %timeit df.unstack()
309 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df_am.unstack()
177 ms ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So here it improves for BlockManager as well (and in this case ArrayManager is faster, because it has less overhead creating all the blocks for each column)

(specifically for this case of Categorical data, there is a lot of room for improvement by making the categorical constructor faster for such cases where less validation is needed, but that's another topic)

pandas/core/algorithms.py (review thread, outdated and resolved)
@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
"""
Part of _get_take_nd_function below that doesn't need the mask
Contributor:
This comment doesn't make sense without the PR context. Can you put / move a doc-string here? Typing a +1.

jorisvandenbossche (Member Author):
The "and thus can be cached" on the next line is the essential continuation of the sentence.
The mask can be an array, and thus is not hashable and thus cannot be used as argument for a cached function.
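(Illustration, not part of the PR: a tiny demo of why an ndarray-valued argument cannot go through functools.lru_cache; cached_setup is a hypothetical stand-in.)

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def cached_setup(mask_info):
    return mask_info

mask = np.array([True, False, True])
cached_setup((mask, False))
# TypeError: unhashable type: 'numpy.ndarray'
# lru_cache has to hash its arguments to build the cache key, so the mask
# (and anything containing it) must stay outside the cached function.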


return None


Contributor:
Why is the caching not on this function? Having too many levels of indirection is a -1.

jorisvandenbossche (Member Author):
I will clarify the comment above; the mask_info argument to this function is not hashable.

jreback (Contributor) commented Feb 10, 2021

@jorisvandenbossche a general comment: targeted PRs are much better for this type of refactoring. Sure, occasionally bigger ones are needed, but please try to make small incremental changes (I know you probably feel this is a small incremental change, but it's not). Meaning: break things into multiple PRs where you first move things, then change them. This may seem like more work on your part, and it might be a bit more, but it will make reviews faster, with less back and forth and arguing.

jorisvandenbossche (Member Author):

I think this is indeed a rather targeted change. A large part of the diff is caused by copying parts of a function into helper functions so they can be reused (and I commented on those parts to indicate where this happened).

Sure, I can do the move before this PR.

jbrockmendel (Member):

@jorisvandenbossche the timings in the OP look like this makes the non-ArrayManager code very slightly slower (well within the margin of error) for this particular use case. Are there other use cases where this improves perf?

jreback (Contributor) commented Feb 10, 2021

Yeah, please remove the alias, that would help.

jorisvandenbossche (Member Author):

-> precursor PR that just moves code: #39728

jorisvandenbossche (Member Author):

the timings in the OP look like this makes the non-ArrayManager code very slightly slower (well within margin of error) for this particular use case. are there other use cases where this improves perf?

It's not slower, just small variation in timing for something that didn't change noticeably. At least, the precision of the timing is not good enough to tell whether it's a tiny bit slower or a tiny bit faster. I could have repeated it to get the difference the other way around ;)

I didn't check where this would noticeably improve performance for the BlockManager. In principle, any case where the performance is dominated by the overhead of take (so a case where take is called many times), but I don't know off the top of my head what such a case could be.

I suppose you could measure the benefit in a single take call (e.g. reindex) as well.
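(For illustration, not from the thread: one way to exercise that "many take calls with the same dtype/fill_value" pattern is a loop over the public pandas.api.extensions.take wrapper; the array sizes and repeat count below are arbitrary.)

import timeit

import numpy as np
from pandas.api.extensions import take

arr = np.arange(1_000, dtype="float64")
indexer = np.random.randint(-1, 1_000, size=200)  # -1 marks positions to fill

# Every call has the same dtype/fill_value signature, so the cached
# maybe_promote / _get_take_nd_function lookups are reused on each iteration.
elapsed = timeit.timeit(
    lambda: take(arr, indexer, allow_fill=True, fill_value=np.nan),
    number=10_000,
)
print(f"{elapsed:.3f}s for 10,000 take calls")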

jbrockmendel (Member):

can you add a comment along the lines of # TODO(AM): this can be removed if we dont end up using ArrayManager

jorisvandenbossche (Member Author):

can you add a comment along the lines of # TODO(AM): this can be removed if we dont end up using ArrayManager

Where do you want me to add that comment? Specifically on the specialized take_1d_array?
I think the caching is probably useful in general as well.

jorisvandenbossche (Member Author) commented Mar 2, 2021

Updated (and also removed the take_1d specialization for now, to focus on the caching, since that both gives the most significant improvement and generates the most discussion).

2. Since the hashability check can go at the top of maybe_promote, we can then refactor the rest of maybe_promote into the cached _maybe_promote.

OK, so that version looks more or less like this (@jbrockmendel did I understand that correctly?)

import functools

import numpy as np
from pandas.api.types import is_hashable, is_object_dtype

def maybe_promote_with_cache_version1(dtype: np.dtype, fill_value=np.nan):
    if not is_hashable(fill_value):
        if not is_object_dtype(dtype):
            raise ValueError("fill_value must be a scalar")
        return dtype, fill_value
    return _maybe_promote_cached(dtype, fill_value, type(fill_value))

@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    ... current implementation of maybe_promote ...

while the version I have in this PR basically looks like:

def maybe_promote_with_cache_version2(dtype, fill_value):
    try:
        return _maybe_promote_cached(dtype, fill_value, type(fill_value))
    except TypeError:
        return _maybe_promote(dtype, fill_value) 

@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    return _maybe_promote(dtype, fill_value)

def _maybe_promote(dtype, fill_value):
    ... current implementation of maybe_promote ...

(and to be clear, maybe_promote_with_cache_version1/2 would then become the actual maybe_promote that is used elsewhere in pandas; I'm just using those names here to distinguish and test/time them next to each other)

Timing those options:

In [2]: dtype = np.dtype(float)
   ...: fill_value = np.nan

In [3]: %timeit maybe_promote(dtype, fill_value)   # <--- the current, non-cached version
2.36 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %timeit maybe_promote_with_cache_version1(dtype, fill_value)
303 ns ± 3.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit maybe_promote_with_cache_version2(dtype, fill_value)
201 ns ± 3.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So the version with try/except is clearly faster than the first version with caching. But of course both are in the nanoseconds range, so in absolute numbers the difference is not that big.

For the example in the top post, maybe_promote gets called 400 times, which in this case means 400 × 100 ns = 40 µs gained by using the second version, while the full unstack operation takes around 3 ms. That is of course only a small percentage.

Personally I don't think the two versions are that different in terms of complexity. But if there is a strong preference for the slightly slower version 1 over version 2, I am fine with either (the speed-up on the non-microbenchmark is not very significant).

jreback (Contributor) commented Mar 2, 2021

@jorisvandenbossche ok to go with your current version; I would just add a comment explaining the try/except (e.g. that non-hashables will raise; maybe you did this already).

jbrockmendel (Member):

I'll take another look this afternoon.

jbrockmendel (Member):

AFAICT there isn't any functional difference between maybe_promote_with_cache_version1 and maybe_promote_with_cache_version2, so let's go with the faster one. Can you move it to dtypes.cast and we'll get this done.

jreback (Contributor) left a review comment:

Looks fine. Can you merge master to make sure this is green?

@@ -177,41 +178,60 @@ def take_2d_multi(
return out


@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
Member:
Does this need to be a dunder?

jorisvandenbossche (Member Author):
Nope, changed

Comment on lines +572 to +575
# TODO(2.0): need to directly use the non-cached version as long as we
# possibly raise a deprecation warning for datetime dtype
if dtype.kind == "M":
return _maybe_promote(dtype, fill_value)
jorisvandenbossche (Member Author):
This is a bit unfortunate, but to ensure the warning is always shown, we can't use the cached version for datetime data.

I checked what would be the fastest option. The most specific check would be if isinstance(fill_value, date) and not isinstance(fill_value, datetime), but if dtype.kind == "M" is a bit faster.
So the trade-off was between being faster for all non-M8 dtypes vs being faster for M8 (by being able to use the cached version in most cases) but a bit slower for all other dtypes. I went with the first (fastest for numeric dtypes).
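(Illustration, not from the PR: a sketch of how one could compare the cost of the two candidate checks; the specific numbers will depend on the machine and on the fill_value.)

import timeit
from datetime import date, datetime

import numpy as np

dtype = np.dtype("float64")
fill_value = np.nan

# Cost of the dtype-kind check vs the more specific isinstance-based check
t_kind = timeit.timeit(lambda: dtype.kind == "M", number=1_000_000)
t_isinstance = timeit.timeit(
    lambda: isinstance(fill_value, date) and not isinstance(fill_value, datetime),
    number=1_000_000,
)
print(t_kind, t_isinstance)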

Member:

How big is the perf tradeoff?

Since stacklevels are a constant hassle, one option would be to take the find_stacklevel function and change it so that, instead of hard-coding "astype", it just looks for the first call that isn't from inside (non-test) pandas.

jorisvandenbossche (Member Author):

It's not the stacklevel as such, it's the warning itself. With caching, it is raised only once, while otherwise the warning is raised every time you use it.

The other option would be to check for this case / raise the warning a level higher up (e.g. at the line we are commenting on), so that the other cases still use the cached version.
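(Illustration, not from the PR: a cached function only runs its body, and hence only emits its warning, on a cache miss. That is why the deprecation warning for datetime dtypes has to bypass the cache.)

import functools
import warnings

@functools.lru_cache(maxsize=128)
def cached_op(x):
    # Emitted only when the body actually runs, i.e. on a cache miss.
    warnings.warn("deprecated usage", FutureWarning)
    return x

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cached_op(1)  # miss: body runs, warning emitted
    cached_op(1)  # hit: body skipped, no warning
print(len(caught))  # 1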

jreback merged commit ec56dd2 into pandas-dev:master on Mar 5, 2021
jreback (Contributor) commented Mar 5, 2021

thanks @jorisvandenbossche very nice

jbrockmendel (Member):

@jorisvandenbossche I'm looking at pandas/tests/indexing/multiindex/test_indexing_slow.py::test_multiindex_get_loc and finding that disabling the lru_cache cuts the runtime roughly in half for the BlockManager (about 40% for the ArrayManager). Can you confirm?

jbrockmendel (Member) commented Feb 21, 2023

Disabling the cache speeds up the test suite by about 20% for me. I take that back; I was using an out-of-date baseline.

jorisvandenbossche (Member Author):

Do you have a specific reproducer as a small snippet (outside of pytest)? When running those tests, I do indeed see a speedup when disabling the cache, but running that test includes both pytest overhead and several ways we end up calling take (sort_values, the actual MultiIndex getitem, the assert functions, ...), so it might be easier to profile and understand what is going on with a narrowed-down use case.

jbrockmendel (Member):

Trimming down test_multiindex_get_loc to something minimal-ish:

import numpy as np
import pandas as pd

m = 50
n = 1000
cols = ["jim", "joe", "jolie", "joline", "jolia"]

vals = [
    np.random.randint(0, 10, n),
    np.random.choice(list("abcdefghij"), n),
    np.random.choice(pd.date_range("20141009", periods=10).tolist(), n),
    np.random.choice(list("ZYXWVUTSRQ"), n),
    np.random.randn(n),
]
vals = list(map(tuple, zip(*vals)))

df = pd.DataFrame(vals, columns=cols)
mi = df.set_index(cols[:-1])

key = (8, 'h', pd.Timestamp('2014-10-17 00:00:00'), 'S')
i = 0
k = key[i]

mask = df.iloc[:, i] == k

%timeit right = df[mask]
1.57 ms ± 53.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- main
174 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <- cache disabled

Removing the date_range from vals (and an entry from cols) brings the two back to having the cache case outperforming by a few percent.

First guess is that for some reason dt64NaT is causing a cache miss, but calling _maybe_promote_cached with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns"), np.datetime64) isn't showing a miss.

jorisvandenbossche (Member Author):

So it seems that it is "taking" the datetimelike values then, so further trimming it down to just that:

import numpy as np
import pandas as pd

n = 1000
vals = np.random.choice(pd.date_range("20141009", periods=10).tolist(), n)
ser = pd.Series(vals)
mask = np.random.randint(0, 10, n) == 0

%timeit ser[mask]

df = pd.DataFrame({"col": vals})

%timeit df[mask]

Some observations:

  • The slowdown only occurs in the DataFrame case, not for Series. But that's simply because the Series code path doesn't go through those cached functions.
  • It's due to _maybe_promote_cached, not _get_take_nd_function_cached.
  • I checked the time of hashing both dtype = np.dtype('<M8[ns]') and fill_value = np.datetime64('NaT'), but nothing about that seems concerning.
  • Also, explicitly timing _maybe_promote(_cached) doesn't seem to show anything:
In [1]: dtype = np.dtype('<M8[ns]')
   ...: fill_value = np.datetime64('NaT')

# version without cache
In [2]: %timeit pd.core.dtypes.cast._maybe_promote(dtype, fill_value)
2.8 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit pd.core.dtypes.cast.maybe_promote(dtype, fill_value)
468 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

jbrockmendel (Member):

Thanks for taking a look. I get similar numbers when profiling just the _?maybe_promote functions.

Speculation: each time we call maybe_promote with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns")), we are actually passing a new dt64 object. When doing the dict lookup it correctly gets hash equality and then checks for list equality on the cache key, which comes back False because NaT never compares equal, not even to itself. So we get cache misses.
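(Illustration, not from the thread: a hypothetical cached_promote stand-in showing how fresh NaT objects defeat an lru_cache, because NaT compares unequal even to itself.)

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def cached_promote(dtype, fill_value, fill_value_type):
    # Stand-in for _maybe_promote_cached; only the caching behaviour matters here.
    return dtype, fill_value

nat1 = np.datetime64("NaT", "ns")
nat2 = np.datetime64("NaT", "ns")
print(nat1 == nat2)  # False: NaT never compares equal, even to itself

cached_promote(np.dtype("M8[ns]"), nat1, type(nat1))
cached_promote(np.dtype("M8[ns]"), nat2, type(nat2))
print(cached_promote.cache_info())  # 2 misses, 0 hits: every NaT call misses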

Let's check this out... At the module-level in dtypes.cast I define dt64nat = np.datetime64("NaT") and in maybe_promote I do

    if isinstance(fill_value, np.datetime64) and isna(fill_value):
        fill_value = dt64nat

Then using the example from #39692 (comment)

In [2]: %timeit right = df[mask]
182 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

about 8 microseconds slower than the disabling-the-cache measurement, which seems reasonable for the extra check added. Calling this a win.

Presumably we'd see the same issue with td64nat, or in any case where we passed a NaN that wasn't specifically np.nan (which is probably rare).
