PERF: optimize algos.take for repeated calls #39692

Merged · 24 commits · Mar 5, 2021

Conversation

jorisvandenbossche (Member) commented Feb 9, 2021

This PR optimizes our internal take algorithm, with the following changes:

  • Cache the expensive parts that are independent of the actual array values and depend only on the dtype / dimension / axis / fill_value (which is often the same for repeated calls), i.e. maybe_promote and _get_take_nd_function.
  • The above is done in the generic take_nd function, but in addition I also added a specialized take_1d_array version that assumes the input is already an array and only deals with 1D. This gives a small further speed-up. (A rough sketch of the caching idea is shown right below.)
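For illustration only, a minimal sketch of that caching idea (not the pandas implementation; _take_setup and simple_take_1d are hypothetical names). The dtype/fill_value-dependent setup is computed once per unique signature and then reused across repeated calls:

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def _take_setup(arr_dtype: np.dtype, fill_value) -> np.dtype:
    # Stand-in for the maybe_promote / _get_take_nd_function step: the result
    # depends only on the dtype and fill_value, so it can be cached and reused.
    return np.promote_types(arr_dtype, np.asarray(fill_value).dtype)

def simple_take_1d(arr: np.ndarray, indexer: np.ndarray, fill_value=np.nan) -> np.ndarray:
    # Repeated calls with the same dtype/fill_value hit the cache above and skip
    # the dtype-promotion work; -1 in the indexer marks positions to fill.
    out_dtype = _take_setup(arr.dtype, fill_value)
    out = arr.astype(out_dtype, copy=False).take(indexer)
    out[indexer == -1] = fill_value
    return out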

An example use case where the effect of this optimization shows up nicely is unstack with the ArrayManager. That is typically a case where we do multiple take calls for a single column (to split it into multiple columns), and thus call take many times with the same dtype/fill_value.

Using the example from the reshape.py::SimpleReshape ASV benchmark (a homogeneous-dtype, single-block case, so it is fast with the BlockManager thanks to the _can_fast_transpose fastpath in _unstack_frame):

import numpy as np
import pandas as pd

arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(10000, 4), index=index)
df_am = df._as_manager("array")

Master:

In [2]: %timeit df.unstack()
1.79 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit df_am.unstack()
15.1 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

PR:

In [2]: %timeit df.unstack()
1.81 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_am.unstack()
3.44 ms ± 69.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for the BlockManager this doesn't matter (it does only a single call to take_nd for the column values), but for the ArrayManager version this gives a nice speedup and brings it within 2x of the BlockManager version (for this benchmark).

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Feb 9, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 9, 2021
else:
# check for promotion based on types only (do this first because
# it's faster than computing a mask)
dtype, fill_value = maybe_promote_cached(arr.dtype, fill_value)
jorisvandenbossche (Member Author), Feb 9, 2021:
This function is also copied verbatim out of take_nd (to be reused); the only actual code change is to use _maybe_promote_cached here instead of maybe_promote.

jorisvandenbossche (Member Author):

As an example benchmark of unstack that doesn't use the 1-block fastpath (from reshape.py::Unstack::time_full_product[category]):

import string

import numpy as np
import pandas as pd

m = 100
n = 50

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
indices = np.random.randint(0, 52, size=(m * m, n))
values = np.take(list(string.ascii_letters), indices)
values = [pd.Categorical(v) for v in values.T]
df = pd.DataFrame(values, index, columns)
df_am = df._as_manager("array")

Master:

In [2]: %timeit df.unstack()
374 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df_am.unstack()
306 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR:

In [2]: %timeit df.unstack()
309 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df_am.unstack()
177 ms ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So here it improves for BlockManager as well (and in this case ArrayManager is faster, because it has less overhead creating all the blocks for each column)

(specifically for this case of Categorical data, there is a lot of room for improvement by making the categorical constructor faster for such cases where less validation is needed, but that's another topic)

pandas/core/algorithms.py (review thread, outdated and resolved)
@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
"""
Part of _get_take_nd_function below that doesn't need the mask
Contributor:
This comment doesn't make sense without the PR context. Can you put / move a doc-string here? Typing a +1.

jorisvandenbossche (Member Author):
The "and thus can be cached" on the next line is the essential continuation of the sentence.
The mask can be an array, and thus is not hashable and thus cannot be used as argument for a cached function.
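(Illustration, not part of the PR: a tiny demo of why an ndarray-valued argument cannot go through functools.lru_cache; cached_setup is a hypothetical stand-in.)

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def cached_setup(mask_info):
    return mask_info

mask = np.array([True, False, True])
cached_setup((mask, False))
# TypeError: unhashable type: 'numpy.ndarray'
# lru_cache has to hash its arguments to build the cache key, so the mask
# (and anything containing it) must stay outside the cached function.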


return None


Contributor:
Why is the caching not on this function? Having too many levels of indirection is a -1.

jorisvandenbossche (Member Author):
I will clarify the comment above; the mask_info argument to this function is not hashable.

jreback (Contributor) commented Feb 10, 2021

@jorisvandenbossche a general comment: targeted PRs are much better for this type of refactoring. Sure, occasionally bigger ones are needed, but please try to make small incremental changes (I know you probably feel this is a small incremental change, but it's not). Meaning: break things into multiple PRs where you first move things, then change them. This may seem like more work on your part, and it might be a bit more, but it will make reviews faster, with less back and forth and arguing.

jorisvandenbossche (Member Author):

I think this is indeed a rather targeted change. A large part of the diff is caused by copying parts of a function into helper functions so they can be reused (and I commented on those parts to indicate where this happened).

Sure, I can do the move before this PR.

jbrockmendel (Member):

@jorisvandenbossche the timings in the OP look like this makes the non-ArrayManager code very slightly slower (well within the margin of error) for this particular use case. Are there other use cases where this improves perf?

jreback (Contributor) commented Feb 10, 2021

Yeah, please remove the alias, that would help.

jorisvandenbossche (Member Author):

-> precursor PR that just moves code: #39728

jorisvandenbossche (Member Author):

the timings in the OP look like this makes the non-ArrayManager code very slightly slower (well within margin of error) for this particular use case. are there other use cases where this improves perf?

It's not slower, just small variation in timing for something that didn't change noticeably. At least, the precision of the timing is not good enough to tell whether it's a tiny bit slower or a tiny bit faster. I could have repeated it to get the difference the other way around ;)

I didn't check where this would noticeably improve performance for the BlockManager. In principle, any case where the performance is dominated by the overhead of take (so a case where take is called many times), but I don't know off the top of my head what such a case could be.

I suppose you could measure the benefit in a single take call (e.g. reindex) as well.
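(For illustration, not from the thread: one way to exercise that "many take calls with the same dtype/fill_value" pattern is a loop over the public pandas.api.extensions.take wrapper; the array sizes and repeat count below are arbitrary.)

import timeit

import numpy as np
from pandas.api.extensions import take

arr = np.arange(1_000, dtype="float64")
indexer = np.random.randint(-1, 1_000, size=200)  # -1 marks positions to fill

# Every call has the same dtype/fill_value signature, so the cached
# maybe_promote / _get_take_nd_function lookups are reused on each iteration.
elapsed = timeit.timeit(
    lambda: take(arr, indexer, allow_fill=True, fill_value=np.nan),
    number=10_000,
)
print(f"{elapsed:.3f}s for 10,000 take calls")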

jbrockmendel (Member):

can you add a comment along the lines of # TODO(AM): this can be removed if we dont end up using ArrayManager

jorisvandenbossche (Member Author):

can you add a comment along the lines of # TODO(AM): this can be removed if we dont end up using ArrayManager

Where do you want me to add that comment? Specifically on the specialized take_1d_array?
I think the caching is probably useful in general as well.

jorisvandenbossche (Member Author) commented Mar 2, 2021

Updated (and also removed the take_1d specialization for now, to focus on the caching, since that both gives the most significant improvement and generates the most discussion).

2. Since the hashability check can go at the top of maybe_promote, we can then refactor the rest of maybe_promote into the cached _maybe_promote.

OK, so that version looks more or less like this (@jbrockmendel did I understand that correctly?)

import functools

import numpy as np
from pandas.api.types import is_hashable, is_object_dtype

def maybe_promote_with_cache_version1(dtype: np.dtype, fill_value=np.nan):
    if not is_hashable(fill_value):
        if not is_object_dtype(dtype):
            raise ValueError("fill_value must be a scalar")
        return dtype, fill_value
    return _maybe_promote_cached(dtype, fill_value, type(fill_value))

@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    ... current implementation of maybe_promote ...

while the version I have in this PR basically looks like:

def maybe_promote_with_cache_version2(dtype, fill_value):
    try:
        return _maybe_promote_cached(dtype, fill_value, type(fill_value))
    except TypeError:
        return _maybe_promote(dtype, fill_value) 

@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    return _maybe_promote(dtype, fill_value)

def _maybe_promote(dtype, fill_value):
    ... current implementation of maybe_promote ...

(and to be clear, maybe_promote_with_cache_version1/2 would then become the actual maybe_promote that is used elsewhere in pandas; I'm just using those names here to distinguish and test/time them next to each other)

Timing those options:

In [2]: dtype = np.dtype(float)
   ...: fill_value = np.nan

In [3]: %timeit maybe_promote(dtype, fill_value)   # <--- the current, non-cached version
2.36 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %timeit maybe_promote_with_cache_version1(dtype, fill_value)
303 ns ± 3.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit maybe_promote_with_cache_version2(dtype, fill_value)
201 ns ± 3.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So the version with try/except is clearly faster than the first version with caching. But of course both are in the nanoseconds range, so in absolute numbers the difference is not that big.

For the example in the top post, maybe_promote gets called 400 times, which in this case means 400 × 100 ns = 40 µs gained by using the second version, while the full unstack operation takes around 3 ms. That is of course only a small percentage.

Personally I don't think the two versions are that different in terms of complexity. But if there is a strong preference for the slightly slower version 1 over version 2, I am fine with either (the speed-up on the non-microbenchmark is not very significant).

jreback (Contributor) commented Mar 2, 2021

@jorisvandenbossche ok to go with your current version; I would just add a comment explaining the try/except (e.g. that non-hashables will raise; maybe you did this already).

jbrockmendel (Member):

I'll take another look this afternoon.

jbrockmendel (Member):

AFAICT there isn't any functional difference between maybe_promote_with_cache_version1 and maybe_promote_with_cache_version2, so let's go with the faster one. Can you move it to dtypes.cast and we'll get this done.

jreback (Contributor) left a review comment:

Looks fine. Can you merge master to make sure this is green?

@@ -177,41 +178,60 @@ def take_2d_multi(
return out


@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
Member:
Does this need to be a dunder?

jorisvandenbossche (Member Author):
Nope, changed

Comment on lines +572 to +575
# TODO(2.0): need to directly use the non-cached version as long as we
# possibly raise a deprecation warning for datetime dtype
if dtype.kind == "M":
return _maybe_promote(dtype, fill_value)
jorisvandenbossche (Member Author):
This is a bit unfortunate, but to ensure the warning is always shown, we can't use the cached version for datetime data.

I checked what would be the fastest option. The most specific check would be if isinstance(fill_value, date) and not isinstance(fill_value, datetime), but if dtype.kind == "M" is a bit faster.
So the trade-off was between being faster for all non-M8 dtypes vs being faster for M8 (by being able to use the cached version in most cases) but a bit slower for all other dtypes. I went with the first (fastest for numeric dtypes).
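(Illustration, not from the PR: a sketch of how one could compare the cost of the two candidate checks; the specific numbers will depend on the machine and on the fill_value.)

import timeit
from datetime import date, datetime

import numpy as np

dtype = np.dtype("float64")
fill_value = np.nan

# Cost of the dtype-kind check vs the more specific isinstance-based check
t_kind = timeit.timeit(lambda: dtype.kind == "M", number=1_000_000)
t_isinstance = timeit.timeit(
    lambda: isinstance(fill_value, date) and not isinstance(fill_value, datetime),
    number=1_000_000,
)
print(t_kind, t_isinstance)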

Member:

How big is the perf tradeoff?

Since stacklevels are a constant hassle, one option would be to take the find_stacklevel function and change it so that, instead of hard-coding "astype", it just looks for the first call that isn't from inside (non-test) pandas.

jorisvandenbossche (Member Author):

It's not the stacklevel as such, it's the warning itself. With caching, it is raised only once, while otherwise the warning is raised every time you use it.

The other option would be to check for this case / raise the warning a level higher up (e.g. at the line we are commenting on), so that the other cases still use the cached version.
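(Illustration, not from the PR: a cached function only runs its body, and hence only emits its warning, on a cache miss. That is why the deprecation warning for datetime dtypes has to bypass the cache.)

import functools
import warnings

@functools.lru_cache(maxsize=128)
def cached_op(x):
    # Emitted only when the body actually runs, i.e. on a cache miss.
    warnings.warn("deprecated usage", FutureWarning)
    return x

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cached_op(1)  # miss: body runs, warning emitted
    cached_op(1)  # hit: body skipped, no warning
print(len(caught))  # 1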

jreback merged commit ec56dd2 into pandas-dev:master on Mar 5, 2021
jreback (Contributor) commented Mar 5, 2021

thanks @jorisvandenbossche very nice

jbrockmendel (Member):

@jorisvandenbossche I'm looking at pandas/tests/indexing/multiindex/test_indexing_slow.py::test_multiindex_get_loc and finding that disabling the lru_cache cuts the runtime roughly in half for the BlockManager (about 40% for the ArrayManager). Can you confirm?

jbrockmendel (Member) commented Feb 21, 2023

Disabling the cache speeds up the test suite by about 20% for me. I take that back; I was using an out-of-date baseline.

jorisvandenbossche (Member Author):

Do you have a specific reproducer as a small snippet (outside of pytest)? When running those tests, I do indeed see a speedup when disabling the cache, but running that test includes both pytest overhead and several ways we end up calling take (sort_values, the actual MultiIndex getitem, the assert functions, ...), so it might be easier to profile and understand what is going on with a narrowed-down use case.

jbrockmendel (Member):

Trimming down test_multiindex_get_loc to something minimal-ish:

import numpy as np
import pandas as pd

m = 50
n = 1000
cols = ["jim", "joe", "jolie", "joline", "jolia"]

vals = [
    np.random.randint(0, 10, n),
    np.random.choice(list("abcdefghij"), n),
    np.random.choice(pd.date_range("20141009", periods=10).tolist(), n),
    np.random.choice(list("ZYXWVUTSRQ"), n),
    np.random.randn(n),
]
vals = list(map(tuple, zip(*vals)))

df = pd.DataFrame(vals, columns=cols)
mi = df.set_index(cols[:-1])

key = (8, 'h', pd.Timestamp('2014-10-17 00:00:00'), 'S')
i = 0
k = key[i]

mask = df.iloc[:, i] == k

%timeit right = df[mask]
1.57 ms ± 53.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- main
174 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <- cache disabled

Removing the date_range from vals (and an entry from cols) brings the two back to having the cache case outperforming by a few percent.

First guess is that for some reason dt64NaT is causing a cache miss, but calling _maybe_promote_cached with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns"), np.datetime64) isn't showing a miss.

jorisvandenbossche (Member Author):

So it seems that it is "taking" the datetimelike values then, so further trimming it down to just that:

import numpy as np
import pandas as pd

n = 1000
vals = np.random.choice(pd.date_range("20141009", periods=10).tolist(), n)
ser = pd.Series(vals)
mask = np.random.randint(0, 10, n) == 0

%timeit ser[mask]

df = pd.DataFrame({"col": vals})

%timeit df[mask]

Some observations:

  • The slowdown only occurs in the DataFrame case, not for Series. But that's simply because the Series code path doesn't go through those cached functions.
  • It's due to _maybe_promote_cached, not _get_take_nd_function_cached.
  • I checked the time of hashing both dtype = np.dtype('<M8[ns]') and fill_value = np.datetime64('NaT'), but nothing about that seems concerning.
  • Also, explicitly timing _maybe_promote(_cached) doesn't seem to show anything:
In [1]: dtype = np.dtype('<M8[ns]')
   ...: fill_value = np.datetime64('NaT')

# version without cache
In [2]: %timeit pd.core.dtypes.cast._maybe_promote(dtype, fill_value)
2.8 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit pd.core.dtypes.cast.maybe_promote(dtype, fill_value)
468 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

jbrockmendel (Member):

Thanks for taking a look. I get similar numbers when profiling just the _?maybe_promote functions.

Speculation: each time we call maybe_promote with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns")), we are actually passing a new dt64 object. When doing the dict lookup it correctly gets hash equality and then checks for list equality on the cache key, which comes back False because NaT never compares equal, not even to itself. So we get cache misses.
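(Illustration, not from the thread: a hypothetical cached_promote stand-in showing how fresh NaT objects defeat an lru_cache, because NaT compares unequal even to itself.)

import functools

import numpy as np

@functools.lru_cache(maxsize=128)
def cached_promote(dtype, fill_value, fill_value_type):
    # Stand-in for _maybe_promote_cached; only the caching behaviour matters here.
    return dtype, fill_value

nat1 = np.datetime64("NaT", "ns")
nat2 = np.datetime64("NaT", "ns")
print(nat1 == nat2)  # False: NaT never compares equal, even to itself

cached_promote(np.dtype("M8[ns]"), nat1, type(nat1))
cached_promote(np.dtype("M8[ns]"), nat2, type(nat2))
print(cached_promote.cache_info())  # 2 misses, 0 hits: every NaT call misses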

Let's check this out... At the module-level in dtypes.cast I define dt64nat = np.datetime64("NaT") and in maybe_promote I do

    if isinstance(fill_value, np.datetime64) and isna(fill_value):
        fill_value = dt64nat

Then using the example from #39692 (comment)

In [2]: %timeit right = df[mask]
182 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

about 8 microseconds slower than the disabling-the-cache measurement, which seems reasonable for the extra check added. Calling this a win.

Presumably we'd see the same issue with td64nat, or in any case where we passed a NaN that wasn't specifically np.nan (which is probably rare).
