PERF: optimize algos.take for repeated calls #39692
Conversation
pandas/core/algorithms.py
```python
else:
    # check for promotion based on types only (do this first because
    # it's faster than computing a mask)
    dtype, fill_value = maybe_promote_cached(arr.dtype, fill_value)
```
This function is also copied verbatim out of take_nd (to be reused); the only actual code change is to use _maybe_promote_cached here instead of maybe_promote.
As an example benchmark of
Master:
PR:
So here it improves for the BlockManager as well (and in this case the ArrayManager is faster, because it has less overhead creating all the blocks for each column). Specifically for this case of Categorical data, there is a lot of room for improvement by making the Categorical constructor faster for cases where less validation is needed, but that's another topic.
pandas/core/algorithms.py
```python
@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
    """
    Part of _get_take_nd_function below that doesn't need the mask
```
this comment doesn't make any sense w/o the PR context. can you put / move a doc-string here. typing a +1
The "and thus can be cached" on the next line is the essential continuation of the sentence.
The mask can be an array, which is not hashable and thus cannot be used as an argument for a cached function.
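To illustrate why the mask (or mask_info) cannot simply be part of the lru_cache key, here is a minimal, hypothetical example (not code from this PR):

```python
import functools

import numpy as np


@functools.lru_cache(maxsize=128)
def cached_lookup(ndim, dtype, mask_info):
    return ndim, dtype, mask_info


# Scalars and dtypes are hashable, so this caches fine ...
cached_lookup(1, np.dtype("float64"), None)

# ... but a boolean mask array is unhashable, so it cannot be a cache key.
try:
    cached_lookup(1, np.dtype("float64"), np.array([True, False]))
except TypeError as err:
    print(err)  # unhashable type: 'numpy.ndarray'
```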
pandas/core/algorithms.py
```python
    return None
```
why is the caching not on this function? having too many levels of indirection is -1
I will clarify the comment above; the mask_info argument to this function is not hashable.
@jorisvandenbossche a general comment: targeted PRs are much better for this type of refactoring. Sure, occasionally bigger ones are needed. But please try to make small incremental changes (I know you probably feel this is a small incremental change), but it's not. Meaning, break things into multiple PRs where you first move things, then change them up. This may seem like more work on your part, and might be a bit more, but it will make reviews faster, with less back and forth and arguing.
I think this is indeed a rather targeted change. A large part of the diff is caused by copying part of a function into helper functions to be able to reuse it (and I commented on those parts to indicate where this happened). Sure, I can do the move before this PR.
@jorisvandenbossche the timings in the OP look like this makes the non-ArrayManager code very slightly slower (well within margin of error) for this particular use case. are there other use cases where this improves perf?
yea, pls remove the alias, that would help
-> precursor just moving code: #39728
It's not slower, just a small variation in timing for something that didn't change noticeably. At least, the precision of the timing is not good enough to know whether it's a tiny bit slower or a tiny bit faster. I could have repeated it to have the difference the other way around ;)
I didn't check where this would improve performance for the BlockManager noticeably. In principle, any case where the performance is dominated by the overhead of
I suppose that you could measure the benefit in a single
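As a rough, hypothetical illustration of the kind of workload where the per-call overhead matters (many repeated take calls with the same dtype/fill_value on small arrays; not a benchmark from this PR):

```python
import timeit

import numpy as np
from pandas.api.extensions import take

arr = np.arange(5, dtype="float64")
indexer = np.array([0, 2, 4, -1])

# Each call re-derives the promoted dtype/fill_value and looks up the take
# function; on tiny arrays this fixed overhead dominates the actual copy.
elapsed = timeit.timeit(
    lambda: take(arr, indexer, allow_fill=True, fill_value=np.nan),
    number=10_000,
)
print(f"{elapsed / 10_000 * 1e6:.2f} µs per call")
```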
can you add a comment along the lines of
Where do you want me to add that comment? Specifically on the specialized
Updated (and also removed the
OK, so that version looks more or less like this (@jbrockmendel did I understand that correctly?):

```python
def maybe_promote_with_cache_version1(dtype: np.dtype, fill_value=np.nan):
    if not is_hashable(fill_value):
        if not is_object_dtype(dtype):
            raise ValueError("fill_value must be a scalar")
        return dtype, fill_value
    return _maybe_promote_cached(dtype, fill_value, type(fill_value))


@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    ... current implementation of maybe_promote ...
```

while the version I have in this PR basically looks like:

```python
def maybe_promote_with_cache_version2(dtype, fill_value):
    try:
        return _maybe_promote_cached(dtype, fill_value, type(fill_value))
    except TypeError:
        return _maybe_promote(dtype, fill_value)


@functools.lru_cache(maxsize=128)
def _maybe_promote_cached(dtype, fill_value, fill_value_type):
    return _maybe_promote(dtype, fill_value)


def _maybe_promote(dtype, fill_value):
    ... current implementation of maybe_promote ...
```

(and to be clear the

Timing those options:
So the version with try/except is clearly faster than the first version with caching. But of course both are in the nanoseconds, so in absolute numbers the difference is not that big. For the example that I use in the top post,
Personally I don't think the two versions are that different in terms of complexity. But if there is a strong preference for the slightly slower version 1 compared to version 2, I am fine with either (the speed-up on the non-microbenchmark is not very significant).
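For context, a self-contained sketch of how the promotion cost itself can be micro-benchmarked; this uses the internal pandas.core.dtypes.cast.maybe_promote, which is not public API and may change between versions:

```python
import timeit

import numpy as np
from pandas.core.dtypes.cast import maybe_promote

# maybe_promote returns the (possibly upcast) dtype plus the fill_value to use,
# e.g. int64 + NaN has to become float64.
print(maybe_promote(np.dtype("int64"), np.nan))  # (dtype('float64'), nan)

t = timeit.timeit(lambda: maybe_promote(np.dtype("int64"), np.nan), number=100_000)
print(f"{t / 100_000 * 1e9:.0f} ns per call")
```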
@jorisvandenbossche ok to go with your current version, i would just add a comment explaining the try/except (e.g. that non-hashables will except; maybe you did this already).
I'll take another look this afternoon.
AFAICT there isn't any functional difference between maybe_promote_with_cache_version1 vs maybe_promote_with_cache_version2, so let's go with the faster one. can you move it to dtypes.cast and we'll get this done
looks fine, can you merge master to make sure this is green.
pandas/core/array_algos/take.py
```
@@ -177,41 +178,60 @@ def take_2d_multi(
    return out


@functools.lru_cache(maxsize=128)
def __get_take_nd_function_cached(ndim, arr_dtype, out_dtype, axis):
```
does this need to be a dunder?
Nope, changed
```python
# TODO(2.0): need to directly use the non-cached version as long as we
# possibly raise a deprecation warning for datetime dtype
if dtype.kind == "M":
    return _maybe_promote(dtype, fill_value)
```
This is a bit unfortunate, but to ensure the warning is always shown, we can't use the cached version for datetime data.
I checked what would be the fastest option. The most specific check would be isinstance(fill_value, date) and not isinstance(fill_value, datetime), but dtype.kind == "M" is a bit faster.
So the trade-off was between "faster for all non-M8 dtypes" vs "faster for M8 (by being able to use the cached version in most cases) but a bit slower for all other dtypes". I went with the first (fastest for numeric dtypes).
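A hypothetical micro-benchmark of just those two guard checks (illustrative only; the numbers discussed above come from the author's own timings):

```python
import timeit
from datetime import date, datetime

import numpy as np

dtype = np.dtype("float64")  # the common, non-datetime case
fill_value = np.nan

# Option chosen in the PR: a cheap dtype-based check that sends every M8
# dtype down the non-cached path.
t_kind = timeit.timeit(lambda: dtype.kind == "M", number=1_000_000)

# Most specific alternative: a value-based check that only diverts the
# deprecated date-but-not-datetime fill_value case.
t_isinstance = timeit.timeit(
    lambda: isinstance(fill_value, date) and not isinstance(fill_value, datetime),
    number=1_000_000,
)

print(f"dtype.kind check : {t_kind:.3f} s per 1e6 calls")
print(f"isinstance check : {t_isinstance:.3f} s per 1e6 calls")
```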
how big is the perf tradeoff?
since stacklevels are a constant hassle, one option would be to take the find_stacklevel function and change it so that instead of hard-coding "astype" it just looks for the first call that isn't from inside (non-test) pandas
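A rough sketch of what such a helper could look like (hypothetical; the name and details are illustrative, not the actual pandas implementation):

```python
import inspect
import os

import pandas as pd


def find_stacklevel() -> int:
    """Return the stacklevel of the first frame outside (non-test) pandas."""
    pkg_dir = os.path.dirname(pd.__file__)
    test_dir = os.path.join(pkg_dir, "tests")

    frame = inspect.currentframe()
    n = 0
    # Walk outward from the warn() call site; every frame that still lives in
    # pandas (but not in its test suite) adds one level to the stacklevel.
    while frame is not None:
        filename = inspect.getfile(frame)
        if filename.startswith(pkg_dir) and not filename.startswith(test_dir):
            frame = frame.f_back
            n += 1
        else:
            break
    return n
```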
It's not the stacklevel as such, it's the warning itself. With caching, it occurs only once, while otherwise this warning is raised every time you use it.
The other option would be to check for this case / raise the warning a level higher up (so e.g. the line we are commenting on), so that other cases still use the cached version.
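A minimal demonstration of the warn-only-once effect of caching (hypothetical stand-in function, not the actual pandas code):

```python
import functools
import warnings


@functools.lru_cache(maxsize=128)
def cached_promote_demo(dtype_str):
    # stand-in for the cached maybe_promote path that raises a deprecation warning
    warnings.warn("deprecated fill_value behaviour", FutureWarning, stacklevel=2)
    return dtype_str


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for _ in range(3):
        cached_promote_demo("M8[ns]")

# The warning fires only on the cache miss, not on the two cache hits.
print(len(caught))  # 1
```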
thanks @jorisvandenbossche very nice
@jorisvandenbossche I'm looking at pandas/tests/indexing/multiindex/test_indexing_slow.py::test_multiindex_get_loc and finding that disabling the lru_cache cuts the runtime in about half (for BM, about 40% for AM). Can you confirm?
Do you have a specific reproducer as a small snippet (outside of pytest)? When running those tests, I do indeed see a speedup when disabling the cache, but running that test includes both pytest overhead and several ways we end up calling take (sort_values, the actual MI getitem, the assert functions, ...), so it might be easier with a narrowed-down use case to profile it / understand what is going on.
Trimming down test_multiindex_get_loc to something minimal-ish:
Removing the date_range from
First guess is that for some reason dt64NaT is causing a cache miss, but calling _maybe_promote_cached with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns"), np.datetime64) isn't showing a miss.
So it seems that it is "taking" the datetimelike values then, so further trimming it down to just that:
Some observations:
Thanks for taking a look. I get similar numbers when profiling just the _?maybe_promote functions. Speculation: each time we call maybe_promote with (np.dtype("M8[ns]"), np.datetime64("NaT", "ns")), we are actually passing a new dt64 object. When doing the dict lookup it correctly gets hash-equality and then checks for list-equality, which then shows up as False. So we get cache misses. Let's check this out... At the module-level in dtypes.cast I define
Then using the example from #39692 (comment)
about 8 microseconds slower than the disabling-the-cache measurement, which seems reasonable for the extra check added. Calling this a win. Presumably we'd see the same issue with td64nat, or in any case where we passed a NaN that wasn't specifically
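A minimal sketch of the cache-miss behaviour speculated about above (the cached function is a hypothetical stand-in for _maybe_promote_cached; the hash/equality behaviour of NaT is as described in the comment):

```python
import functools

import numpy as np


@functools.lru_cache(maxsize=128)
def cached_promote(dtype, fill_value, fill_value_type):
    # hypothetical stand-in for _maybe_promote_cached
    return dtype, fill_value


# Two freshly created NaT scalars hash the same, but (NaN-like) do not
# compare equal, so the cache's equality check fails and we get misses.
a = np.datetime64("NaT", "ns")
b = np.datetime64("NaT", "ns")
print(hash(a) == hash(b))  # True
print(a == b)              # False

for _ in range(3):
    # a new NaT object on every call -> every call is a cache miss
    cached_promote(np.dtype("M8[ns]"), np.datetime64("NaT", "ns"), np.datetime64)
print(cached_promote.cache_info())  # hits=0, misses=3

cached_promote.cache_clear()
for _ in range(3):
    # reusing the same object lets the identity shortcut kick in -> cache hits
    cached_promote(np.dtype("M8[ns]"), a, np.datetime64)
print(cached_promote.cache_info())  # hits=2, misses=1
```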
This PR optimizes our internal take algorithm, with the following changes: caching the results of maybe_promote and _get_take_nd_function for repeated calls. This benefits the general take_nd function as well, but in addition I also added a take_1d_array specialized version that assumes the input is already an array and only deals with 1D. This gives a small further speed-up.

An example use case where the effect of this optimization can be noticed nicely is unstack in case of the ArrayManager. This is typically a case where we do multiple take calls for 1 column (to split it into multiple columns) and thus call it many times for the same dtype/fill_value.

Using the example from the reshape.py::SimpleReshape ASV benchmark (this is an example of homogeneous dtype with a single block, which is thus fast with the BlockManager using the _can_fast_transpose fastpath in _unstack_frame):

Master:
PR:

So for the BlockManager this doesn't matter (it does only a single call to take_nd for the column values), but for the ArrayManager version this gives a nice speedup and brings it within 2x of the BlockManager version (for this benchmark).
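For reference, a rough approximation of that benchmark's setup (a hedged sketch, not the exact ASV code):

```python
import numpy as np
import pandas as pd

# Roughly what reshape.py::SimpleReshape sets up: a two-level MultiIndex over
# a homogeneous float frame, which unstack() then pivots into many columns.
arrays = [np.repeat(np.arange(100), 100), np.tile(np.arange(100), 100)]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(10_000, 4), index=index)

# With the ArrayManager, unstacking issues many take calls that share the same
# dtype/fill_value, which is exactly the pattern the caching in this PR targets.
result = df.unstack()
print(result.shape)  # (100, 400)
```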