Optimize array_equivalent for NDFrame.equals #35328

TomAugspurger · 2020-07-17T15:51:14Z

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: N = 10**3
In [4]: float_df = pd.DataFrame(np.random.randn(N, N))
# 1.0.5
1.52 ms ± 94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# master
83.6 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch
31.2 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And proof of this being better for short-circuiting:

In [4]: N = 10**3

In [5]: a = pd.DataFrame(np.random.randn(N, N))

In [6]: b = pd.DataFrame(np.random.randn(N, N))

In [7]: %timeit a.equals(b)
# 1.0.5
1.97 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit a.equals(b)
56.4 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Discussion: the key point is that BlockManager.equals requires the dtypes of left and right to match. If they don't, we've already returned False. That lets us do a few things:

Avoid many is_foo_dtype checks.
Use optimized checks like np.isnan.
Avoid the slow case of considering np.array([1, 2]) equivalent to np.array([1.0, 2.0]).
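
As a rough illustration of the points above (not the code in this PR), here is a minimal NumPy-only sketch of a comparison that checks dtypes first. The function name `equals_with_dtype_gate` is hypothetical, and the dtype check is only a stand-in for what BlockManager.equals is described as doing before the arrays are compared.

```python
import numpy as np


def equals_with_dtype_gate(left: np.ndarray, right: np.ndarray) -> bool:
    """Hypothetical sketch: compare two arrays only after a dtype check."""
    # Stand-in for the dtype check described above: a mismatch means "not equal",
    # so int64 [1, 2] is never compared element-wise with float64 [1.0, 2.0].
    if left.dtype != right.dtype:
        return False
    if left.shape != right.shape:
        return False
    if left.dtype.kind == "f":
        # Both arrays are known to be float, so np.isnan is safe to call directly.
        return bool(((left == right) | (np.isnan(left) & np.isnan(right))).all())
    # Other dtypes: plain element-wise equality (no NaN handling in this sketch).
    return bool(np.array_equal(left, right))


ints = np.array([1, 2])
floats = np.array([1.0, 2.0])
with_nan = np.array([1.0, np.nan])
```

With these inputs, `np.array_equal(ints, floats)` treats the two arrays as equal, while `equals_with_dtype_gate(ints, floats)` returns False without looking at the values. `equals_with_dtype_gate(with_nan, with_nan.copy())` returns True because NaNs in matching positions are treated as equal.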

simonjayhawkins · 2020-07-17T17:24:22Z

restarting failed travis job

=========================== short test summary info ============================
FAILED pandas/tests/io/parser/test_common.py::test_chunks_have_consistent_numerical_type[python]
= 1 failed, 69943 passed, 4255 skipped, 1020 xfailed, 13 warnings in 808.93s (0:13:28) =

jreback · 2020-07-17T17:30:09Z

pandas/core/dtypes/missing.py

@@ -391,43 +399,28 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    if left.shape != right.shape:
        return False

+    if dtype_equal:


I am not sure I understand why you actually need this.
The dtype check is cheap compared to the actual check, no?

They add up, especially when doing things columnwise for small, wide dataframes

We get to skip a few other checks, like swapping np.isnan for pd.isna for float, and we can skip the empty check at https://github.com/pandas-dev/pandas/pull/35328/files#diff-ff8364cee9a3e1ef3a3825cb2cdd26d8L431.

ok that's fine

jreback · 2020-07-17T17:40:38Z

pandas/core/dtypes/missing.py

@@ -452,6 +445,45 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    return np.array_equal(left, right)


+def array_equivalent_float(left, right):


maybe make all of these helpers private functions

jbrockmendel · 2020-07-17T17:53:08Z

pandas/core/dtypes/missing.py

@@ -452,6 +445,45 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    return np.array_equal(left, right)


+def _array_equivalent_float(left, right):
+    return ((left == right) | (np.isnan(left) & np.isnan(right))).all()


what about use_inf_as_na?

Can this be re-used on L426? (Also, I think the prod(shape) check on L424-425 is redundant with the earlier shape check.)

It seems like 1.0.5 doesn't care about that

In [15]: df1 = pd.DataFrame({"A": np.array([np.nan, 1, np.inf])})
In [16]: df2 = pd.DataFrame({"A": np.array([np.nan, 1, np.nan])})
In [17]: with pd.option_context('mode.use_inf_as_na', True):
    ...:     print(df1.equals(df2))
    ...:
False

Makes sense, thanks for checking
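
To make the behaviour discussed in this thread concrete, here is a small NumPy-only sketch that applies the expression from `_array_equivalent_float` in the diff above to the same values used in the session. The wrapper name `nan_aware_equal` is hypothetical, and the sketch does not consult any pandas option.

```python
import numpy as np


def nan_aware_equal(left: np.ndarray, right: np.ndarray) -> bool:
    """Hypothetical wrapper around the expression shown in the diff."""
    # Positions are equal if the values match, or if both are NaN.
    # inf only matches inf: np.isnan(np.inf) is False.
    return bool(((left == right) | (np.isnan(left) & np.isnan(right))).all())


a = np.array([np.nan, 1, np.inf])
b = np.array([np.nan, 1, np.nan])

same = nan_aware_equal(a, a.copy())  # NaN matches NaN, inf matches inf
different = nan_aware_equal(a, b)    # inf in `a` vs NaN in `b` at the last position
```

Here `same` is True and `different` is False. Since the expression never treats inf as missing, it gives the same answer as the 1.0.5 session above for these inputs.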

TomAugspurger · 2020-07-17T18:49:21Z

Travis finished. I'm going to selfishly ignore #35328 (comment) so that I can get the RC process rolling, but happy to engage more on followups if I missed something.

TomAugspurger added 3 commits July 16, 2020 15:16

wip

c0d6a40

Merge remote-tracking branch 'upstream/master' into 35249-array-equiv…

46a27b9

…alent-opt

PERF: Optimize array_equivalent

94f1ba9

TomAugspurger added the Performance (Memory or execution speed performance) label Jul 17, 2020

TomAugspurger added this to the 1.1 milestone Jul 17, 2020

partial revert

f551a20

jreback requested changes Jul 17, 2020

jreback reviewed Jul 17, 2020

TomAugspurger mentioned this pull request Jul 17, 2020

RLS: 1.1 #34730

Closed

TomAugspurger added 2 commits July 17, 2020 12:45

Merge remote-tracking branch 'upstream/master' into 35249-array-equiv…

0c83457

…alent-opt

rename

b5d06a9

jreback approved these changes Jul 17, 2020

jbrockmendel reviewed Jul 17, 2020

TomAugspurger merged commit d33c801 into pandas-dev:master Jul 17, 2020

TomAugspurger deleted the 35249-array-equivalent-opt branch July 17, 2020 18:49
