Optimize array_equivalent for NDFrame.equals #35328

Merged

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Jul 17, 2020

Closes #35249

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: N = 10**3
In [4]: float_df = pd.DataFrame(np.random.randn(N, N))
# 1.0.5
1.52 ms ± 94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# master
83.6 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# branch
31.2 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And proof of this being better for short-circuiting:

In [4]: N = 10**3

In [5]: a = pd.DataFrame(np.random.randn(N, N))

In [6]: b = pd.DataFrame(np.random.randn(N, N))

In [7]: %timeit a.equals(b)
# 1.0.5
1.97 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# branch
In [10]: %timeit a.equals(b)
56.4 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Discussion: the key thing here is that BlockManager.equals requires that the dtypes of left and right match. If they don't, we've already returned False. That lets us do a few things:

  1. Avoid many is_foo_dtype checks.
  2. Use optimized checks like np.isnan.
  3. Avoid the slow case of considering np.array([1, 2]) equivalent to np.array([1.0, 2.0]).
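
The three points above can be illustrated with a minimal sketch. This is not the code in this PR: `equivalent_same_dtype` is a hypothetical helper, and it folds the dtype gate (which the description places in BlockManager.equals) into the same function purely for illustration. It also handles only float, integer, and boolean arrays.

```python
import numpy as np


def equivalent_same_dtype(left: np.ndarray, right: np.ndarray) -> bool:
    """Hypothetical sketch: compare two arrays, requiring dtypes to match."""
    # Mismatched dtypes or shapes are rejected up front, so [1, 2] is never
    # treated as equivalent to [1.0, 2.0].
    if left.dtype != right.dtype or left.shape != right.shape:
        return False
    if left.dtype.kind == "f":
        # Float path: np.isnan is safe because both sides are known to be float.
        both_nan = np.isnan(left) & np.isnan(right)
        return bool(((left == right) | both_nan).all())
    if left.dtype.kind in "iub":
        # Integer and boolean arrays cannot hold NaN, so plain equality suffices.
        return bool(np.array_equal(left, right))
    raise NotImplementedError("this sketch covers float, int, and bool only")


ints = np.array([1, 2])
floats = np.array([1.0, 2.0])
with_nan = np.array([1.0, np.nan])

mixed_result = equivalent_same_dtype(ints, floats)
nan_result = equivalent_same_dtype(with_nan, with_nan.copy())
int_result = equivalent_same_dtype(ints, ints.copy())
```

Because the dtype is known once the gate has passed, each branch needs only one cheap dtype.kind lookup rather than a series of generic dtype checks.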

@TomAugspurger TomAugspurger added the Performance (Memory or execution speed performance) label Jul 17, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jul 17, 2020
@simonjayhawkins
Member

restarting failed travis job

=========================== short test summary info ============================
FAILED pandas/tests/io/parser/test_common.py::test_chunks_have_consistent_numerical_type[python]
= 1 failed, 69943 passed, 4255 skipped, 1020 xfailed, 13 warnings in 808.93s (0:13:28) =

@@ -391,43 +399,28 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    if left.shape != right.shape:
        return False

    if dtype_equal:
Contributor

I am not sure I understand why you actually need this.
The dtype check is cheap compared to the actual check, no?

Contributor Author

  1. They add up, especially when doing things columnwise for small, wide dataframes
  2. We get to skip a few other checks: for floats we can use np.isnan instead of pd.isna, and we can skip the empty check at https://github.com/pandas-dev/pandas/pull/35328/files#diff-ff8364cee9a3e1ef3a3825cb2cdd26d8L431.
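
One rough way to look at the per-call overhead mentioned above is to time pd.isna against np.isnan on a small float array, the kind of input a columnwise comparison sees many times on a small, wide frame. This is only a sketch: the array size and repeat count are arbitrary assumptions, and timings vary by machine and library version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# A small float column, standing in for one column of a wide frame.
col = np.random.randn(10)

# Both functions should produce the same missing-value mask for float input.
masks_agree = bool((pd.isna(col) == np.isnan(col)).all())

# Time many repeated calls; each call pays its own fixed overhead.
t_pd = timeit.timeit(lambda: pd.isna(col), number=10_000)
t_np = timeit.timeit(lambda: np.isnan(col), number=10_000)

print(f"pd.isna: {t_pd:.4f} s, np.isnan: {t_np:.4f} s for 10,000 calls")
```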

Contributor

ok that's fine

@@ -452,6 +445,45 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    return np.array_equal(left, right)


def array_equivalent_float(left, right):
Contributor

maybe make all of these helpers private functions

@TomAugspurger TomAugspurger mentioned this pull request Jul 17, 2020
@@ -452,6 +445,45 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
    return np.array_equal(left, right)


def _array_equivalent_float(left, right):
    return ((left == right) | (np.isnan(left) & np.isnan(right))).all()
Member

what about use_inf_as_na?

Can this be re-used on L426? (also the prod(shape) check on L424-425 i think is redundant with earlier shape check)

Contributor Author

It seems like 1.0.5 doesn't care about that:

In [15]: df1 = pd.DataFrame({"A": np.array([np.nan, 1, np.inf])})

In [16]: df2 = pd.DataFrame({"A": np.array([np.nan, 1, np.nan])})

In [17]: with pd.option_context('mode.use_inf_as_na', True):
    ...:     print(df1.equals(df2))
    ...:
    ...:
False
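
The same result can be reproduced at the array level with plain NumPy, using the expression from _array_equivalent_float in the diff. This sketch does not touch the pandas option at all; it only shows that np.isnan does not treat inf as missing, so the third position never matches.

```python
import numpy as np

left = np.array([np.nan, 1.0, np.inf])
right = np.array([np.nan, 1.0, np.nan])

# np.isnan is False for inf, so only the first position counts as "both NaN".
both_nan = np.isnan(left) & np.isnan(right)

# Same expression as the float helper in the diff: equal, or both NaN.
elementwise = (left == right) | both_nan
equivalent = bool(elementwise.all())

print(elementwise.tolist(), equivalent)
```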

Member

Makes sense, thanks for checking

@TomAugspurger
Contributor Author

Travis finished. I'm going to selfishly ignore #35328 (comment) so that I can get the RC process rolling, but happy to engage more on followups if I missed something.

@TomAugspurger TomAugspurger merged commit d33c801 into pandas-dev:master Jul 17, 2020
@TomAugspurger TomAugspurger deleted the 35249-array-equivalent-opt branch July 17, 2020 18:49
Successfully merging this pull request may close these issues.

Performance Regression in frame_methods.Equals.time_frame_float_equal
4 participants