Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in lineplot with ci=None (or errorbar=None) #3006

Closed
palao opened this issue Sep 8, 2022 · 4 comments · Fixed by #3081
Closed

Performance regression in lineplot with ci=None (or errorbar=None) #3006

palao opened this issue Sep 8, 2022 · 4 comments · Fixed by #3081

Comments

@palao
Copy link

palao commented Sep 8, 2022

Hello,
I have noticed that seaborn-0.12.0 takes much more time (see the numbers below) to create a lineplot without error bars than seaborn-0.11.2 used to.

The following code snippet demonstrates the problem:

import numpy as np
import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt

N = 1000000
x = np.random.randint(0, high=10000000000, size=N)
y = np.random.random(N)
df = DataFrame({"x": x, "y": y})

print(f"Using seaborn-{sns.__version__}:")

plot = sns.lineplot(data=df, x="x", y="y", ci=None)

Measuring the runtime using seaborn-0.11.2 one gets:

$ time python devel/examples/seaborn/slow_lineplot.py 
Using seaborn-0.11.2:

real    0m3,046s
user    0m3,302s
sys     0m1,603s

But with seaborn-0.12.0:

$ time python devel/examples/seaborn/slow_lineplot.py 
Using seaborn-0.12.0:
/home/palao/devel/examples/seaborn/slow_lineplot.py:35: FutureWarning: 

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  plot = sns.lineplot(

real    7m9,201s
user    7m7,049s
sys     0m2,920s

Note: replacing ci=None with errorbar=None has no impact in the performance of seaborn-0.12.0 (and obviously errorbar is not a valid keyword parameter in seaborn-0.11.2).

After inspecting the code, it seems to me that even if ci=None is set (or errorbar=None) in version 0.12.0, some kind of error bars are computed and that has a huge impact in the performance when the plot has many point. In a real case plot, the runtime to produce it was around one hour, almost 100 times more that it used to be with seaborn-0.11.2

I'm using CPython-3.9.13 and matplotlib-3.5.1

My only solution for now is to pin seaborn==0.11.2

Is this a real bug? Or am I doing/understanding something wrong?

@palao palao changed the title Performance regression in lineplot with ci=None (or errorbar=None) Performance regression in lineplot with ci=None (or errorbar=None) Sep 8, 2022
@mwaskom mwaskom added this to the v0.12.1 milestone Sep 8, 2022
@mwaskom
Copy link
Owner

mwaskom commented Sep 8, 2022

I can reproduce the perf regression here but I don't think the issue is the errorbar implementation per se (that is it's not computing error bars even though you're asking it not to). Instead I think it's changes to the aggregation logic that support the more generic errorbar interface. If I change your example to N = 10_000 (which is more manageable) I get

%time plot = sns.lineplot(data=df, x="x", y="y", errorbar=None)
CPU times: user 3.78 s, sys: 55.6 ms, total: 3.83 s
Wall time: 3.98 s

But if I do estimator=None instead of errorbar=None I get

%time plot = sns.lineplot(data=df, x="x", y="y", estimator=None)
CPU times: user 72.3 ms, sys: 5.93 ms, total: 78.2 ms
Wall time: 88.1 ms

So my guess is that the new implementation is evaluating the estimator function for each unique x value, whereas before it may have been short-circuiting.

We should try to address the performance regression, but I think you want to put estimator=None either way.

@jianghui1229
Copy link

学习中!

@palao
Copy link
Author

palao commented Sep 12, 2022

@mwaskom , thank you for the suggestion!
And thank you for the great work with seaborn.

@mwaskom
Copy link
Owner

mwaskom commented Oct 13, 2022

Some insight from profiling:

Nearly all of the time spent in _LinePlotter.plot is spend doing the aggregation:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   378                                               def plot(self, ax, kws):
   ...
   430         1          0.0      0.0      0.0              if self.estimator is not None:
   431         1          0.0      0.0      0.0                  if "units" in self.variables:
   432                                                               # TODO eventually relax this constraint
   433                                                               err = "estimator must be None when specifying units"
   434                                                               raise ValueError(err)
   435         1          0.1      0.1      0.0                  grouped = sub_data.groupby(orient, sort=self.sort)
   436                                                           # Could pass as_index=False instead of reset_index,
   437                                                           # but that fails on a corner case with older pandas.
   438         1       2997.2   2997.2     99.0                  sub_data = grouped.apply(agg, other).reset_index()

And then within that method, most of the time is actually spent ... wrapping the outputs in a pandas Series:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   480                                               def __call__(self, data, var):
   481                                                   """Aggregate over `var` column of `data` with estimate and error interval."""
   482     10000        280.8      0.0      9.1          vals = data[var]
   483     10000          3.4      0.0      0.1          if callable(self.estimator):
   484                                                       # You would think we could pass to vals.agg, and yet:
   485                                                       # https://github.com/mwaskom/seaborn/issues/2943
   486                                                       estimate = self.estimator(vals)
   487                                                   else:
   488     10000       1332.1      0.1     43.0              estimate = vals.agg(self.estimator)
   489                                           
   490                                                   # Options that produce no error bars
   491     10000          3.1      0.0      0.1          if self.error_method is None:
   492     10000          3.6      0.0      0.1              err_min = err_max = np.nan
   ...
   516     10000       1476.6      0.1     47.6          return pd.Series({var: estimate, f"{var}min": err_min, f"{var}max": err_max})

I think the solution here is just to check whether any aggregation will happen and skip this entire step if it won't. That's implemented in #3081.

There are probably still some edge cases here ... e.g. if you have many unique x values but a few repeats, it will get past the check and then probably run significantly slower than in v0.11. But that sounds like a situation where estimator=None is the desired parameterization anyway (and unlike here, results would be incorrect when not used). So I can live with that edge case but someone should feel free to ping if they have a compelling application that is not addressed by this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants