
Histogram computation is inefficient with large samples #2550

Closed · mwaskom opened this issue Apr 14, 2021 · 6 comments · Fixed by #2570

mwaskom (Owner) commented Apr 14, 2021

seaborn always creates an array of bin edges and then passes that to numpy when computing the histogram:

bin_edges = self.bin_edges
if bin_edges is None:
    bin_edges = self.define_bin_edges(x, weights=weights, cache=False)

density = self.stat == "density"
hist, _ = np.histogram(
    x, bin_edges, weights=weights, density=density,
)

It does this to simplify the code, since there are many situations where multiple histograms need to be evaluated with the same bins.

But I've only just learned about a quirk of numpy's implementation: it uses an efficient algorithm when the bins are uniform, but not when it is handed an explicit array of bin edges derived from a uniform binning rule (those edges aren't exactly uniform due to floating point error): numpy/numpy#14602

As a result, there's a substantial timing/scaling difference between two seemingly similar patterns:

x = np.random.normal(0, 1, size=int(5e7))

%time np.histogram(x, bins=30)
CPU times: user 551 ms, sys: 11.9 ms, total: 563 ms

bins = np.histogram_bin_edges(x, 30)
%time np.histogram(x, bins=bins)
CPU times: user 3.27 s, sys: 33.1 ms, total: 3.31 s

With another order of magnitude more data, the second call runs for minutes.

It would be good to avoid this in seaborn where possible. To get common bins for multiple calls, we could store the number of bins and the bin range:

n_bins = len(bins) - 1
bin_range = bins[[0, -1]]
%time np.histogram(x, bins=n_bins, range=bin_range)
CPU times: user 463 ms, sys: 6.22 ms, total: 469 ms

There may be some subtleties in terms of edge effects to bear in mind, and it will add complexity, as there are situations where it will be necessary to track and use the full array of bin edges. (It does seem fine to supply both an array of bins and a range, but it looks like the latter is silently ignored.)
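
A minimal sketch of the idea, using a hypothetical helper (not seaborn's actual API) that caches the bin count and range rather than the edge array:

import numpy as np

def define_hist_params(x, bins=30):
    # Hypothetical helper: derive the binning once, but keep only the
    # count and range so repeated calls stay on numpy's fast uniform path.
    edges = np.histogram_bin_edges(x, bins)
    return len(edges) - 1, (edges[0], edges[-1])

x = np.random.normal(0, 1, size=int(5e7))
n_bins, bin_range = define_hist_params(x)

# Each histogram over the shared binning now avoids the slow path that
# an explicit array of edges would trigger:
hist, _ = np.histogram(x, bins=n_bins, range=bin_range)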

jhncls commented Apr 14, 2021

Maybe the number of bins and the bin range could also be recomputed for user-supplied bins, on the condition that np.diff(bins) gives almost-equal values? (Some additional care is needed to make sure the highest element falls into the last bin without influencing the other bin edges, and log scale might need to be looked at.) Or maybe one day this could happen automatically inside np.histogram?
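
For illustration, a rough version of that uniformity check (an assumption, not proposed seaborn code) could look like:

import numpy as np

def is_nearly_uniform(bins, rtol=1e-7):
    # True if user-supplied edges are close enough to evenly spaced that
    # they could be replaced by a (count, range) pair.
    widths = np.diff(bins)
    return np.allclose(widths, widths[0], rtol=rtol)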

mwaskom (Owner, Author) commented Apr 15, 2021

I don't think anything special should be done in the case where the user supplies an array. There's some discussion of this in the numpy issue. If they ultimately merge a patch that handles that case more efficiently, it will be good for seaborn users. But I just want to make sure that cases where uniform bins are inferred (from a reference rule, number of bins, or given binwidth) will be handled with efficient computation.
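
For what it's worth, each of those inferred-uniform cases reduces to a bin count plus a range; a sketch with illustrative names (these are not seaborn's internals):

import numpy as np

def uniform_params(x, bins="auto", binwidth=None):
    lo, hi = float(np.min(x)), float(np.max(x))
    if binwidth is not None:
        # Given binwidth: round the count up so the full range is covered.
        n = max(int(np.ceil((hi - lo) / binwidth)), 1)
    elif isinstance(bins, str):
        # Reference rule, e.g. "auto" or "fd".
        n = len(np.histogram_bin_edges(x, bins)) - 1
    else:
        # Explicit number of bins.
        n = bins
    return n, (lo, hi)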

ivirshup (Contributor) commented Apr 28, 2021

Responding to #2555 (comment) here, I took a look at the contribution of histogram computation vs Matplotlib methods. It looks to me like histogram calculation contributes a lot when you're looking at a single plot, but matplotlib quickly dominates once faceting is involved. I got all of these proportions from looking at the profile of the histplot and displot calls using snakeviz.

Looking at some profiling, I think histogram calculation is only going to be a large contributor to plot time in a few circumstances.

Largest fraction (50% of plot time, 6.5 seconds total):

sns.histplot(data=pd.DataFrame({"x": x}), x="x")

Surprisingly small fraction (7%, 25 seconds total):

sns.histplot(x)

But this looks like it's due to an issue in the handling of wide-form data (_assign_variables_wideform), where isinstance appears to be called on every element of x. Should be easy to fix.

Relative time spent decreases with faceting

5 facets

Here histogram computation takes 13-18% (it shows up in multiple calls: 13% in one, 5% in another), and 30% of the time is spent in ax.bar. ~19 seconds total.

df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(5, size=x.shape))})

sns.displot(data=df, x="x", col="c")

10 facets

Down to ~10% of the time spent on the histogram; ~25 seconds total.

df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(10, size=x.shape))})

sns.displot(data=df, x="x", col="c", col_wrap=5)
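
For reference, proportions like these can be reproduced with the snakeviz IPython magic mentioned above (the setup mirrors the snippets in this comment):

import numpy as np, pandas as pd, seaborn as sns

x = np.random.normal(0, 1, size=int(5e7))
df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(5, size=x.shape))})

%load_ext snakeviz
%snakeviz sns.displot(data=df, x="x", col="c")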

mwaskom (Owner, Author) commented Apr 28, 2021

Thanks for this! One thought: matplotlib subplot creation is often a bottleneck for small multiples, so it might be a better comparison to do hue grouping on a single axes.

ivirshup (Contributor) commented

This may be more specific to my use case (single-cell stuff), but when I have large samples I frequently also have many (>10) groups, where there are quickly diminishing returns from hues or stacked histograms. I do think hue grouping is very useful once I've found an interesting distribution pattern and want to show it, but that's not the exploratory phase, so the time to plot is less important to me.

That said, some timings:

In [14]: %%time
    ...: sns.displot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 19.4 s, sys: 877 ms, total: 20.3 s
Wall time: 19.6 s

In [15]: %%time
    ...: sns.displot(df, x="x", col="c")
    ...: plt.close()
CPU times: user 17.5 s, sys: 681 ms, total: 18.2 s
Wall time: 17.7 s

In [16]: %%time
    ...: sns.histplot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 14 s, sys: 541 ms, total: 14.6 s
Wall time: 14.4 s

The additional time for displot with hue seems to be coming mostly from adding the legend.


As a side note, I've noticed a fair amount of time being taken by iter_data. I think this could be cut down by limiting the number of times it's called (e.g. in some places it's only used to get the keys, while generating the values is the slow part), or by caching the results. Caching would probably be easiest, since all functions would benefit with no code needing to change, but of course it has higher total memory consumption. Not sure if this is a concern, as I typically assume a plotting library may make copies of passed data.
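
A rough sketch of the caching idea (illustrative only; iter_data is a generator in seaborn, so a cache would have to materialize its output, and the call arguments are assumed hashable here):

from functools import wraps

def cached_iter_data(method):
    # Memoize on the call arguments and replay the materialized list,
    # trading memory for the repeated grouping work.
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        store = self.__dict__.setdefault("_iter_data_cache", {})
        if key not in store:
            store[key] = list(method(self, *args, **kwargs))
        return iter(store[key])
    return wrapper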

mwaskom (Owner, Author) commented Apr 29, 2021

This may be more specific to my use-case (single cell stuff), but when I have large samples, I frequently also have many (>10) groups, where I feel like there are quickly diminishing returns from hues or stacked histograms

Agreed; my point was just that, in general, there's nothing displot can do about the subplot creation bottleneck (though if we could get some upstream improvements, that would be great), so removing it from the assessment of inefficiencies would be helpful.

Interesting that in this case, that's not what seems to dominate, though.
