
Histogram computation is inefficient with large samples #2550

Closed · mwaskom opened this issue Apr 14, 2021 · 6 comments · Fixed by #2570

mwaskom (Owner) commented Apr 14, 2021

seaborn always creates an array of bin edges and then passes that to numpy when computing the histogram:

bin_edges = self.bin_edges
if bin_edges is None:
    bin_edges = self.define_bin_edges(x, weights=weights, cache=False)

density = self.stat == "density"
hist, _ = np.histogram(
    x, bin_edges, weights=weights, density=density,
)

It does this to simplify the code, since there are many situations where multiple histograms need to be evaluated with the same bins.

But I've only just learned about a quirk of numpy's implementation: it uses an efficient algorithm when the bins are uniform, but not when it is handed an explicit array of bin edges derived from a uniform binning rule (those edges aren't exactly uniform due to floating point error): numpy/numpy#14602

As a result, there's a substantial timing/scaling difference between two seemingly similar patterns:

x = np.random.normal(0, 1, size=int(5e7))

%time np.histogram(x, bins=30)
CPU times: user 551 ms, sys: 11.9 ms, total: 563 ms

bins = np.histogram_bin_edges(x, 30)
%time np.histogram(x, bins=bins)
CPU times: user 3.27 s, sys: 33.1 ms, total: 3.31 s

With another order of magnitude more data, the second call runs for minutes.

It would be good to avoid this in seaborn where possible. To get common bins for multiple calls, we could store the number of bins and the bin range:

n_bins = len(bins) - 1
bin_range = bins[[0, -1]]
%time np.histogram(x, bins=n_bins, range=bin_range)
CPU times: user 463 ms, sys: 6.22 ms, total: 469 ms

There may be some subtleties in terms of edge effects to bear in mind, and it will add complexity, as there are situations where it will be necessary to track and use the full array of bin edges. (It does seem fine to supply both an array of bins and a range, but it looks like the latter is silently ignored.)
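
A minimal sketch of the idea, using a hypothetical helper (not seaborn's actual API) that caches the bin count and range rather than the edge array:

import numpy as np

def define_hist_params(x, bins=30):
    # Hypothetical helper: derive the binning once, but keep only the
    # count and range so repeated calls stay on numpy's fast uniform path.
    edges = np.histogram_bin_edges(x, bins)
    return len(edges) - 1, (edges[0], edges[-1])

x = np.random.normal(0, 1, size=int(5e7))
n_bins, bin_range = define_hist_params(x)

# Each histogram over the shared binning now avoids the slow path that
# an explicit array of edges would trigger:
hist, _ = np.histogram(x, bins=n_bins, range=bin_range)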

jhncls commented Apr 14, 2021

Maybe the number of bins and the bin range could also be recomputed for user-supplied bins, on the condition that np.diff(bins) gives almost-equal values? (Some additional care is needed to make sure the highest element falls into the last bin without influencing the other bin edges, and log scale might need to be looked at.) Or maybe one day this could happen automatically inside np.histogram?
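
For illustration, a rough version of that uniformity check (an assumption, not proposed seaborn code) could look like:

import numpy as np

def is_nearly_uniform(bins, rtol=1e-7):
    # True if user-supplied edges are close enough to evenly spaced that
    # they could be replaced by a (count, range) pair.
    widths = np.diff(bins)
    return np.allclose(widths, widths[0], rtol=rtol)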

mwaskom (Owner, Author) commented Apr 15, 2021

I don't think anything special should be done in the case where the user supplies an array. There's some discussion of this in the numpy issue. If they ultimately merge a patch that handles that case more efficiently, it will be good for seaborn users. But I just want to make sure that cases where uniform bins are inferred (from a reference rule, number of bins, or given binwidth) will be handled with efficient computation.
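
For what it's worth, each of those inferred-uniform cases reduces to a bin count plus a range; a sketch with illustrative names (these are not seaborn's internals):

import numpy as np

def uniform_params(x, bins="auto", binwidth=None):
    lo, hi = float(np.min(x)), float(np.max(x))
    if binwidth is not None:
        # Given binwidth: round the count up so the full range is covered.
        n = max(int(np.ceil((hi - lo) / binwidth)), 1)
    elif isinstance(bins, str):
        # Reference rule, e.g. "auto" or "fd".
        n = len(np.histogram_bin_edges(x, bins)) - 1
    else:
        # Explicit number of bins.
        n = bins
    return n, (lo, hi)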

ivirshup (Contributor) commented Apr 28, 2021

Responding to #2555 (comment) here, I took a look at the contribution of histogram computation vs Matplotlib methods. It looks to me like histogram calculation contributes a lot when you're looking at a single plot, but matplotlib quickly dominates once faceting is involved. I got all of these proportions from looking at the profile of the histplot and displot calls using snakeviz.

Looking at some profiling, I think histogram calculation is only going to be a large contributor to plot time in a few circumstances.

Largest fraction (50% of plot time, 6.5 seconds total):

sns.histplot(data=pd.DataFrame({"x": x}), x="x")

Surprisingly small fraction (7%, 25 seconds total):

sns.histplot(x)

But this looks like it's due to an issue in the handling of wide-form data (_assign_variables_wideform), where isinstance appears to be called on every element of x. Should be easy to fix.

Relative time spent decreases with faceting

5 facets

Here histogram computation takes 13-18% (it shows up in multiple calls: 13% in one, 5% in another), and 30% of the time is spent in ax.bar. ~19 seconds total.

df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(5, size=x.shape))})

sns.displot(data=df, x="x", col="c")

10 facets

Down to ~10% of the time spent on the histogram; ~25 seconds total.

df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(10, size=x.shape))})

sns.displot(data=df, x="x", col="c", col_wrap=5)
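
For reference, proportions like these can be reproduced with the snakeviz IPython magic mentioned above (the setup mirrors the snippets in this comment):

import numpy as np, pandas as pd, seaborn as sns

x = np.random.normal(0, 1, size=int(5e7))
df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(5, size=x.shape))})

%load_ext snakeviz
%snakeviz sns.displot(data=df, x="x", col="c")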

mwaskom (Owner, Author) commented Apr 28, 2021

Thanks for this! One thought: matplotlib subplot creation is often a bottleneck for small multiples, so it might be a better comparison to do hue grouping on a single axes.

ivirshup (Contributor) commented

This may be more specific to my use case (single-cell stuff), but when I have large samples I frequently also have many (>10) groups, where there are quickly diminishing returns from hues or stacked histograms. I do think hue grouping is very useful once I've found an interesting distribution pattern and want to show it, but that's not the exploratory phase, so the time to plot is less important to me.

That said, some timings:

In [14]: %%time
    ...: sns.displot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 19.4 s, sys: 877 ms, total: 20.3 s
Wall time: 19.6 s

In [15]: %%time
    ...: sns.displot(df, x="x", col="c")
    ...: plt.close()
CPU times: user 17.5 s, sys: 681 ms, total: 18.2 s
Wall time: 17.7 s

In [16]: %%time
    ...: sns.histplot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 14 s, sys: 541 ms, total: 14.6 s
Wall time: 14.4 s

The additional time for displot with hue seems to be coming mostly from adding the legend.


As a side note, I've noticed a fair amount of time being taken by iter_data. I think this could be cut down by limiting the number of times it's called (e.g. in some places it's only used to get the keys, while generating the values is the slow part), or by caching the results. Caching would probably be easiest, since all functions would benefit with no code needing to change, but of course it has higher total memory consumption. Not sure if this is a concern, as I typically assume a plotting library may make copies of passed data.
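
A rough sketch of the caching idea (illustrative only; iter_data is a generator in seaborn, so a cache would have to materialize its output, and the call arguments are assumed hashable here):

from functools import wraps

def cached_iter_data(method):
    # Memoize on the call arguments and replay the materialized list,
    # trading memory for the repeated grouping work.
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        store = self.__dict__.setdefault("_iter_data_cache", {})
        if key not in store:
            store[key] = list(method(self, *args, **kwargs))
        return iter(store[key])
    return wrapper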

mwaskom (Owner, Author) commented Apr 29, 2021

This may be more specific to my use-case (single cell stuff), but when I have large samples, I frequently also have many (>10) groups, where I feel like there are quickly diminishing returns from hues or stacked histograms

Agreed; my point was just that, in general, there's nothing displot can do about the subplot creation bottleneck (though if we could get some upstream improvements, that would be great), so removing it from the assessment of inefficiencies would be helpful.

Interesting that in this case, that's not what seems to dominate, though.
