Histogram computation is inefficient with large samples #2550
Comments
Maybe the number of bins and bin_range could also be recomputed for user-supplied bins, on the condition that `np.diff(bins)` gives almost equal values? (Some additional care is needed to make sure the highest element falls into the last bin without influencing the other bin edges, and log-scale handling might need a look.) Or maybe one day this could happen automatically inside `np.histogram`. A sketch of that detection idea follows below.
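A minimal sketch of that detection, assuming plain numpy; the helper name and tolerance are hypothetical, not seaborn or numpy API:

```python
import numpy as np

def as_uniform_spec(bins, rtol=1e-8):
    # Hypothetical helper: if the supplied edges are (almost) evenly
    # spaced, return an equivalent (n_bins, range) pair that lets
    # np.histogram take its fast uniform-bin path; otherwise None.
    if len(bins) < 2:
        return None
    widths = np.diff(bins)
    if np.allclose(widths, widths[0], rtol=rtol):
        return len(bins) - 1, (bins[0], bins[-1])
    return None
```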
I don't think anything special should be done in the case where the user supplies an array. There's some discussion of this in the numpy issue; if they ultimately merge a patch that handles that case more efficiently, it will benefit seaborn users automatically. But I just want to make sure that cases where uniform bins are inferred (from a reference rule, a number of bins, or a given binwidth) are handled with efficient computation; see the sketch below.
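For instance, a given binwidth can be carried as an (n_bins, range) pair instead of an edge array; a rough sketch under that assumption (the helper is hypothetical):

```python
import numpy as np

def uniform_spec_from_binwidth(data, binwidth):
    # Hypothetical helper: turn a bin width into an (n_bins, range) pair
    # so the uniform structure stays explicit rather than being
    # materialized as an array of edges.
    start, stop = data.min(), data.max()
    n_bins = max(int(np.ceil((stop - start) / binwidth)), 1)
    return n_bins, (start, start + n_bins * binwidth)

# np.histogram(data, bins=n_bins, range=bin_range) then stays on the fast path.
```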
Responding to #2555 (comment) here, I took a look at the contribution of histogram computation vs. matplotlib methods. It looks to me like histogram calculation contributes a lot when you're looking at a single plot, but matplotlib quickly dominates once faceting is involved. I got all of these proportions from looking at the profiles of the calls below; I think the impact of histogram calculation is only going to be a large contributor to plot time in a few circumstances.

**Largest fraction (50% of plot time, 6.5 seconds total)**

```python
sns.histplot(data=pd.DataFrame({"x": x}), x="x")
```

**Surprisingly small fraction (7%, 25 seconds total)**

```python
sns.histplot(x)
```

But this looks like it's due to some issues with the handling of wide-form data (…).

**Relative time spent decreases with faceting**

*5 bins.* Here 13–18% (multiple calls: 13% one time, 5% another). Here 30% of the time is spent in …

```python
df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(5, size=x.shape))})
sns.displot(data=df, x="x", col="c")
```

*10 bins.* Down to ~10% of the time being spent on the histogram, ~25 seconds total:

```python
df = pd.DataFrame({"x": x, "c": pd.Categorical(np.random.randint(10, size=x.shape))})
sns.displot(data=df, x="x", col="c", col_wrap=5)
```
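Proportions like these come from profiles of the plotting calls; a minimal way to reproduce one (the sample size here is an assumption, not the original benchmark):

```python
import cProfile
import pstats

import numpy as np
import pandas as pd
import seaborn as sns

x = np.random.normal(size=10_000_000)  # large sample; exact size assumed

# Profile a single-plot call and list the heaviest cumulative contributors,
# which makes the histogram-vs-matplotlib split visible.
cProfile.run('sns.histplot(data=pd.DataFrame({"x": x}), x="x")', "histplot.prof")
pstats.Stats("histplot.prof").sort_stats("cumulative").print_stats(15)
```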
Thanks for this! One thought: matplotlib subplot creation is often a bottleneck for small multiples, so it might be a fairer comparison to do hue grouping on a single axes.
This may be more specific to my use case (single-cell stuff), but when I have large samples, I frequently also have many (>10) groups, where I feel like there are quickly diminishing returns from hues or stacked histograms. I do think the … That said, some timings:

```python
In [14]: %%time
    ...: sns.displot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 19.4 s, sys: 877 ms, total: 20.3 s
Wall time: 19.6 s

In [15]: %%time
    ...: sns.displot(df, x="x", col="c")
    ...: plt.close()
CPU times: user 17.5 s, sys: 681 ms, total: 18.2 s
Wall time: 17.7 s

In [16]: %%time
    ...: sns.histplot(df, x="x", hue="c")
    ...: plt.close()
CPU times: user 14 s, sys: 541 ms, total: 14.6 s
Wall time: 14.4 s
```

The additional time for … As a side note, I've noticed a fair amount of time being taken by …
Agreed; my point was just in general that there's nothing … Interesting that in this case, that's not what seems to dominate, though.
seaborn always creates an array of bin edges and then passes that to numpy when computing the histogram (see `seaborn/_statistics.py`, lines 354–361 at commit `66b4783`).
It does this to simplify the code, since there are many situations where multiple histograms need to be evaluated with the same bins.
But I've only just learned about a quirk of numpy's implementation: it uses an efficient algorithm for uniform histogram bins, but not when it is provided with an array of bin edges derived from a uniform binning rule (edges that aren't exactly uniform due to floating-point error); see numpy/numpy#14602.
As a result, there's a substantial timing/scaling difference between two seemingly similar patterns, sketched below.
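Roughly, and as an illustration only (the sample and sizes here are assumptions, not the original benchmark):

```python
import numpy as np

x = np.random.normal(size=10_000_000)

# Fast: numpy keeps the uniform structure and bins by index arithmetic.
np.histogram(x, bins=50)

# Slow: an explicit edge array (even one produced by a uniform rule)
# sends numpy down a per-element search over the edges.
np.histogram(x, bins=np.histogram_bin_edges(x, bins=50))
```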
With another order of magnitude, the second call will run for minutes.
It would be good to avoid this in seaborn where possible. To get common bins for multiple calls, we could store the number of bins and the bin range, along the lines of the sketch below.
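A hedged sketch of that idea against the plain numpy API; the class here is hypothetical, not seaborn's actual histogram machinery:

```python
import numpy as np

class UniformBins:
    # Hypothetical container: remember (n_bins, range) once, then reuse it,
    # so every np.histogram call stays on the fast uniform-bin path.
    def __init__(self, data, n_bins):
        self.n_bins = n_bins
        self.range = (data.min(), data.max())

    def histogram(self, data):
        return np.histogram(data, bins=self.n_bins, range=self.range)

# Usage: define bins from the full sample, then evaluate per-subset
# histograms (e.g. hue/facet groups) against the same bins.
x = np.random.normal(size=1_000_000)
spec = UniformBins(x, n_bins=50)
counts, edges = spec.histogram(x[:500_000])
```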
There may be some subtleties in terms of edge effects to bear in mind, and it will add complexity, as there are situations where it will be necessary to track/use the full array of bin edges. (It does seem fine to supply both `bins` and `range`, but it looks like the latter is silently ignored; a quick check follows below.)
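A quick check of that last point, assuming current numpy semantics (when `bins` is an edge array, `range` appears to have no effect):

```python
import numpy as np

x = np.random.normal(size=1_000)
edges = np.linspace(-3, 3, 11)

# With an explicit edge array, range is silently ignored:
with_range, _ = np.histogram(x, bins=edges, range=(-1, 1))
without_range, _ = np.histogram(x, bins=edges)
assert (with_range == without_range).all()
```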