Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use empirically found threshold to speed up issubset by creating Set … #26198

Merged
merged 2 commits into from
Mar 7, 2018
Merged

use empirically found threshold to speed up issubset by creating Set … #26198

merged 2 commits into from
Mar 7, 2018

Conversation

twistedcubic
Copy link
Contributor

This plot shows the timing comparison between the hash-based O(m + n) issubset, vs the double for-loop O(mn) approach, where m = length(n) and n = length(m).

This reveals a threshold size at which the constant-time lookup of the hashset overtakes the overhead imposed by creating that set.

The x-axis shows the number of elements in the collection to check membership against (the right hand side). The y-axis shows time in seconds.

issubsettiming5000

It shows the necessity to use hashing for efficient performance. This addresses #24624.

Various timing experiments reveal the threshold size to be ~70. As shown:

issubsettimingthreshold

#sampling using these two methods.
lenthresh = 70

if rlen > lenthresh && !isa(r, Set)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me that this should probably use isa(r, AbstractSet) (there are other kinds of sets, all of which should probably have efficient in methods).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. I just learned that IntSet is implemented as a bit string.

@nalimilan
Copy link
Member

Thanks for doing this benchmarking. Does the threshold depend on the element type? I would imagine that's the case. Could you document which type you used here? I guess Int? Complex types like String would likely benefit even more from using a Dict, meaning an even lower threshold could be chosen.

@twistedcubic
Copy link
Contributor Author

@nalimilan the benchmarks are done with strings, in particular strings of integers, e.g. "1", "2", ... . I will repeat the benchmark for different types to get a general feel, I agree the timing most likely depends on the type.

I can see the constant time lookup can benefit complex types like Strings even more, though in that case I believe constructing Sets for Strings also takes longer, since Int can be its own hash (I will double check with the source).

@JeffBezanson
Copy link
Member

We don't hash Ints to themselves, but computing that is still much faster than hashing a String.

@JeffBezanson JeffBezanson added performance Must go faster collections Data structures holding multiple items, e.g. sets labels Feb 27, 2018
@nalimilan
Copy link
Member

OK. The length of the string will probably matter a lot too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants