use empirically found threshold to speed up issubset by creating Set … #26198
base/abstractset.jl
#sampling using these two methods.
lenthresh = 70

if rlen > lenthresh && !isa(r, Set)
It occurred to me that this should probably use `isa(r, AbstractSet)` (there are other kinds of sets, all of which should probably have efficient `in` methods).
Will do. I just learned that `IntSet` is implemented as a bit string.
Thanks for doing this benchmarking. Does the threshold depend on the element type? I would imagine that's the case. Could you document which type you used here? I guess
@nalimilan the benchmarks are done with strings, in particular strings of integers, e.g. I can see the constant-time lookup can benefit complex types like
We don't hash
OK. The length of the string will probably matter a lot too.
This plot shows the timing comparison between the hash-based `O(m + n)` `issubset` and the double for-loop `O(mn)` approach, where `m` and `n` are the lengths of the two collections. It reveals a threshold size at which the constant-time lookup of the hash set overtakes the overhead of creating that set.
The x-axis shows the number of elements in the collection to check membership against (the right hand side). The y-axis shows time in seconds.
The plot shows that hashing is needed for efficient performance on larger inputs. This addresses #24624.
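For readers skimming the thread, here is a minimal sketch of the idea under discussion. It is not the code from this PR: the function name `issubset_sketch` and the constant `LEN_THRESHOLD` are hypothetical, the threshold of 70 is the value reported for these string benchmarks, and the `AbstractSet` check follows the review suggestion rather than the original diff.

```julia
# Hypothetical sketch, not the implementation in base/abstractset.jl.
# The threshold is the empirical value reported in this PR; it may differ
# for other element types.
const LEN_THRESHOLD = 70

function issubset_sketch(l, r)
    # Build a hash set only when r is long enough for the construction cost
    # to pay off and r is not already some kind of set.
    haystack = (length(r) > LEN_THRESHOLD && !isa(r, AbstractSet)) ? Set(r) : r
    for x in l
        # `in` is a linear scan for arrays and a hash lookup for a Set.
        x in haystack || return false
    end
    return true
end

# Usage with strings of integers, as in the benchmarks described here.
large = string.(1:1000)
issubset_sketch(string.(1:10), large)  # true, takes the Set path
issubset_sketch(["0"], large)          # false
```

Below the threshold the function falls back to the plain nested scan, so small inputs avoid the cost of building a `Set`.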
Various timing experiments reveal the threshold size to be ~70. As shown: