use empirically found threshold to speed up issubset by creating Set … #26198

twistedcubic · 2018-02-25T00:34:21Z

This plot compares the running time of the hash-based O(m + n) issubset with the double for-loop O(mn) approach, where m and n are the lengths of the two collections.

This reveals a threshold size at which the constant-time lookup of the hashset overtakes the overhead imposed by creating that set.
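For readers following along, the two strategies being compared can be sketched roughly as below. This is only an illustration: the names `issubset_scan` and `issubset_hashed` are hypothetical and neither function is the code in Base.

```julia
# Hypothetical helper names; neither function is the Base implementation.

# O(m*n): for each element of l, scan r linearly until a match is found.
function issubset_scan(l, r)
    for x in l
        found = false
        for y in r
            if isequal(x, y)
                found = true
                break
            end
        end
        found || return false
    end
    return true
end

# O(m + n): pay once to build a Set from r, then do hashed lookups.
function issubset_hashed(l, r)
    rset = Set(r)
    for x in l
        x in rset || return false
    end
    return true
end
```

For a small `r`, the cost of `Set(r)` can exceed the cost of a few short scans, which is why a crossover size exists at all.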

The x-axis shows the number of elements in the collection to check membership against (the right hand side). The y-axis shows time in seconds.

It shows that hashing is necessary for good performance once the collection is large enough. This addresses #24624.

Various timing experiments reveal the threshold size to be ~70. As shown:

JeffBezanson · 2018-02-27T22:21:31Z

base/abstractset.jl

+    #sampling using these two methods.
+    lenthresh = 70
+
+    if rlen > lenthresh && !isa(r, Set)


It occurred to me that this should probably use `isa(r, AbstractSet)` (there are other kinds of sets, all of which should probably have efficient `in` methods).

Will do. I just learned that IntSet is implemented as a bit string.
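A minimal sketch of the thresholded check with the suggested `AbstractSet` test might look like the following. The value 70 and the `isa(r, AbstractSet)` condition come from this thread; the function name `issubset_threshold`, the keyword argument, and the assumption that `r` supports `length` are illustrative only and are not the merged code.

```julia
# Hypothetical sketch, not the merged implementation.
# Assumes r supports length(); lenthresh = 70 is the value from the diff above.
function issubset_threshold(l, r; lenthresh = 70)
    # Build a Set only when r is large and is not already some kind of set,
    # since any AbstractSet is expected to have an efficient `in`.
    haystack = (length(r) > lenthresh && !isa(r, AbstractSet)) ? Set(r) : r
    for x in l
        x in haystack || return false
    end
    return true
end
```

Checking against `AbstractSet` rather than `Set` avoids copying a collection that already offers fast membership tests.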

nalimilan · 2018-02-27T22:42:48Z

Thanks for doing this benchmarking. Does the threshold depend on the element type? I would imagine that's the case. Could you document which type you used here? I guess Int? Complex types like String would likely benefit even more from using a Dict, meaning an even lower threshold could be chosen.

twistedcubic · 2018-02-27T23:15:45Z

@nalimilan the benchmarks are done with strings, in particular strings of integers, e.g. "1", "2", ... . I will repeat the benchmark for different types to get a general feel, I agree the timing most likely depends on the type.
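One rough way to repeat the comparison for different element types, using only Base, is sketched below. The names `scan_time`, `hash_time` and `compare` are hypothetical, and `@elapsed` with a minimum over repetitions is a crude stand-in for a proper benchmarking package; the numbers it produces depend on the machine and Julia version.

```julia
# Rough timing harness; all names here are hypothetical.

# Time membership checks by linear scan over the Vector r.
scan_time(l, r) = @elapsed all(x -> x in r, l)

# Time building a Set from r plus hashed membership checks.
hash_time(l, r) = @elapsed (s = Set(r); all(x -> x in s, l))

# `make` maps an index to an element, e.g. identity for Int or string for "1", "2", ...
# Returns a vector of (n, scan seconds, hash seconds) tuples.
function compare(make, sizes; reps = 100)
    results = Tuple{Int,Float64,Float64}[]
    for n in sizes
        r = [make(i) for i in 1:n]
        l = r                              # every element of l is present in r
        scan_time(l, r); hash_time(l, r)   # warm up so compilation is not timed
        ts = minimum(scan_time(l, r) for _ in 1:reps)
        th = minimum(hash_time(l, r) for _ in 1:reps)
        push!(results, (n, ts, th))
    end
    return results
end

int_results = compare(identity, (10, 50, 100, 200))   # Int elements
str_results = compare(string, (10, 50, 100, 200))     # String elements
```

Comparing the second and third entries of each tuple across sizes gives a feel for where the crossover falls for a given element type.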

I can see the constant time lookup can benefit complex types like Strings even more, though in that case I believe constructing Sets for Strings also takes longer, since Int can be its own hash (I will double check with the source).

JeffBezanson · 2018-02-27T23:21:41Z

We don't hash Ints to themselves, but computing that is still much faster than hashing a String.

nalimilan · 2018-02-28T10:42:38Z

OK. The length of the string will probably matter a lot too.

actually resolving the merge conflict correctly

6764732

JeffBezanson reviewed Feb 27, 2018


JeffBezanson added the performance and collections labels Feb 27, 2018

update Set to AbstractSet to cover all Sets, per Jeff's suggestion

2ea5fa6

JeffBezanson merged commit 4acc345 into JuliaLang:master Mar 7, 2018

KristofferC mentioned this pull request Apr 5, 2018

Poor performance of intersect(Vector{Int}, Vector{Int}) #13675

Closed

KristofferC mentioned this pull request May 8, 2018

Benchmarks vs 0.6 in prep for 0.7 release [do not merge] #27030

Closed

timholy mentioned this pull request Aug 24, 2018

issubset aces Discrete Math but flunks Topology #28871

Merged

mcabbott mentioned this pull request Jul 1, 2019

efficient subset query for ranges #32461

Closed

bermani mentioned this pull request Jul 12, 2019

issetequal behavior with duplicate elements #32550

Closed

lgoettgens mentioned this pull request Jul 31, 2021

issetequal and issubset being weird with custom == on structs #41748

Closed
