Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster intersection of Indices #69

Open
pevnak opened this issue Oct 6, 2021 · 7 comments
Open

Faster intersection of Indices #69

pevnak opened this issue Oct 6, 2021 · 7 comments

Comments

@pevnak
Copy link

pevnak commented Oct 6, 2021

I think that the intersection of Indices can be made trivially faster by ordering of arguments.

 a = distinct(rand(Int, 10000))
 b = distinct(rand(Int, 100))
julia> @btime intersect(a, b)
  134.240 μs (8 allocations: 284.55 KiB)
0-element Indices{Int64}

julia> @btime intersect(b, a)
  1.238 μs (5 allocations: 3.94 KiB)
0-element Indices{Int64}

julia> @btime length(a) > length(b) ? intersect(b, a) : intersect(a, b)
  1.283 μs (6 allocations: 3.95 KiB)
0-element Indices{Int64}

Would it make sense?

@pevnak
Copy link
Author

pevnak commented Oct 6, 2021

This did the job for me

function Base.intersect(s1::AbstractIndices, s2::AbstractIndices)
    length(s1) > length(s2) ? filter!(in(s2), copy(s1)) : filter!(in(s1), copy(s2))
end

@andyferris
Copy link
Owner

andyferris commented Oct 6, 2021

Thanks @pevnak

One issue I had was trying to get the order of elements to come out the same either way. I don’t know if that is important, but so far I’ve been conservative with ordering issues.

Not sure what other people’s opinions are?

@pevnak
Copy link
Author

pevnak commented Oct 7, 2021

I cannot judge that. I am usually concerned with speed.

@pevnak
Copy link
Author

pevnak commented Oct 7, 2021

Since there is a stable_sort and sort, would it make sense to have something similar? I apologize for abusing the comments, but the difference in speed I am getting in my application is enormous.

I guess it might be possible to implement filtering for the opposite case, but it might be awful. One would first invalidate everything and then turn on only items that are in the other set. In this way, I guess, the iteration might go over the other set. I guess you got my idea, but not sure if it something that should be done.

@andyferris
Copy link
Owner

andyferris commented Oct 7, 2021

Yes perhaps.

I apologize for abusing the comments, but the difference in speed I am getting in my application is enormous.

There's no abuse here :) I, too, take performance seriously - we'll get this sorted.

One consideration might be, when length(s1) > length(s2), to record the tokens in s1, and sort them and give that back.

@andyferris
Copy link
Owner

Another complication is the fact that either collection might be SizeUnknown (i.e. a FilteredIndices) so even length can be expensive.

@pevnak
Copy link
Author

pevnak commented Oct 7, 2021

But it can be narrowed for types where this will be performant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants