
speed up logical indexing by bitarray #29746

Merged
merged 3 commits into JuliaLang:master from bitarray_logindex on Oct 24, 2018

Conversation

chethega
Contributor

@chethega chethega commented Oct 20, 2018

The recent PR #29660 indicated that it is possible to significantly speed up logical indexing by bitarrays. So I consulted my Agner Fog, and wrote something faster. This will need benchmarking on a variety of architectures, since it depends on the BLSR and TZCNT instructions. The timings are on a 2 GHz Broadwell, so multiply by two to get cycles per selected index.

Setup:

julia> using BenchmarkTools, Random, Statistics
julia> function bench_getindex(d)
           for r in [0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.99]
               m = d.<r 
               b = @benchmark(getindex($d,$m))
               bt = median(b.times)
               r = count(m)/length(d)
               println("logical indexing with $(r)=$(count(m))/$(length(d)) indices selected:")
               println("\t $(bt/1000) us = $(bt/count(m)) ns per selected index.\n")
           end
           nothing
       end;
julia> Random.seed!(23);
julia> d=rand(101*101);

Before:

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 10.213 us = 113.47777777777777 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 14.399 us = 31.032327586206897 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 18.122 us = 18.759834368530022 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 26.742 us = 13.706817016914403 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 59.125 us = 11.946857951101233 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 35.496 us = 4.38655462184874 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 28.356 us = 3.1044449310269324 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 25.62 us = 2.656298600311042 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 22.839 us = 2.267123287671233 ns per selected index.

After:

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 1.0089285714285714 us = 11.21031746031746 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 2.0676 us = 4.4560344827586205 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 2.9522222222222223 us = 3.056130664826317 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 5.131833333333333 us = 2.630360498889458 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 9.664 us = 1.952717720751667 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 13.934 us = 1.7219476025704399 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 15.2875 us = 1.673691701335669 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 16.04 us = 1.663037843442198 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 16.678 us = 1.6555489378598371 ns per selected index.

The main difference from my proposal in #29660 is the following: we need one TZCNT per index, which has three cycles of latency. We can avoid long dependency chains on TZCNT by using BLSR instead, which has one cycle of latency. Without this trick, the loop slows down significantly. I still don't get how this adds up to 3-5 cycles per index, but that's as fast as I could make it.

The main difference from the original code is that we replace hard-to-predict branches with dependency chains.
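To make the bit trick concrete, here is a minimal sketch (the helper names are mine, not from the PR) of the TZCNT/BLSR pattern: `trailing_zeros` finds the next set bit, and `x & (x - 1)` retires it, so successive iterations depend on each other only through the one-cycle BLSR:

```julia
# x & (x - 1) clears the lowest set bit; on x86 with BMI1 this compiles to BLSR.
blsr(x::UInt64) = x & (x - 1)

# Collect the 1-based positions of the set bits in one 64-bit chunk:
# trailing_zeros (TZCNT) locates a bit, blsr (BLSR) removes it.
function set_bit_positions(c::UInt64)
    out = Int[]
    while c != 0
        push!(out, trailing_zeros(c) + 1)
        c = blsr(c)
    end
    return out
end

set_bit_positions(UInt64(0b101001))  # returns [1, 4, 6]
```

The same structure, carried across chunks and offset by the chunk index, is what the PR's `iterate` method does.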

@chethega chethega force-pushed the bitarray_logindex branch 2 times, most recently from 1f81b7b to 17fd485 on October 20, 2018 23:33
@chethega
Contributor Author

So if anyone wants to benchmark alternatives on other architectures:

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s=1)
           Bc = L.mask.chunks
           i1,i2 = Base.get_chunks_id(s)
           while i1 <= length(Bc)
               @inbounds v = Bc[i1]>>i2
               tz = trailing_zeros(v)
               (tz != 64) && return (s+tz, s+tz+1)
               s = i1<<6 + 1
               i1 += 1
               i2 = 0
           end
           return nothing
       end

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 0.7331969696969698 us = 8.146632996632997 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 3.132625 us = 6.751346982758621 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 6.1372 us = 6.353209109730849 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 12.173 us = 6.239364428498206 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 28.035 us = 5.664780763790665 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 44.442 us = 5.49209095402867 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 49.742 us = 5.445806875410554 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 52.099 us = 5.4016588906169 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 54.266 us = 5.386738137780425 ns per selected index.

and

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
           L.sum == 0 && return nothing
           Bc = L.mask.chunks
           return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
       end

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, sc)
               s, v = sc
               tz = trailing_zeros(v)
               (tz != 64) && return (s+tz, (s+tz+1, v>> (tz+1)))
               Bc = L.mask.chunks

               i1,i2 = Base.get_chunks_id(s)
               while i1 <= length(Bc)
                   @inbounds v = Bc[i1]>>i2
                   tz = trailing_zeros(v)
                   (tz != 64) && return (s+tz, (s+tz+1, v>>(tz+1)))
                   s = i1<<6 + 1
                   i1 += 1
                   i2 = 0
               end
               return nothing
           end

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 1.4456 us = 16.06222222222222 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 4.510714285714285 us = 9.721366995073891 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 8.08475 us = 8.369306418219463 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 13.423 us = 6.880061506919528 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 24.435 us = 4.93736108304708 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 36.669 us = 4.531512605042017 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 40.588 us = 4.443617254215021 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 42.436 us = 4.399792638672888 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 44.131 us = 4.380682946198134 ns per selected index.

Ultimately, this will probably be a tradeoff between which architectures get the better code, as well as whether dense or sparse selection is optimized. The version in the PR has the advantage of being a strict improvement over the old code (on my machine). I have no idea what is fast on ARM, and Agner Fog's table suggests that BLSR is no good on AMD.

@chethega
Contributor Author

chethega commented Oct 22, 2018

@mbauman Can you trigger a nanosoldier run?

I think the Travis CI failure is Travis's problem, not this PR's. Unless there are change requests, I'm done with this PR, which supersedes #29660. The collect->findall specialization is deliberately missing because the code here turns out to be significantly faster than findall. I don't know how to make findall fast without calling a slow linear->cartesian conversion, so that will be a separate PR unless someone comes up with an idea.

PS: Sorry that this comment showed up multiple times. No idea why GitHub or my browser messed this up.

@fredrikekre fredrikekre added the performance (Must go faster) and arrays labels on Oct 22, 2018
@fredrikekre fredrikekre requested a review from mbauman October 22, 2018 10:30
@mbauman
Member

mbauman commented Oct 22, 2018

This is awesome, thank you!

@nanosoldier runbenchmarks(ALL, vs = ":master")

Bc = L.mask.chunks
i1, c = s
while c==0
i1 == length(Bc) && return nothing
Member

Instead of @propagate_inbounds, it'd be a bit nicer to do i1 >= length(Bc) (or even that unsigned trick to account for negative i1s). Then we could just always use @inbounds Bc[i1].

Contributor Author

I am not entirely sure about the desired safety here. If the iteration interface is used correctly, this will always be inbounds. If we get passed an OOB state, do we need to raise a correct error, are we allowed to return garbage (as long as no OOB read occurs), or are we allowed to perform an OOB read (possibly with a segfault) and then return garbage?

The TEST instead of CMP was actually cribbed from the iteration code for vectors. Benchmarks and Agner Fog agree that there is no price difference between these instructions. I could change to i1 >= length(Bc), with the disadvantage that misuse silently returns nothing instead of throwing an OOB error under julia --check-bounds=yes, and the advantage that we won't read OOB on most misuses, even under julia --check-bounds=no. I think we should prefer the OOB error?

For some reason the UInt version is much slower. I tried the following, with no success:

@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (UInt(0), @inbounds Bc[1]))
end


@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s::Tuple{UInt,UInt64})
    Bc = L.mask.chunks
    i1, c = s
    while c==0 
        i1 += 1
        i1 < length(Bc) || return nothing        
        @inbounds c = Bc[i1+1] 
    end
    tz = trailing_zeros(c)+1
    c = _BLSR(c)
    return (i1<<6 +tz, (i1, c))
end

Member

with the disadvantage that misuse silently returns nothing instead of throwing an OOB error

julia> iterate([1], 2)

julia> iterate((1,), 2)

totally fine to return nothing.

Contributor Author

Ok, I got a fast UInt version now. The only question is our preferred error behavior: nothing, versus an OOB error under check-bounds and an OOB read (with possible segfault) when given an invalid state under @inbounds. I still prefer the OOB error, tbh. Do we have a policy for that kind of question?

@inline _BLSR(x)= x & (x-1) 
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
end

@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
    Bc = L.mask.chunks
    i1, c = s
    while c==0 
        i1%UInt >= length(Bc)%UInt && return nothing
        i1 += 1
        @inbounds c = Bc[i1] 
    end
    tz = trailing_zeros(c)+1
    c = _BLSR(c)
    return ((i1-1)<<6 +tz, (i1, c))
end
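As an aside on the `i1%UInt >= length(Bc)%UInt` guard above: reinterpreting a possibly negative `Int` as `UInt` maps negative values to huge unsigned ones, so a single unsigned comparison rules out both negative and too-large states. A small illustration (the helper name is hypothetical, not part of the PR):

```julia
# One unsigned comparison replaces two signed ones: i % UInt maps any
# negative i to a value >= 2^63, which always fails the < n test.
in_range(i::Int, n::Int) = (i - 1) % UInt < n % UInt  # true iff 1 <= i <= n

in_range(3, 10)   # true
in_range(0, 10)   # false: (0 - 1) % UInt wraps to typemax(UInt)
in_range(11, 10)  # false
```

This is the same shape as the bounds check in the `iterate(A::Array, i)` method quoted further down.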

Member

"When in doubt, do as the Arrays".

So, couldn't just something like

iterate(A::Array, i=1) = (@_inline_meta; (i % UInt) - 1 < length(A) ? (@inbounds A[i], i + 1) : nothing)

be used?

Contributor Author

Arrays currently give nothing for OOB states, and ranges silently return garbage:

julia> iterate(1:10, 17)
(18, 18)
julia> iterate(collect(1:10), 17) === nothing
true

In view of that, I guess it would be consistent enough to return nothing and reap the speedup when there is no @inbounds annotation. So I'll change that.

end
tz = trailing_zeros(c) + 1
c = _BLSR(c)
Member

This is a really clever idea. Nice work.

base/bitarray.jl Outdated
@@ -84,6 +84,7 @@ IndexStyle(::Type{<:BitArray}) = IndexLinear()
const _msk64 = ~UInt64(0)
@inline _div64(l) = l >> 6
@inline _mod64(l) = l & 63
@inline _BLSR(x)= x & (x-1) #zeros the last set bit. Has native instruction on many archs. needed in multidimensional.jl
Member

The pickiest possible nit: maybe _blsr to match the surrounding capitalization?

Contributor Author

Amended.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman
Member

mbauman commented Oct 23, 2018

Hm, strange that this seems to show both time and memory regressions in Nanosoldier. They seem to be real:

julia> function perf_sumlogical(A)
           s = zero(eltype(A))
           nrows = size(A, 1)
           ncols = size(A, 2)
           r = falses(nrows)
           r[1:4:end] .= true
           @simd for i = 1:ncols
               val = @inbounds A[r, i]
               s += first(val)
           end
           return s
       end
perf_sumlogical (generic function with 1 method)

julia> using BenchmarkTools

julia> A = rand(100,100);

julia> @btime perf_sumlogical($A);
  24.910 μs (203 allocations: 31.44 KiB)

julia> @eval Base begin
       @inline _blsr(x)= x & (x-1)
       @inline function iterate(L::Base.LogicalIndex{Int,<:BitArray})
           L.sum == 0 && return nothing
           Bc = L.mask.chunks
           return iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
       end
       @inline function iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
           Bc = L.mask.chunks
           i1, c = s
           while c==0
               i1 % UInt >= length(Bc) % UInt && return nothing
               i1 += 1
               @inbounds c = Bc[i1]
           end
           tz = trailing_zeros(c) + 1
           c = _blsr(c)
           return ((i1-1)<<6 + tz, (i1, c))
       end
       end
iterate (generic function with 201 methods)

julia> @btime perf_sumlogical($A);
  86.771 μs (603 allocations: 51.75 KiB)

The culprit seems to be the type assertion I've tagged below. Remove it and:

julia> @btime perf_sumlogical($A);
  14.063 μs (203 allocations: 31.44 KiB)

base/multidimensional.jl Outdated
@mbauman
Member

mbauman commented Oct 24, 2018

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman mbauman merged commit 44f2563 into JuliaLang:master Oct 24, 2018
KristofferC pushed a commit that referenced this pull request Oct 29, 2018
* speed up logical indexing by bitarray

* changed inbounds / OOB behavior to match array iterators

switched spelling of _blsr

* Update base/multidimensional.jl

Co-Authored-By: chethega <[email protected]>
(cherry picked from commit 44f2563)
KristofferC pushed the same backport commit (cherry picked from 44f2563) referencing this pull request again on Oct 31, 2018, Nov 2, 2018, Feb 11, 2019, and Feb 20, 2020.
Labels: arrays, performance (Must go faster)
Projects: None yet
6 participants