
speed up logical indexing by bitarray #29746

Merged
merged 3 commits into JuliaLang:master from bitarray_logindex on Oct 24, 2018

Conversation

chethega
Contributor

@chethega chethega commented Oct 20, 2018

The recent PR #29660 indicated that it is possible to significantly speed up logical indexing by bitarrays. So I consulted my Agner Fog, and wrote something faster. This will need benchmarking on a variety of architectures, since it depends on the BLSR and TZCNT instructions. The timings are on a 2 GHz Broadwell, so multiply by two to get cycles per selected index.

Setup:

julia> using BenchmarkTools, Random, Statistics
julia> function bench_getindex(d)
           for r in [0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.99]
               m = d.<r 
               b = @benchmark(getindex($d,$m))
               bt = median(b.times)
               r = count(m)/length(d)
               println("logical indexing with $(r)=$(count(m))/$(length(d)) indices selected:")
               println("\t $(bt/1000) us = $(bt/count(m)) ns per selected index.\n")
           end
           nothing
       end;
julia> Random.seed!(23);
julia> d=rand(101*101);

Before:

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 10.213 us = 113.47777777777777 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 14.399 us = 31.032327586206897 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 18.122 us = 18.759834368530022 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 26.742 us = 13.706817016914403 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 59.125 us = 11.946857951101233 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 35.496 us = 4.38655462184874 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 28.356 us = 3.1044449310269324 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 25.62 us = 2.656298600311042 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 22.839 us = 2.267123287671233 ns per selected index.

After:

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 1.0089285714285714 us = 11.21031746031746 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 2.0676 us = 4.4560344827586205 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 2.9522222222222223 us = 3.056130664826317 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 5.131833333333333 us = 2.630360498889458 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 9.664 us = 1.952717720751667 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 13.934 us = 1.7219476025704399 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 15.2875 us = 1.673691701335669 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 16.04 us = 1.663037843442198 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 16.678 us = 1.6555489378598371 ns per selected index.

The main difference from my proposal in #29660 is the following: we need one TZCNT per index, which has three cycles of latency. We can avoid long dependency chains on TZCNT by using BLSR instead, which has one cycle of latency. Without this trick, the loop slows down significantly. I still don't get how this adds up to 3-5 cycles per index, but that's as fast as I could make it.

The main difference from the original code is that we replace hard-to-predict branches with dependency chains.
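To make the bit trick concrete, here is a minimal sketch (the helper names are mine, not from the PR) of the TZCNT/BLSR pattern: `trailing_zeros` finds the next set bit, and `x & (x - 1)` retires it, so successive iterations depend on each other only through the one-cycle BLSR:

```julia
# x & (x - 1) clears the lowest set bit; on x86 with BMI1 this compiles to BLSR.
blsr(x::UInt64) = x & (x - 1)

# Collect the 1-based positions of the set bits in one 64-bit chunk:
# trailing_zeros (TZCNT) locates a bit, blsr (BLSR) removes it.
function set_bit_positions(c::UInt64)
    out = Int[]
    while c != 0
        push!(out, trailing_zeros(c) + 1)
        c = blsr(c)
    end
    return out
end

set_bit_positions(UInt64(0b101001))  # returns [1, 4, 6]
```

The same structure, carried across chunks and offset by the chunk index, is what the PR's `iterate` method does.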

@chethega chethega force-pushed the bitarray_logindex branch 2 times, most recently from 1f81b7b to 17fd485 on October 20, 2018 23:33
@chethega
Contributor Author

So if anyone wants to benchmark alternatives on other architectures:

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s=1)
           Bc = L.mask.chunks
           i1,i2 = Base.get_chunks_id(s)
           while i1 <= length(Bc)
               @inbounds v = Bc[i1]>>i2
               tz = trailing_zeros(v)
               (tz != 64) && return (s+tz, s+tz+1)
               s = i1<<6 + 1
               i1 += 1
               i2 = 0
           end
           return nothing
       end

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 0.7331969696969698 us = 8.146632996632997 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 3.132625 us = 6.751346982758621 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 6.1372 us = 6.353209109730849 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 12.173 us = 6.239364428498206 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 28.035 us = 5.664780763790665 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 44.442 us = 5.49209095402867 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 49.742 us = 5.445806875410554 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 52.099 us = 5.4016588906169 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 54.266 us = 5.386738137780425 ns per selected index.

and

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
           L.sum == 0 && return nothing
           Bc = L.mask.chunks
           return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
       end

julia> @inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, sc)
               s, v = sc
               tz = trailing_zeros(v)
               (tz != 64) && return (s+tz, (s+tz+1, v>> (tz+1)))
               Bc = L.mask.chunks

               i1,i2 = Base.get_chunks_id(s)
               while i1 <= length(Bc)
                   @inbounds v = Bc[i1]>>i2
                   tz = trailing_zeros(v)
                   (tz != 64) && return (s+tz, (s+tz+1, v>>(tz+1)))
                   s = i1<<6 + 1
                   i1 += 1
                   i2 = 0
               end
               return nothing
           end

julia> bench_getindex(d)
logical indexing with 0.008822664444662289=90/10201 indices selected:
	 1.4456 us = 16.06222222222222 ns per selected index.

logical indexing with 0.04548573669248113=464/10201 indices selected:
	 4.510714285714285 us = 9.721366995073891 ns per selected index.

logical indexing with 0.09469659837270855=966/10201 indices selected:
	 8.08475 us = 8.369306418219463 ns per selected index.

logical indexing with 0.19125575923929028=1951/10201 indices selected:
	 13.423 us = 6.880061506919528 ns per selected index.

logical indexing with 0.48514851485148514=4949/10201 indices selected:
	 24.435 us = 4.93736108304708 ns per selected index.

logical indexing with 0.7932555631800804=8092/10201 indices selected:
	 36.669 us = 4.531512605042017 ns per selected index.

logical indexing with 0.8954024115282816=9134/10201 indices selected:
	 40.588 us = 4.443617254215021 ns per selected index.

logical indexing with 0.9454955396529752=9645/10201 indices selected:
	 42.436 us = 4.399792638672888 ns per selected index.

logical indexing with 0.9875502401725321=10074/10201 indices selected:
	 44.131 us = 4.380682946198134 ns per selected index.

Ultimately, this will probably be a tradeoff between which architectures get the better code, as well as whether dense or sparse selection is optimized. The version in the PR has the advantage of being a strict improvement over the old code (on my machine). I have no idea what is fast on ARM, and Agner Fog's table suggests that BLSR is no good on AMD.

@chethega
Contributor Author

chethega commented Oct 22, 2018

@mbauman Can you trigger a nanosoldier run?

I think the Travis CI failure is Travis's problem, not this PR's. Unless there are change requests, I'm done with this PR, which supersedes #29660. The collect->findall specialization is deliberately missing because the code here turns out to be significantly faster than findall. I don't know how to make findall fast without calling a slow linear->cartesian conversion, so that will be a separate PR unless someone comes up with an idea.

PS: Sorry that this comment showed up multiple times. No idea why GitHub or my browser messed this up.

@fredrikekre fredrikekre added the performance (Must go faster) and arrays labels on Oct 22, 2018
@fredrikekre fredrikekre requested a review from mbauman October 22, 2018 10:30
@mbauman
Member

mbauman commented Oct 22, 2018

This is awesome, thank you!

@nanosoldier runbenchmarks(ALL, vs = ":master")

Bc = L.mask.chunks
i1, c = s
while c==0
i1 == length(Bc) && return nothing
Member

Instead of @propagate_inbounds, it'd be a bit nicer to do i1 >= length(Bc) (or even that unsigned trick to account for negative i1s). Then we could just always use @inbounds Bc[i1].

Contributor Author

I am not entirely sure about the desired safety here. If the iteration interface is used correctly, this will always be inbounds. If we get passed an OOB state, do we need to raise a correct error, are we allowed to return garbage (as long as no OOB read occurs), or are we allowed to perform an OOB read (possibly with a segfault) and then return garbage?

The TEST instead of CMP was actually cribbed from the iteration code for vectors. Benchmarks and Agner Fog agree that there is no price difference between these instructions. I could change to i1 >= length(Bc), with the disadvantage that misuse silently returns nothing instead of throwing an OOB error under julia --check-bounds=yes, and the advantage that we won't read OOB on most misuses, even under julia --check-bounds=no. I think we should prefer the OOB error?

For some reason the UInt version is much slower. I tried the following, with no success:

@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (UInt(0), @inbounds Bc[1]))
end


@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s::Tuple{UInt,UInt64})
    Bc = L.mask.chunks
    i1, c = s
    while c==0 
        i1 += 1
        i1 < length(Bc) || return nothing        
        @inbounds c = Bc[i1+1] 
    end
    tz = trailing_zeros(c)+1
    c = _BLSR(c)
    return (i1<<6 +tz, (i1, c))
end

Member

with the disadvantage that misuse silently returns nothing instead of throwing an OOB error

julia> iterate([1], 2)

julia> iterate((1,), 2)

totally fine to return nothing.

Contributor Author

Ok, I got a fast UInt version now. The only question is our preferred error behavior: nothing, versus an OOB error under check-bounds and an OOB read (with possible segfault) when given an invalid state under @inbounds. I still prefer the OOB error, tbh. Do we have a policy for that kind of question?

@inline _BLSR(x)= x & (x-1) 
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
    L.sum == 0 && return nothing
    Bc = L.mask.chunks
    return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
end

@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
    Bc = L.mask.chunks
    i1, c = s
    while c==0 
        i1%UInt >= length(Bc)%UInt && return nothing
        i1 += 1
        @inbounds c = Bc[i1] 
    end
    tz = trailing_zeros(c)+1
    c = _BLSR(c)
    return ((i1-1)<<6 +tz, (i1, c))
end
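As an aside on the `i1%UInt >= length(Bc)%UInt` guard above: reinterpreting a possibly negative `Int` as `UInt` maps negative values to huge unsigned ones, so a single unsigned comparison rules out both negative and too-large states. A small illustration (the helper name is hypothetical, not part of the PR):

```julia
# One unsigned comparison replaces two signed ones: i % UInt maps any
# negative i to a value >= 2^63, which always fails the < n test.
in_range(i::Int, n::Int) = (i - 1) % UInt < n % UInt  # true iff 1 <= i <= n

in_range(3, 10)   # true
in_range(0, 10)   # false: (0 - 1) % UInt wraps to typemax(UInt)
in_range(11, 10)  # false
```

This is the same shape as the bounds check in the `iterate(A::Array, i)` method quoted further down.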

Member

"When in doubt, do as the Arrays".

So, couldn't just something like

iterate(A::Array, i=1) = (@_inline_meta; (i % UInt) - 1 < length(A) ? (@inbounds A[i], i + 1) : nothing)

be used?

Contributor Author

Arrays currently give nothing for OOB states, and ranges silently return garbage:

julia> iterate(1:10, 17)
(18, 18)
julia> iterate(collect(1:10), 17) === nothing
true

In view of that, I guess it would be consistent enough to return nothing and reap the speedup when there is no @inbounds annotation. So I'll change that.

end
tz = trailing_zeros(c) + 1
c = _BLSR(c)
Member

This is a really clever idea. Nice work.

base/bitarray.jl Outdated
@@ -84,6 +84,7 @@ IndexStyle(::Type{<:BitArray}) = IndexLinear()
const _msk64 = ~UInt64(0)
@inline _div64(l) = l >> 6
@inline _mod64(l) = l & 63
@inline _BLSR(x)= x & (x-1) #zeros the last set bit. Has native instruction on many archs. needed in multidimensional.jl
Member

The pickiest possible nit: maybe _blsr to match the surrounding capitalization?

Contributor Author

Amended.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman
Member

mbauman commented Oct 23, 2018

Hm, strange that this seems to show both time and memory regressions in Nanosoldier. They seem to be real:

julia> function perf_sumlogical(A)
           s = zero(eltype(A))
           nrows = size(A, 1)
           ncols = size(A, 2)
           r = falses(nrows)
           r[1:4:end] .= true
           @simd for i = 1:ncols
               val = @inbounds A[r, i]
               s += first(val)
           end
           return s
       end
perf_sumlogical (generic function with 1 method)

julia> using BenchmarkTools

julia> A = rand(100,100);

julia> @btime perf_sumlogical($A);
  24.910 μs (203 allocations: 31.44 KiB)

julia> @eval Base begin
       @inline _blsr(x)= x & (x-1)
       @inline function iterate(L::Base.LogicalIndex{Int,<:BitArray})
           L.sum == 0 && return nothing
           Bc = L.mask.chunks
           return iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
       end
       @inline function iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
           Bc = L.mask.chunks
           i1, c = s
           while c==0
               i1 % UInt >= length(Bc) % UInt && return nothing
               i1 += 1
               @inbounds c = Bc[i1]
           end
           tz = trailing_zeros(c) + 1
           c = _blsr(c)
           return ((i1-1)<<6 + tz, (i1, c))
       end
       end
iterate (generic function with 201 methods)

julia> @btime perf_sumlogical($A);
  86.771 μs (603 allocations: 51.75 KiB)

The culprit seems to be the type assertion I've tagged below. Remove it and:

julia> @btime perf_sumlogical($A);
  14.063 μs (203 allocations: 31.44 KiB)

base/multidimensional.jl Outdated
@mbauman
Member

mbauman commented Oct 24, 2018

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman mbauman merged commit 44f2563 into JuliaLang:master Oct 24, 2018
KristofferC pushed a commit that referenced this pull request Oct 29, 2018
* speed up logical indexing by bitarray

* changed inbounds / OOB behavior to match array iterators

switched spelling of _blsr

* Update base/multidimensional.jl

Co-Authored-By: chethega <[email protected]>
(cherry picked from commit 44f2563)
KristofferC pushed the same backport commit (cherry picked from 44f2563) referencing this pull request again on Oct 31, 2018, Nov 2, 2018, Feb 11, 2019, and Feb 20, 2020.
Labels: arrays, performance (Must go faster)
Projects: None yet
6 participants