Faster findall for bitarrays #29888
Conversation
Sorry for taking so long to properly respond to this. Very cool. Your solution far surpasses my expectations from when I issued the challenge on Discourse; I did not expect that this could get so fast!
Thanks for that. I added a few tests now. What's the next step here? Would anyone like to take a stab at reviewing the code?
This is awesome. Thank you so much for the contribution. I have just a few really minor nit-picky comments, but this is really impressive and obviously a great improvement.
```julia
nnzB == 0 && return I
nnzB == length(B) && (allindices!(I, B); return I)
```
`allindices!` seems like it should be able to be faster/more generic/less code. It's a little annoying though since we don't yet have the generic `Vector(itr)` constructor. Maybe it should just be `vec(collect(keys(B)))`, and move the short-circuit return to be before you construct `I`.
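A rough sketch of what that reordering could look like (illustrative only; the function name and surrounding structure are my assumptions, not the PR's actual code):

```julia
# Illustrative sketch of the suggested ordering (not the PR's actual code):
# short-circuit the all-true case before allocating I, using collected keys.
function findall_shortcircuit(B::BitArray)
    nnzB = count(B)
    nnzB == length(B) && return vec(collect(keys(B)))   # all-true fast path
    I = Vector{eltype(keys(B))}(undef, nnzB)
    nnzB == 0 && return I
    # stand-in for the optimized chunk-scanning loop:
    n = 1
    for k in keys(B)
        if B[k]
            I[n] = k
            n += 1
        end
    end
    return I
end
```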
It annoyed me too that I needed almost as many lines of code for the `allindices!` functions as for `findall` itself. `vec(collect(keys(B)))` is a great suggestion for vectors and arrays of dim >= 3, but I am seeing much worse performance for matrices (2 dims). This is the simple test script I'm using:
```julia
for B in [trues(100000), trues(200, 200), trues(50, 50, 50), trues(16, 16, 16, 16)]
    print(size(B)); @btime findall_optimized($B)
    print(size(B)); @btime vec(collect(keys($B)))
end
```
With results:
```
(100000,) 56.197 μs (3 allocations: 781.38 KiB)
(100000,) 55.882 μs (3 allocations: 781.34 KiB)
(200, 200) 49.331 μs (2 allocations: 625.08 KiB)
(200, 200) 72.926 μs (5 allocations: 625.19 KiB)
(50, 50, 50) 222.002 μs (2 allocations: 2.86 MiB)
(50, 50, 50) 225.390 μs (5 allocations: 2.86 MiB)
(16, 16, 16, 16) 151.709 μs (2 allocations: 2.00 MiB)
(16, 16, 16, 16) 155.849 μs (6 allocations: 2.00 MiB)
```
In fact, for matrices, it would then be better to turn off this special-case optimization. Timings for `findall_optimized` without using `allindices!`:
```
(100000,) 74.627 μs (2 allocations: 781.33 KiB)
(200, 200) 52.787 μs (2 allocations: 625.08 KiB)
(50, 50, 50) 234.702 μs (2 allocations: 2.86 MiB)
(16, 16, 16, 16) 165.563 μs (2 allocations: 2.00 MiB)
```
While I think some performance can be sacrificed for simpler code, IMO the degradation for matrices is a bit much. Can you think of a performant solution that works for arrays of all dimensions? If not, two alternatives are: 1) keep `allindices!` (or `_allindices!`) but with only two cases: the BitMatrix one as is, and `vec(collect(keys(B)))` for all other BitArrays; or 2) make `vec(collect(keys(B)))` fast for matrices.
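A rough sketch of alternative 1 (names and details are hypothetical, not the PR's code): keep a hand-written loop only for `BitMatrix` and fall back to collected keys everywhere else.

```julia
# Hypothetical sketch of alternative 1: special-case BitMatrix, delegate the rest.
function _allindices!(I, B::BitMatrix)
    k = 1
    for c in axes(B, 2), r in axes(B, 1)
        I[k] = CartesianIndex(r, c)
        k += 1
    end
    return I
end
# All other dimensionalities (vectors, N >= 3) just reuse the generic keys path.
_allindices!(I, B::BitArray) = copyto!(I, vec(collect(keys(B))))
```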
Thanks for the thorough testing here. I think what you have makes sense and is just fine.
base/bitarray.jl (outdated)
```julia
Icount += 1
Bs = size(B)
Bi = i1 = i = 1
irest = ntuple(one, length(B.dims) - 1)
```
Dang, constant propagation is amazing — I had to check that this was type stable. I would slightly prefer `ndims(B)` over `length(B.dims)` — they're the same, but `B.dims` initially worried me since its contents are undefined for BitVectors (but of course its length is defined, and so this does indeed work as you wrote it).
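A tiny illustration of the point (not from the PR):

```julia
# ndims(B) is part of the BitArray type, so `ndims(B) - 1` constant-folds and
# the tuple length is inferable.
B = trues(3, 4, 5)
irest = ntuple(one, ndims(B) - 1)   # (1, 1)
# length(B.dims) gives the same value, but relies on an internal field whose
# contents are undefined for BitVectors (only its length is defined).
```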
Ah, great, wasn't aware of `ndims`!
base/bitarray.jl (outdated)
```julia
end
end

@inline overflowind(i1, irest::Tuple{}, size) = (i1, irest)
```
I'd prefer to name this `_overflowind` (and `toind` below to `_toind`) — they're helper functions that are only relevant to this one method, but those are fairly common names and likely to be mistaken for `to_indices`.
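For context, the helper quoted in the diff above carries the excess of the first index over into the next dimension. A simplified reconstruction of how such a helper can work (my own sketch, not necessarily the PR's exact code):

```julia
# Simplified sketch of a recursive carry-over helper (not necessarily the PR's exact code).
@inline _overflowind(i1, irest::Tuple{}, size) = (i1, irest)
@inline function _overflowind(i1, irest, size)
    i2 = irest[1]
    while i1 > size[1]          # carry the overflow of the first index ...
        i1 -= size[1]
        i2 += 1                 # ... into the next dimension
    end
    i2, irest_tail = _overflowind(i2, Base.tail(irest), Base.tail(size))
    return (i1, (i2, irest_tail...))
end

# e.g. in a 5×5×5 array, a raw index of (7, 1, 1) normalizes to (2, 2, 1):
_overflowind(7, (1, 1), (5, 5, 5))   # (2, (2, 1))
```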
test/bitarray.jl (outdated)
```julia
@check_bit_operation findall(b1) Vector{CartesianIndex{3}}

# BitArrays of various dimensions
for dims = 2:8
```
Suggested change:
```diff
-for dims = 2:8
+for dims = 0:8
```
Let's also add tests for 0-dimensional arrays — they work due to the early exits, but would fail the general algorithm if that wasn't the case.
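A hypothetical check of the 0-dimensional behavior (not the PR's actual test code):

```julia
B0 = trues()                      # 0-dimensional BitArray with a single true element
@assert findall(B0) == [CartesianIndex()]
B0[] = false
@assert isempty(findall(B0))
```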
Good idea. Had to update the code slightly to work for the 1-dimensional case.
I tried a few other sizes with your benchmarking script trying to assess the worst case scenarios… and even then this is performing spectacularly. I had to bend over backwards to find anything that's remotely a regression, and even then only in a few circumstances!
What remains to be done here? This PR seems to be in good shape. Is it time to merge?
Nanosoldier to catch weird surprises, and either delight in the nice new numbers or add the "benchmarks beneficial" tag before merging?
@nanosoldier
What happened to the nanosoldier run? Looking at BaseBenchmarkReports, the "daily" report has not been produced since Nov 3, so perhaps the service is having trouble? (Although there was a run completing 3 hours ago.)
@nanosoldier
Seems to be running now
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Hmm. It looks like we are missing benchmarks for multidimensional findall. The reported non-improvement for 90% full 1000-element bitvectors looks reproducible with
Observe the superlinear jump for 0.5 density. For N=1000, we expect 500 misses, which costs 3.75 µs. The reported time of 1.7 µs cannot be true, unless the sneaky CPU uses the benchmark loop to learn the tested bitarray. Since each iteration tests the same pattern, well, there we go. What I believe happens is that, over the benchmark loop, we fill most of the branch history buffer space with our one critical branch. The history buffer contains counters for possible subsequences; it is apparently large enough to encode a significant fraction of our fixed test array.

Benchmarking is hard! I think we can merge this, but we need to think about how to avoid this problem in the future: it looks like we overestimated the speed of the old findall by a factor of 3. This issue can apply to all small, branchy microbenchmarks. We can either increase the size of the test sets, or we can interleave test runs (so that the cache and branch predictor are cold).
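One possible way to avoid this memorization effect is to cycle through a pool of distinct random inputs rather than reusing one fixed array. A minimal sketch (my own illustration; this is not what Nanosoldier or BaseBenchmarks currently does):

```julia
# Sketch: benchmark findall over many distinct random BitVectors so the branch
# history buffer cannot learn a single fixed bit pattern.
using Random

function bench_findall(n::Int, density::Float64; pool::Int = 64, reps::Int = 1_000)
    Random.seed!(0)
    Bs = [rand(n) .< density for _ in 1:pool]   # `pool` distinct BitVectors
    foreach(findall, Bs)                        # warm up / compile
    t0 = time_ns()
    for _ in 1:reps, B in Bs
        findall(B)
    end
    t1 = time_ns()
    return (t1 - t0) / (reps * pool)            # mean ns per findall call
end

# Compare reusing a single input (predictor can memorize it) with a large pool:
bench_findall(1000, 0.5; pool = 1, reps = 64_000)
bench_findall(1000, 0.5; pool = 64, reps = 1_000)
```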
So, just spinning more thoughts. This benchmarking artifact is absolutely mind-blowing to me. What this means is that we have a potential Spectre-type gadget: suppose we can repeatedly ask the kernel to run something like the old findall on a secret buffer, and the branch history table is not cleared. AFAIK current Spectre mitigations only deal with BTB and BHT poisoning, not BHT sniffing. Then we could probably reassemble large parts of 1000-bit secrets, the same way biologists assemble a genome from very short reads. Our read length is the length of the histories stored; Agner Fog suggests that they are quite long (18-32 bits). Neat! I am sorely tempted to run after this tangent now.
Very interesting observations, @chethega! I hadn't considered the effect the branch history buffer has on benchmarking, nor the possible exploit. It boggles my mind too.
* Faster findall for bitarrays
* Add a few tests for findall for bitarrays
* Code review updates for bitarray findall (JuliaLang#29888)
Inspired by a recent PR by @chethega for logically indexing a BitArray, and a challenge on Discourse to create an efficient `findall(::BitMatrix)`, here's my attempt -- an optimized `findall` that works for any `BitArray`.

The idea is very similar to the PR by @chethega: using `trailing_zeros` and `_blsr` to iterate through the bits. For multidimensional indices, when the index for a dimension grows larger than its size, it's carried over to the next dimension. I solve this with a `while` loop and recursive inlining.
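For readers unfamiliar with the chunk-scanning idea, here is a minimal sketch of the 1-D case (illustration only, not the PR's exact code):

```julia
# Minimal 1-D illustration of the technique: scan each 64-bit chunk with
# trailing_zeros to locate the next set bit, then clear it with Base._blsr
# (equivalent to c & (c - 1), i.e. "reset lowest set bit").
function findall_sketch_1d(B::BitVector)
    I = Vector{Int}(undef, count(B))
    n = 1
    for (ci, c) in enumerate(B.chunks)
        offset = (ci - 1) << 6            # 64 * (chunk index - 1)
        while c != 0
            tz = trailing_zeros(c)        # 0-based position of the lowest set bit
            I[n] = offset + tz + 1        # 1-based linear index into B
            n += 1
            c = Base._blsr(c)             # clear that bit and continue
        end
    end
    return I
end
```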
This version is around 0.7-75 times faster than the current `findall(::BitArray)` in my tests (on Intel Broadwell and Skylake; see timings below). The biggest speedups are for sparse matrices. It may perform worse than the current implementation for certain arrays, typically arrays that share one or more of the following traits: almost all values true (say >90%), a small first dimension (say < 16), and many dimensions (≥ 4-5, where the current code, due to its simplicity, is better at storing variables in registers). To mitigate this a bit, I threw in a cheap optimization for arrays that are all true.

I experimented with a few other ideas to improve performance:
* For an empty chunk, instead of adding 64 to the first-dimension index and then possibly doing several iterations to carry indices over to larger dimensions, pre-compute a vector of index additions. E.g. for a (5×5×5) array, adding 64 would add (4, 2, 2) to the indices (see the worked example after this list). This technique greatly speeds up finding in sparse arrays where the first dimension is small (say < 16); however, it's slower for every other type of array. One could imagine an introspective algorithm that does this when the first dimension is small, but I'm not sure it's worth the more complicated code.
Use the "Division by invariant integers using multiplication" technique to branchlessly update indices, at the cost of a few multiplications, shifts and subtractions. This proved to be slower than the carry-over solution in all cases except arrays where the first dimension is small. It also significantly increases the risk for bugs (like rounding errors for certain dimensions).
This is my first PR and contribution to Julia, so please bear with me if I've missed something in the process. It's probably a good idea to add a few more tests in `test/bitarray.jl`; I'm thinking of testing higher dimensions, sparse matrices (empty chunks), all-true matrices, etc. I'll wait on that until I get some feedback on this PR.

Below are timings for a few differently sized arrays and fill rates, run on a 2.6 GHz Skylake, Julia 1.0.1, Ubuntu. To reproduce these experiments, run this script.