Faster findall for bitarrays #29888
Conversation
Sorry for taking so long to properly respond to this. Very cool. Your solution far surpasses my expectations from when I issued the challenge on Discourse; I did not expect that this could get so fast!
Thanks for that. I added a few tests now. What's the next step here? Would anyone like to take a stab at reviewing the code?
This is awesome. Thank you so much for the contribution. I have just a few really minor nit-picky comments, but this is really impressive and obviously a great improvement.
```julia
nnzB == 0 && return I
nnzB == length(B) && (allindices!(I, B); return I)
```
`allindices!` seems like it should be able to be faster/more generic/less code. It's a little annoying though since we don't yet have the generic `Vector(itr)` constructor. Maybe it should just be `vec(collect(keys(B)))`, and move the short-circuit return to be before you construct `I`.
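A rough sketch of what that reordering could look like (illustrative only; the function name and surrounding structure are my assumptions, not the PR's actual code):

```julia
# Illustrative sketch of the suggested ordering (not the PR's actual code):
# short-circuit the all-true case before allocating I, using collected keys.
function findall_shortcircuit(B::BitArray)
    nnzB = count(B)
    nnzB == length(B) && return vec(collect(keys(B)))   # all-true fast path
    I = Vector{eltype(keys(B))}(undef, nnzB)
    nnzB == 0 && return I
    # stand-in for the optimized chunk-scanning loop:
    n = 1
    for k in keys(B)
        if B[k]
            I[n] = k
            n += 1
        end
    end
    return I
end
```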
It annoyed me too that I needed almost as many lines of code for the `allindices!` functions as for `findall` itself. `vec(collect(keys(B)))` is a great suggestion for vectors and arrays of dim >= 3, but I am seeing much worse performance for matrices (2 dims). This is the simple test script I'm using:
```julia
for B in [trues(100000), trues(200, 200), trues(50, 50, 50), trues(16, 16, 16, 16)]
    print(size(B)); @btime findall_optimized($B)
    print(size(B)); @btime vec(collect(keys($B)))
end
```
With results:
```
(100000,) 56.197 μs (3 allocations: 781.38 KiB)
(100000,) 55.882 μs (3 allocations: 781.34 KiB)
(200, 200) 49.331 μs (2 allocations: 625.08 KiB)
(200, 200) 72.926 μs (5 allocations: 625.19 KiB)
(50, 50, 50) 222.002 μs (2 allocations: 2.86 MiB)
(50, 50, 50) 225.390 μs (5 allocations: 2.86 MiB)
(16, 16, 16, 16) 151.709 μs (2 allocations: 2.00 MiB)
(16, 16, 16, 16) 155.849 μs (6 allocations: 2.00 MiB)
```
In fact, for matrices, it would then be better to turn off this special-case optimization. Timings for `findall_optimized` without using `allindices!`:
```
(100000,) 74.627 μs (2 allocations: 781.33 KiB)
(200, 200) 52.787 μs (2 allocations: 625.08 KiB)
(50, 50, 50) 234.702 μs (2 allocations: 2.86 MiB)
(16, 16, 16, 16) 165.563 μs (2 allocations: 2.00 MiB)
```
While I think some performance can be sacrificed for simpler code, IMO the degradation for matrices is a bit much. Can you think of a performant solution that works for arrays of all dimensions? If not, two alternatives are: 1) keep `allindices!` (or `_allindices!`) but with only two cases: the BitMatrix one as is, and `vec(collect(keys(B)))` for all other BitArrays; or 2) make `vec(collect(keys(B)))` fast for matrices.
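A rough sketch of alternative 1 (names and details are hypothetical, not the PR's code): keep a hand-written loop only for `BitMatrix` and fall back to collected keys everywhere else.

```julia
# Hypothetical sketch of alternative 1: special-case BitMatrix, delegate the rest.
function _allindices!(I, B::BitMatrix)
    k = 1
    for c in axes(B, 2), r in axes(B, 1)
        I[k] = CartesianIndex(r, c)
        k += 1
    end
    return I
end
# All other dimensionalities (vectors, N >= 3) just reuse the generic keys path.
_allindices!(I, B::BitArray) = copyto!(I, vec(collect(keys(B))))
```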
Thanks for the thorough testing here. I think what you have makes sense and is just fine.
base/bitarray.jl (outdated)
```julia
Icount += 1
Bs = size(B)
Bi = i1 = i = 1
irest = ntuple(one, length(B.dims) - 1)
```
Dang, constant propagation is amazing — I had to check that this was type stable. I would slightly prefer `ndims(B)` over `length(B.dims)` — they're the same, but `B.dims` initially worried me since its contents are undefined for BitVectors (but of course its length is defined, and so this does indeed work as you wrote it).
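A tiny illustration of the point (not from the PR):

```julia
# ndims(B) is part of the BitArray type, so `ndims(B) - 1` constant-folds and
# the tuple length is inferable.
B = trues(3, 4, 5)
irest = ntuple(one, ndims(B) - 1)   # (1, 1)
# length(B.dims) gives the same value, but relies on an internal field whose
# contents are undefined for BitVectors (only its length is defined).
```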
Ah, great, wasn't aware of `ndims`!
base/bitarray.jl (outdated)
```julia
end
end

@inline overflowind(i1, irest::Tuple{}, size) = (i1, irest)
```
I'd prefer to name this `_overflowind` (and `toind` below to `_toind`) — they're helper functions that are only relevant to this one method, but those are fairly common names and likely to be mistaken for `to_indices`.
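For context, the helper quoted in the diff above carries the excess of the first index over into the next dimension. A simplified reconstruction of how such a helper can work (my own sketch, not necessarily the PR's exact code):

```julia
# Simplified sketch of a recursive carry-over helper (not necessarily the PR's exact code).
@inline _overflowind(i1, irest::Tuple{}, size) = (i1, irest)
@inline function _overflowind(i1, irest, size)
    i2 = irest[1]
    while i1 > size[1]          # carry the overflow of the first index ...
        i1 -= size[1]
        i2 += 1                 # ... into the next dimension
    end
    i2, irest_tail = _overflowind(i2, Base.tail(irest), Base.tail(size))
    return (i1, (i2, irest_tail...))
end

# e.g. in a 5×5×5 array, a raw index of (7, 1, 1) normalizes to (2, 2, 1):
_overflowind(7, (1, 1), (5, 5, 5))   # (2, (2, 1))
```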
test/bitarray.jl (outdated)
```julia
@check_bit_operation findall(b1) Vector{CartesianIndex{3}}

# BitArrays of various dimensions
for dims = 2:8
```
Suggested change:
```diff
-for dims = 2:8
+for dims = 0:8
```
Let's also add tests for 0-dimensional arrays — they work due to the early exits, but would fail the general algorithm if that wasn't the case.
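A hypothetical check of the 0-dimensional behavior (not the PR's actual test code):

```julia
B0 = trues()                      # 0-dimensional BitArray with a single true element
@assert findall(B0) == [CartesianIndex()]
B0[] = false
@assert isempty(findall(B0))
```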
Good idea. Had to update the code slightly to work for the 1-dimensional case.
I tried a few other sizes with your benchmarking script trying to assess the worst case scenarios… and even then this is performing spectacularly. I had to bend over backwards to find anything that's remotely a regression, and even then only in a few circumstances!
What remains to be done here? This PR seems to be in good shape. Is it time to merge?
Nanosoldier to catch weird surprises, and either delight in the nice new numbers or add the "benchmarks beneficial" tag before merging?
@nanosoldier
What happened to the nanosoldier run? Looking at BaseBenchmarkReports, the "daily" report has not been produced since Nov 3, so perhaps the service is having trouble? (Although there was a run completing 3 hours ago.)
@nanosoldier
Seems to be running now
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Hmm. It looks like we are missing benchmarks for multidimensional findall. The reported non-improvement for 90% full 1000-element bitvectors looks reproducible with
Observe the superlinear jump for 0.5 density. For N=1000, we expect 500 misses, which costs 3.75 µs. The reported time of 1.7 µs cannot be true, unless the sneaky CPU uses the benchmark loop to learn the tested bitarray. Since each iteration tests the same pattern, well, there we go. What I believe happens is that, over the benchmark loop, we fill most of the branch history buffer space with our one critical branch. The history buffer contains counters for possible subsequences; it is apparently large enough to encode a significant fraction of our fixed test array.

Benchmarking is hard! I think we can merge this, but we need to think about how to avoid this problem in the future: it looks like we overestimated the speed of the old findall by a factor of 3. This issue can apply to all small, branchy microbenchmarks. We can either increase the size of the test sets, or we can interleave test runs (so that the cache and branch predictor are cold).
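One possible way to avoid this memorization effect is to cycle through a pool of distinct random inputs rather than reusing one fixed array. A minimal sketch (my own illustration; this is not what Nanosoldier or BaseBenchmarks currently does):

```julia
# Sketch: benchmark findall over many distinct random BitVectors so the branch
# history buffer cannot learn a single fixed bit pattern.
using Random

function bench_findall(n::Int, density::Float64; pool::Int = 64, reps::Int = 1_000)
    Random.seed!(0)
    Bs = [rand(n) .< density for _ in 1:pool]   # `pool` distinct BitVectors
    foreach(findall, Bs)                        # warm up / compile
    t0 = time_ns()
    for _ in 1:reps, B in Bs
        findall(B)
    end
    t1 = time_ns()
    return (t1 - t0) / (reps * pool)            # mean ns per findall call
end

# Compare reusing a single input (predictor can memorize it) with a large pool:
bench_findall(1000, 0.5; pool = 1, reps = 64_000)
bench_findall(1000, 0.5; pool = 64, reps = 1_000)
```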
So, just spinning more thoughts. This benchmarking artifact is absolutely mind-blowing to me. What this means is that we have a potential Spectre-type gadget: suppose we can repeatedly ask the kernel to run something like the old findall on a secret buffer, and the branch history table is not cleared. AFAIK current Spectre mitigations only deal with BTB and BHT poisoning, not BHT sniffing. Then we could probably reassemble large parts of 1000-bit secrets, the same way biologists assemble a genome from very short reads. Our read length is the length of the histories stored; Agner Fog suggests that they are quite long (18-32 bits). Neat! I am sorely tempted to run after this tangent now.
Very interesting observations, @chethega! I hadn't considered the effect the branch history buffer has on benchmarking, nor the possible exploit. It boggles my mind too.
* Faster findall for bitarrays
* Add a few tests for findall for bitarrays
* Code review updates for bitarray findall (JuliaLang#29888)
Inspired by a recent PR by @chethega for logically indexing a BitArray, and a challenge on Discourse to create an efficient `findall(::BitMatrix)`, here's my attempt -- an optimized `findall` that works for any `BitArray`.

The idea is very similar to the PR by @chethega: using `trailing_zeros` and `_blsr` to iterate through the bits. For multidimensional indices, when the index for a dimension grows larger than its size, it's carried over to the next dimension. I solve this with a `while` loop and recursive inlining.
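For readers unfamiliar with the chunk-scanning idea, here is a minimal sketch of the 1-D case (illustration only, not the PR's exact code):

```julia
# Minimal 1-D illustration of the technique: scan each 64-bit chunk with
# trailing_zeros to locate the next set bit, then clear it with Base._blsr
# (equivalent to c & (c - 1), i.e. "reset lowest set bit").
function findall_sketch_1d(B::BitVector)
    I = Vector{Int}(undef, count(B))
    n = 1
    for (ci, c) in enumerate(B.chunks)
        offset = (ci - 1) << 6            # 64 * (chunk index - 1)
        while c != 0
            tz = trailing_zeros(c)        # 0-based position of the lowest set bit
            I[n] = offset + tz + 1        # 1-based linear index into B
            n += 1
            c = Base._blsr(c)             # clear that bit and continue
        end
    end
    return I
end
```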
This version is around 0.7-75 times faster than the current `findall(::BitArray)` in my tests (on Intel Broadwell and Skylake; see timings below). The biggest speedups are for sparse matrices. It may perform worse than the current implementation for certain arrays, typically arrays that share one or more of the following traits: almost all values true (say >90%), a small first dimension (say < 16), and many dimensions (≥ 4-5, where the current code, due to its simplicity, is better at storing variables in registers). To mitigate this a bit, I threw in a cheap optimization for arrays that are all true.

I experimented with a few other ideas to improve performance:
* For an empty chunk, instead of adding 64 to the first-dimension index and then possibly doing several iterations to carry indices over to larger dimensions, pre-compute a vector of index additions. E.g. for a (5×5×5) array, adding 64 would add (4, 2, 2) to the indices (see the worked example after this list). This technique greatly speeds up finding in sparse arrays where the first dimension is small (say < 16); however, it's slower for every other type of array. One could imagine an introspective algorithm that does this when the first dimension is small, but I'm not sure it's worth the more complicated code.
Use the "Division by invariant integers using multiplication" technique to branchlessly update indices, at the cost of a few multiplications, shifts and subtractions. This proved to be slower than the carry-over solution in all cases except arrays where the first dimension is small. It also significantly increases the risk for bugs (like rounding errors for certain dimensions).
This is my first PR and contribution to Julia, so please bear with me if I've missed something in the process. It's probably a good idea to add a few more tests in `test/bitarray.jl`; I'm thinking of testing higher dimensions, sparse matrices (empty chunks), all-true matrices, etc. I'll wait on that until I get some feedback on this PR.

Below are timings for a few differently sized arrays and fill rates, run on a 2.6 GHz Skylake, Julia 1.0.1, Ubuntu. To reproduce these experiments, run this script.