
Faster findall for bitarrays #29888

Merged
merged 3 commits into JuliaLang:master from maxbennedich:faster-findall-bitarray on Nov 10, 2018

Conversation

@maxbennedich (Contributor)

Inspired by a recent PR by @chethega for logically indexing a BitArray, and a challenge on Discourse to create an efficient findall(::BitMatrix), here's my attempt -- an optimized findall that works for any BitArray.

The idea is very similar to @chethega's PR: use trailing_zeros and _blsr to iterate through the set bits. For multidimensional indices, when the index for a dimension grows past that dimension's size, the excess is carried over to the next dimension. I solve this with a while loop and recursive inlining; a sketch of both ideas follows below.
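For illustration, here is a minimal sketch of both ideas (the names are mine; the PR's own helpers are called overflowind and toind, and chunks is BitArray's internal storage field, so treat this as a sketch rather than the PR's exact code):

using Base: tail

# Chunk loop for a BitVector, assuming BitArray's internal layout:
# 64-bit chunks in B.chunks, with padding bits guaranteed to be zero.
function findall_sketch(B::BitVector)
    I = Vector{Int}(undef, count(B))
    n = 1
    for (k, chunk) in enumerate(B.chunks)
        offset = (k - 1) << 6                 # 64 bits per chunk
        while chunk != 0
            I[n] = offset + trailing_zeros(chunk) + 1  # 1-based position of lowest set bit
            n += 1
            chunk &= chunk - 1                # clear lowest set bit (what Base._blsr does)
        end
    end
    return I
end

# Carry-over for multidimensional indices: when the first-dimension index
# outgrows its size, the excess is carried into the remaining index tuple.
# The Tuple{} base case lets the compiler unroll the recursion per dimension.
@inline _carry(i1, irest::Tuple{}, sz) = (i1, irest)
@inline function _carry(i1, irest, sz)
    i2 = irest[1]
    while i1 > sz[1]
        i1 -= sz[1]
        i2 += 1
    end
    i2, irest2 = _carry(i2, tail(irest), tail(sz))
    return (i1, (i2, irest2...))
end

For example, findall_sketch(BitVector([true, false, true])) returns [1, 3].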

This version runs between 0.7x and 75x the speed of the current findall(::BitArray) in my tests (on Intel Broadwell and Skylake; see timings below). The biggest speedups are for sparse matrices. It may perform worse than the current implementation for certain arrays, typically arrays with one or more of the following traits: almost all values true (say >90%), a small first dimension (say < 16), or many dimensions (≥ 4-5, where the current code, due to its simplicity, is better at keeping variables in registers). To mitigate this a bit, I threw in a cheap optimization for arrays that are all true.

I experimented with a few other ideas to improve performance:

  • For an empty chunk, instead of adding 64 to the first-dimension index and then possibly doing several iterations to carry indices over to higher dimensions, pre-compute a vector of index additions (see the sketch after this list). E.g. for a (5x5x5) array, adding 64 would add (4,2,2) to the indices. This technique greatly speeds up searches in sparse arrays whose first dimension is small (say < 16), but it is slower for every other type of array. One could imagine an introspective algorithm that does this only when the first dimension is small, but I'm not sure it's worth the more complicated code.

  • Use the "division by invariant integers using multiplication" technique to branchlessly update indices, at the cost of a few multiplications, shifts, and subtractions. This proved slower than the carry-over solution in all cases except arrays whose first dimension is small. It also significantly increases the risk of bugs (such as rounding errors for certain dimensions).
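To make the first bullet concrete, here is a hypothetical helper (the name and signature are mine, not the PR's) that decomposes a linear step of 64 into per-dimension increments:

# Mixed-radix decomposition of a linear step for an array of size sz.
function step_increments(step::Integer, sz::Dims)
    incs = Int[]
    for s in sz
        push!(incs, step % s)
        step ÷= s
    end
    return incs
end

step_increments(64, (5, 5, 5))  # returns [4, 2, 2], matching the example above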

This is my first PR and contribution to Julia, so please bear with me if I've missed something in the process. It's probably a good idea to add a few more tests in test/bitarray.jl; I'm thinking of testing higher dimensions, sparse matrices (empty chunks), all-true matrices, etc. I'll hold off on that until I get some feedback on this PR.

Below are timings for a few differently sized arrays and fill rates, run on a 2.6 GHz Skylake, Julia 1.0.1, Ubuntu. To reproduce these experiments, run this script.

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   80.95 μs |  785.95 ns | 2043.47 |    1.12 μs |   10.84 ns |   28.18 | 72.52 x
          100000 |    1.0 % |   84.33 μs |   83.75 ns |  217.74 |    2.06 μs |    2.05 ns |    5.32 | 40.92 x
          100000 |    5.0 % |  110.87 μs |   22.32 ns |   58.03 |    6.60 μs |    1.33 ns |    3.45 | 16.80 x
          100000 |   20.1 % |  240.57 μs |   11.96 ns |   31.10 |   23.09 μs |    1.15 ns |    2.99 | 10.42 x
          100000 |   50.0 % |  347.19 μs |    6.94 ns |   18.04 |   42.96 μs |    0.86 ns |    2.23 |  8.08 x
          100000 |   80.0 % |  212.94 μs |    2.66 ns |    6.92 |   59.93 μs |    0.75 ns |    1.95 |  3.55 x
          100000 |   99.0 % |   91.03 μs |    0.92 ns |    2.39 |   71.33 μs |    0.72 ns |    1.87 |  1.28 x
          100000 |  100.0 % |   80.60 μs |    0.81 ns |    2.10 |   47.35 μs |    0.47 ns |    1.23 |  1.70 x
       191 x 211 |    0.1 % |   35.32 μs |  802.80 ns | 2087.27 |    0.53 μs |   12.08 ns |   31.42 | 66.44 x
       191 x 211 |    1.0 % |   41.88 μs |  102.15 ns |  265.58 |    1.09 μs |    2.66 ns |    6.93 | 38.34 x
       191 x 211 |    5.1 % |   51.54 μs |   25.05 ns |   65.14 |    2.97 μs |    1.45 ns |    3.76 | 17.33 x
       191 x 211 |   20.2 % |   91.44 μs |   11.23 ns |   29.20 |   11.91 μs |    1.46 ns |    3.80 |  7.68 x
       191 x 211 |   50.1 % |  150.58 μs |    7.46 ns |   19.40 |   25.44 μs |    1.26 ns |    3.28 |  5.92 x
       191 x 211 |   80.0 % |   96.48 μs |    2.99 ns |    7.78 |   39.09 μs |    1.21 ns |    3.15 |  2.47 x
       191 x 211 |   99.0 % |   58.39 μs |    1.46 ns |    3.81 |   47.41 μs |    1.19 ns |    3.09 |  1.23 x
       191 x 211 |  100.0 % |   53.74 μs |    1.33 ns |    3.47 |   36.30 μs |    0.90 ns |    2.34 |  1.48 x
   15 x 201 x 10 |    0.1 % |   31.97 μs | 1031.29 ns | 2681.35 |    1.15 μs |   37.19 ns |   96.69 | 27.73 x
   15 x 201 x 10 |    1.0 % |   28.17 μs |   91.46 ns |  237.81 |    1.51 μs |    4.89 ns |   12.71 | 18.71 x
   15 x 201 x 10 |    5.1 % |   42.36 μs |   27.69 ns |   71.99 |    3.26 μs |    2.13 ns |    5.54 | 12.98 x
   15 x 201 x 10 |   20.2 % |   82.04 μs |   13.48 ns |   35.06 |   17.71 μs |    2.91 ns |    7.57 |  4.63 x
   15 x 201 x 10 |   50.0 % |  123.95 μs |    8.22 ns |   21.38 |   34.65 μs |    2.30 ns |    5.98 |  3.58 x
   15 x 201 x 10 |   80.1 % |   83.27 μs |    3.45 ns |    8.96 |   49.03 μs |    2.03 ns |    5.28 |  1.70 x
   15 x 201 x 10 |   99.0 % |   52.59 μs |    1.76 ns |    4.58 |   48.26 μs |    1.62 ns |    4.20 |  1.09 x
   15 x 201 x 10 |  100.0 % |   41.09 μs |    1.36 ns |    3.54 |   38.28 μs |    1.27 ns |    3.30 |  1.07 x
 64 x 9 x 3 x 18 |    0.1 % |   31.06 μs |  913.62 ns | 2375.41 |    0.55 μs |   16.13 ns |   41.94 | 56.63 x
 64 x 9 x 3 x 18 |    1.0 % |   32.98 μs |  102.74 ns |  267.14 |    1.23 μs |    3.82 ns |    9.93 | 26.90 x
 64 x 9 x 3 x 18 |    5.1 % |   39.37 μs |   24.90 ns |   64.75 |    4.70 μs |    2.97 ns |    7.73 |  8.38 x
 64 x 9 x 3 x 18 |   20.1 % |   71.85 μs |   11.47 ns |   29.83 |   14.86 μs |    2.37 ns |    6.17 |  4.84 x
 64 x 9 x 3 x 18 |   50.0 % |  114.08 μs |    7.34 ns |   19.08 |   34.62 μs |    2.23 ns |    5.79 |  3.29 x
 64 x 9 x 3 x 18 |   80.2 % |   77.28 μs |    3.10 ns |    8.06 |   56.98 μs |    2.28 ns |    5.94 |  1.36 x
 64 x 9 x 3 x 18 |   99.0 % |   62.85 μs |    2.04 ns |    5.31 |   68.60 μs |    2.23 ns |    5.79 |  0.92 x
 64 x 9 x 3 x 18 |  100.0 % |   69.02 μs |    2.22 ns |    5.77 |   60.30 μs |    1.94 ns |    5.04 |  1.14 x

@nalimilan added the performance (Must go faster) label Nov 1, 2018
@chethega (Contributor) commented Nov 6, 2018

Sorry for taking so long to properly respond to this.

Very cool. Your solution far surpasses the expectations I had when I issued the challenge on Discourse; I did not expect this could get so fast!

@maxbennedich (Contributor, Author)

Thanks for that. I added a few tests now. What's the next step here? Would anyone like to take a stab at reviewing the code?

@mbauman (Member) left a comment

This is awesome. Thank you so much for the contribution. I have just a few really minor nit-picky comments, but this is really impressive and obviously a great improvement.

nnzB == 0 && return I
nnzB == length(B) && (allindices!(I, B); return I)
Member:

allindices! seems like it could be faster, more generic, and less code. It's a little annoying, though, since we don't yet have the generic Vector(itr) constructor. Maybe it should just be vec(collect(keys(B))), with the short-circuit return moved to before you construct I.
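For reference, this is what the suggestion produces for a small all-true matrix (keys(B) yields CartesianIndices, which collect and vec flatten in column-major order; output from a Julia 1.0-era REPL):

julia> B = trues(2, 2);

julia> vec(collect(keys(B)))
4-element Array{CartesianIndex{2},1}:
 CartesianIndex(1, 1)
 CartesianIndex(2, 1)
 CartesianIndex(1, 2)
 CartesianIndex(2, 2)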

@maxbennedich (Contributor, Author):

It annoyed me too that I needed almost as many lines of code for the allindices! functions as for findall itself. vec(collect(keys(B))) is a great suggestion for vectors and arrays of dim >= 3, but I am seeing much worse performance for matrices (2 dims). This is the simple test script I'm using:

using BenchmarkTools

for B in [trues(100000), trues(200, 200), trues(50, 50, 50), trues(16, 16, 16, 16)]
    print(size(B)); @btime findall_optimized($B)
    print(size(B)); @btime vec(collect(keys($B)))
end

With results:

(100000,)  56.197 μs (3 allocations: 781.38 KiB)
(100000,)  55.882 μs (3 allocations: 781.34 KiB)
(200, 200)  49.331 μs (2 allocations: 625.08 KiB)
(200, 200)  72.926 μs (5 allocations: 625.19 KiB)
(50, 50, 50)  222.002 μs (2 allocations: 2.86 MiB)
(50, 50, 50)  225.390 μs (5 allocations: 2.86 MiB)
(16, 16, 16, 16)  151.709 μs (2 allocations: 2.00 MiB)
(16, 16, 16, 16)  155.849 μs (6 allocations: 2.00 MiB)

In fact, for matrices, it would be better to turn off this special-case optimization. Timings for findall_optimized without using allindices!:

(100000,)  74.627 μs (2 allocations: 781.33 KiB)
(200, 200)  52.787 μs (2 allocations: 625.08 KiB)
(50, 50, 50)  234.702 μs (2 allocations: 2.86 MiB)
(16, 16, 16, 16)  165.563 μs (2 allocations: 2.00 MiB)

While I think some performance can be sacrificed for simpler code, IMO the degradation for matrices is a bit much. Can you think of a performant solution that works for arrays of all dimensions? If not, two alternatives are: 1) keep allindices! (or _allindices!) but with only two cases: the BitMatrix one as is, and vec(collect(keys(B))) for all other BitArrays; or 2) make vec(collect(keys(B))) fast for matrices.

Member:

Thanks for the thorough testing here. I think what you have makes sense and is just fine.

base/bitarray.jl Outdated
Icount += 1
Bs = size(B)
Bi = i1 = i = 1
irest = ntuple(one, length(B.dims) - 1)
Member:

Dang, constant propagation is amazing — I had to check that this was type stable. I would slightly prefer ndims(B) over length(B.dims) — they're the same but B.dims initially worried me since its contents are undefined for BitVectors (but of course its length is defined and so this does indeed work as you wrote it).
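A quick REPL check of that equivalence:

julia> B = trues(3, 4);

julia> ndims(B) === length(B.dims)
true

julia> ndims(trues(3))  # safe even though a BitVector's dims contents are undefined
1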

@maxbennedich (Contributor, Author):

Ah, great, I wasn't aware of ndims!

base/bitarray.jl Outdated
end
end

@inline overflowind(i1, irest::Tuple{}, size) = (i1, irest)
Member:

I'd prefer to name this _overflowind (and toind below to _toind) — they're helper functions that are only relevant to this one method, but those are fairly common names and likely to be mistaken for to_indices.

test/bitarray.jl Outdated
@check_bit_operation findall(b1) Vector{CartesianIndex{3}}

# BitArrays of various dimensions
for dims = 2:8
Member:

Suggested change
for dims = 2:8
for dims = 0:8

Let's also add tests for 0-dimensional arrays — they work due to the early exits, but would fail the general algorithm if that wasn't the case.
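For context, a 0-dimensional BitArray can be built with trues() with no arguments; a sketch of what such a test would exercise:

julia> B = trues();

julia> findall(B)
1-element Array{CartesianIndex{0},1}:
 CartesianIndex()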

@maxbennedich (Contributor, Author):

Good idea. Had to update the code slightly to work for the 1-dimensional case.

@mbauman (Member) commented Nov 7, 2018

I tried a few other sizes with your benchmarking script, trying to assess the worst-case scenarios… and even then this is performing spectacularly. I had to bend over backwards to find anything that's remotely a regression, and even then only in a few circumstances!

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
    1 x 201 x 10 |    0.1 % |    2.91 μs | 1453.33 ns | 3197.33 |    0.71 μs |  352.79 ns |  776.13 |  4.12 x
    1 x 201 x 10 |    0.7 % |    2.98 μs |  212.87 ns |  468.32 |    0.89 μs |   63.61 ns |  139.94 |  3.35 x
    1 x 201 x 10 |    5.2 % |    3.51 μs |   33.77 ns |   74.30 |    1.29 μs |   12.43 ns |   27.34 |  2.72 x
    1 x 201 x 10 |   20.0 % |    4.26 μs |   10.57 ns |   23.24 |    1.53 μs |    3.81 ns |    8.37 |  2.78 x
    1 x 201 x 10 |   50.8 % |    4.38 μs |    4.29 ns |    9.44 |    2.58 μs |    2.52 ns |    5.55 |  1.70 x
    1 x 201 x 10 |   79.9 % |    4.54 μs |    2.83 ns |    6.22 |    3.62 μs |    2.25 ns |    4.96 |  1.25 x
    1 x 201 x 10 |   98.9 % |    4.98 μs |    2.51 ns |    5.52 |    4.45 μs |    2.24 ns |    4.93 |  1.12 x
    1 x 201 x 10 |  100.0 % |    4.93 μs |    2.45 ns |    5.40 |    3.89 μs |    1.93 ns |    4.25 |  1.27 x
    2 x 3 x 1000 |    0.2 % |   11.98 μs | 1089.09 ns | 2396.00 |    1.64 μs |  148.96 ns |  327.72 |  7.31 x
    2 x 3 x 1000 |    1.0 % |   12.14 μs |  209.26 ns |  460.37 |    1.94 μs |   33.39 ns |   73.45 |  6.27 x
    2 x 3 x 1000 |    4.8 % |   12.92 μs |   45.33 ns |   99.73 |    2.30 μs |    8.07 ns |   17.75 |  5.62 x
    2 x 3 x 1000 |   19.7 % |   16.15 μs |   13.64 ns |   30.01 |    4.45 μs |    3.75 ns |    8.26 |  3.63 x
    2 x 3 x 1000 |   50.9 % |   24.21 μs |    7.93 ns |   17.45 |   11.66 μs |    3.82 ns |    8.40 |  2.08 x
    2 x 3 x 1000 |   80.6 % |   13.94 μs |    2.88 ns |    6.34 |   13.85 μs |    2.86 ns |    6.30 |  1.01 x
    2 x 3 x 1000 |   99.1 % |   12.67 μs |    2.13 ns |    4.69 |   12.23 μs |    2.06 ns |    4.53 |  1.04 x
    2 x 3 x 1000 |  100.0 % |   12.20 μs |    2.03 ns |    4.47 |   10.16 μs |    1.69 ns |    3.73 |  1.20 x
  1 x 1 x 100000 |    0.1 % |  140.78 μs | 1366.84 ns | 3007.06 |   72.82 μs |  707.01 ns | 1555.42 |  1.93 x
  1 x 1 x 100000 |    1.0 % |  146.35 μs |  145.33 ns |  319.73 |   93.78 μs |   93.12 ns |  204.87 |  1.56 x
  1 x 1 x 100000 |    5.0 % |  190.70 μs |   38.39 ns |   84.47 |  156.51 μs |   31.51 ns |   69.32 |  1.22 x
  1 x 1 x 100000 |   20.1 % |  352.04 μs |   17.51 ns |   38.51 |  305.54 μs |   15.19 ns |   33.43 |  1.15 x
  1 x 1 x 100000 |   50.0 % |  503.80 μs |   10.07 ns |   22.15 |  503.27 μs |   10.06 ns |   22.13 |  1.00 x
  1 x 1 x 100000 |   80.0 % |  338.91 μs |    4.24 ns |    9.32 |  355.38 μs |    4.44 ns |    9.77 |  0.95 x
  1 x 1 x 100000 |   99.0 % |  249.24 μs |    2.52 ns |    5.54 |  245.93 μs |    2.49 ns |    5.47 |  1.01 x
  1 x 1 x 100000 |  100.0 % |  252.11 μs |    2.52 ns |    5.55 |  218.66 μs |    2.19 ns |    4.81 |  1.15 x
  2 x 1 x 100000 |    0.1 % |  394.49 μs | 1915.01 ns | 4213.03 |   76.54 μs |  371.56 ns |  817.44 |  5.15 x
  2 x 1 x 100000 |    1.0 % |  406.35 μs |  204.61 ns |  450.13 |  117.61 μs |   59.22 ns |  130.28 |  3.46 x
  2 x 1 x 100000 |    4.9 % |  451.07 μs |   45.79 ns |  100.73 |  224.80 μs |   22.82 ns |   50.20 |  2.01 x
  2 x 1 x 100000 |   20.1 % |  667.98 μs |   16.65 ns |   36.63 |  494.97 μs |   12.34 ns |   27.14 |  1.35 x
  2 x 1 x 100000 |   50.1 % |  945.96 μs |    9.44 ns |   20.77 |  643.56 μs |    6.42 ns |   14.13 |  1.47 x
  2 x 1 x 100000 |   80.1 % |  647.65 μs |    4.04 ns |    8.90 |  625.24 μs |    3.90 ns |    8.59 |  1.04 x
  2 x 1 x 100000 |   99.0 % |  429.32 μs |    2.17 ns |    4.77 |  481.60 μs |    2.43 ns |    5.35 |  0.89 x
  2 x 1 x 100000 |  100.0 % |  471.20 μs |    2.36 ns |    5.18 |  419.94 μs |    2.10 ns |    4.62 |  1.12 x
1 x 1 x 2 x 1 x 10000 |    0.1 % |   43.43 μs | 2068.14 ns | 4549.91 |   21.44 μs | 1021.10 ns | 2246.41 |  2.03 x
1 x 1 x 2 x 1 x 10000 |    1.1 % |   45.44 μs |  212.32 ns |  467.10 |   28.08 μs |  131.22 ns |  288.68 |  1.62 x
1 x 1 x 2 x 1 x 10000 |    5.1 % |   52.11 μs |   51.59 ns |  113.50 |   44.14 μs |   43.70 ns |   96.14 |  1.18 x
1 x 1 x 2 x 1 x 10000 |   19.9 % |   75.65 μs |   19.05 ns |   41.92 |   79.29 μs |   19.97 ns |   43.94 |  0.95 x
1 x 1 x 2 x 1 x 10000 |   49.7 % |  113.96 μs |   11.46 ns |   25.20 |  126.28 μs |   12.69 ns |   27.93 |  0.90 x
1 x 1 x 2 x 1 x 10000 |   80.1 % |   91.50 μs |    5.71 ns |   12.56 |  110.59 μs |    6.90 ns |   15.18 |  0.83 x
1 x 1 x 2 x 1 x 10000 |   99.0 % |   84.66 μs |    4.28 ns |    9.41 |  105.26 μs |    5.32 ns |   11.70 |  0.80 x
1 x 1 x 2 x 1 x 10000 |  100.0 % |   85.04 μs |    4.25 ns |    9.35 |   53.54 μs |    2.68 ns |    5.89 |  1.59 x

@StefanKarpinski (Member)

What remains to be done here? This PR seems to be in good shape. Is it time to merge?

@chethega (Contributor) commented Nov 8, 2018

Is it time to merge?

Run Nanosoldier to catch weird surprises, then either delight in the nice new numbers or add the "benchmarks beneficial" tag before merging?

@KristofferC (Member)

@nanosoldier runbenchmarks(ALL, vs = ":master")

@maxbennedich (Contributor, Author)

What happened to the Nanosoldier run? Looking at BaseBenchmarkReports, the "daily" report has not been produced since Nov 3, so perhaps the service is having trouble? (Although there was a run that completed 3 hours ago.)

@KristofferC (Member)

@nanosoldier runbenchmarks(ALL, vs = ":master")

@KristofferC (Member)

Seems to be running now

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@chethega (Contributor)

Hmm. It looks like we are missing benchmarks for multidimensional findall.

The reported non-improvement for 90%-full 1000-element bitvectors looks reproducible with @btime. However, it appears that @btime creates crazy artifacts on the tiny examples. Without this patch:

julia> for N in [100, 1000, 10_000, 100_000]
           r = rand(N); bx5 = r .> 0.5; bx1 = r .> 0.1; bx9 = r .> 0.9
           @show N
           for bx in [bx1, bx5, bx9]
               @btime findall($bx)
           end
       end
N = 100
  193.788 ns (1 allocation: 896 bytes)
  190.135 ns (1 allocation: 544 bytes)
  134.106 ns (1 allocation: 144 bytes)
N = 1000
  1.602 μs (1 allocation: 7.13 KiB)
  1.701 μs (1 allocation: 4.06 KiB)
  1.086 μs (1 allocation: 1008 bytes)
N = 10000
  18.180 μs (2 allocations: 70.27 KiB)
  44.844 μs (2 allocations: 39.52 KiB)
  18.940 μs (1 allocation: 8.13 KiB)
N = 100000
  225.082 μs (2 allocations: 703.52 KiB)
  491.945 μs (2 allocations: 389.64 KiB)
  212.360 μs (2 allocations: 78.20 KiB)

Observe the superlinear jump for 0.5 density from N=1000 to N=10_000.
My favorite reference, Agner Fog, has the following to say: the precise mechanism of branch prediction on Haswell is unknown; several parts of the puzzle are explained; a miss costs 15-20 cycles.

For N=1000, we expect around 500 misses, which should cost roughly 3.75 μs. The reported time of 1.7 μs cannot be true unless the sneaky CPU uses the benchmark loop to learn the tested bitarray. Since each iteration tests the same pattern, well, there we go.
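Spelling out that estimate (the ~2.67 GHz clock here is my assumption, inferred from the numbers rather than stated in the thread):

misses  = 1000 ÷ 2               # random 50% density ⇒ ~1 mispredict per 2 bits
cycles  = misses * 20            # upper end of the 15-20 cycle miss cost
time_μs = cycles / 2.67e9 * 1e6  # ≈ 3.75 μs, vs. the measured 1.7 μs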

What I believe happens is that, over the benchmark loop, we fill most of the branch history buffer with our one critical branch. The history buffer contains counters for possible subsequences; it is apparently large enough to encode a significant fraction of our fixed test array. Benchmarking is hard!

I think we can merge this, but we need to think about how to avoid this problem in the future: it looks like we overestimated the speed of the old findall by a factor of 3. This issue can affect all small, branchy microbenchmarks. We can either increase the size of the test sets, or interleave test runs (so that the cache and branch predictor are cold).
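A hedged sketch of the interleaving idea using BenchmarkTools: setup runs before every sample and evals=1 keeps one evaluation per sample, so each measurement sees a fresh pattern the predictor has never learned (at the cost of more timer noise on tiny inputs).

using BenchmarkTools

# Draw a fresh random bit pattern for every sample so the branch predictor
# cannot train on a fixed test array across iterations of the benchmark loop.
@btime findall(bx) setup=(bx = rand(1000) .> 0.5) evals=1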

@KristofferC merged commit 96ce5ba into JuliaLang:master Nov 10, 2018
@KristofferC added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Nov 10, 2018
@chethega (Contributor)

So, just spinning more thoughts. This benchmarking artifact is absolutely mind-blowing to me.

What this means is that we have a potential Spectre-type gadget: suppose we can repeatedly ask the kernel to run something like the old findall on a secret buffer, and the branch history table is not cleared. AFAIK, current Spectre mitigations only deal with BTB and BHT poisoning, not BHT sniffing.

Then we could probably reassemble large parts of 1000-bit secrets, the same way biologists assemble a genome from very short reads. Our read length is the length of the stored histories; Agner Fog suggests they are quite long (18-32 bits). Neat!

I am sorely tempted to run after this tangent now.

@maxbennedich (Contributor, Author)

Very interesting observations, @chethega! I hadn't considered the effect the branch history buffer has on benchmarking, nor the possible exploit. It boggles my mind too.

@maxbennedich deleted the faster-findall-bitarray branch November 11, 2018 09:10
KristofferC pushed a commit that referenced this pull request Nov 19, 2018
* Faster findall for bitarrays

* Add a few tests for findall for bitarrays

* Code review updates for bitarray findall (#29888)

(cherry picked from commit 96ce5ba)
@KristofferC mentioned this pull request Nov 19, 2018
tkf pushed a commit to tkf/julia that referenced this pull request Nov 21, 2018
KristofferC pushed commits that referenced this pull request on Dec 12, 2018, Feb 11, 2019, and Feb 20, 2020 (each cherry picked from commit 96ce5ba)