# Non-ideal default number of BLAS threads on aarch64-apple-darwin (#934)
It seems that it's better if I set the number of BLAS threads to 4 myself. Wasn't this done automatically before?
Note: `Random.seed!` fixes the test matrix so it is consistent between runs.

using Random
using LinearAlgebra
using BenchmarkTools

Random.seed!(46071)
A = rand(8192, 8192);

BLAS.get_num_threads()
sum(A)                  # checksum, to confirm the same matrix is used in every run
@btime lu!(copy(A));
@benchmark lu!(B) setup=( B=copy($A) )
sum(A)
B = copy(A);
@btime lu!(B);
julia> BLAS.get_num_threads()
6
julia> sum(A)
3.3553450714250512e7
julia> @btime lu!(copy(A));
2.189 s (5 allocations: 512.06 MiB)
julia> @benchmark lu!(B) setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 1.931 s … 2.026 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.948 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.969 s ± 50.677 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.93 s Histogram: frequency by time 2.03 s <
Memory estimate: 64.05 KiB, allocs estimate: 2.
julia> sum(A)
3.3553450714250512e7
julia> @btime lu!(B);
1.897 s (3 allocations: 64.08 KiB)
julia> BLAS.get_num_threads()
3
julia> sum(A)
3.3553450714250512e7
julia> @btime lu!(copy(A));
2.891 s (5 allocations: 512.06 MiB)
julia> @benchmark lu!(B) setup=( B=copy($A) )
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.754 s … 2.872 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.813 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.813 s ± 83.977 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.75 s Histogram: frequency by time 2.87 s <
Memory estimate: 64.05 KiB, allocs estimate: 2.
julia> sum(A)
3.3553450714250512e7
julia> @btime lu!(B);
2.636 s (3 allocations: 64.08 KiB)

Update:
julia> BLAS.get_num_threads()
3
julia> BLAS.set_num_threads(6)
julia> BLAS.get_num_threads()
6
julia> sum(A)
3.3553450714250512e7
julia> @btime lu!(copy(A));
2.037 s (5 allocations: 512.06 MiB)
julia> @benchmark lu!(B) setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 1.898 s … 2.348 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.995 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.080 s ± 237.145 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.9 s Histogram: frequency by time 2.35 s <
Memory estimate: 64.05 KiB, allocs estimate: 2.
julia> sum(A)
3.3553450714250512e7
julia> B = copy(A);
julia> @btime lu!(B);
1.706 s (3 allocations: 64.08 KiB)
The default number of OpenBLAS threads is a challenge. We may need a special case for M1 because of its mix of performance and efficiency cores. This was the latest update to that logic: JuliaLang/julia#45412
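As a thought experiment, such a special case could key off the performance-core count that macOS exposes via sysctl. A minimal sketch, assuming `hw.perflevel0.logicalcpu` is available (it reports the P-core count on Apple Silicon under recent macOS); this is not what Julia currently does:

```julia
# Hedged sketch of an M1-specific default: use one OpenBLAS thread per
# performance core, querying the P-core count from macOS. Illustrative only.
using LinearAlgebra

function set_blas_threads_to_performance_cores()
    np = try
        parse(Int, readchomp(`sysctl -n hw.perflevel0.logicalcpu`))  # P-core count on Apple Silicon
    catch
        Sys.CPU_THREADS  # fall back to the total logical CPU count if the key is missing
    end
    BLAS.set_num_threads(np)
    return np
end
```

On an M1 (4 performance + 4 efficiency cores) this would pick 4, which is the count the benchmarks above point to.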
On M1 I get:

% julia-17 -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│ VERSION = v"1.7.0"
└ BLAS.get_num_threads() = 8

% julia -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│ VERSION = v"1.9.0-DEV.983"
└ BLAS.get_num_threads() = 2

The "right" number of threads should be 4, not 8 or 2.
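Until the default changes, a manual override along the lines the issue describes (assuming an M1 with 4 performance cores) could go in `~/.julia/config/startup.jl`:

```julia
# Hedged workaround sketch: explicitly request one OpenBLAS thread per
# M1 performance core instead of relying on the default.
using LinearAlgebra
BLAS.set_num_threads(4)
BLAS.get_num_threads()  # now reports 4
```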
Also, I can confirm I get identical performance between Julia v1.7 and
That seems to be it. I ran the experiment on 1.8.0-rc3 and got the same results as on 1.7.2. Is 4 the right number? Is there a guarantee that if we ask for 4 threads they run on the performance cores?
@chriselrod may know.
One thread per performance core is optimal, which is what Julia detects: JuliaLang/julia#44072. Note that each performance core supports only a single thread, so we are not getting twice the core count from hyperthreading.
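One way to sanity-check that claim on a given machine is to sweep the BLAS thread count with the same `lu!` benchmark used above. A sketch (timings will of course vary by machine):

```julia
# Hedged sketch: time lu! at several OpenBLAS thread counts to find the
# sweet spot locally. Reuses the matrix size and seed from the issue body.
using LinearAlgebra, BenchmarkTools, Random

Random.seed!(46071)
A = rand(8192, 8192)
for n in (2, 4, 6, 8)
    BLAS.set_num_threads(n)
    t = @belapsed lu!(B) setup=(B = copy($A)) samples=3 evals=1
    println("BLAS threads = $n: $(round(t; digits=3)) s")
end
```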
If you pick 4 cores, are you guaranteed to get scheduled on the performance cores?
You aren't guaranteed anything, unfortunately. But the scheduler is pretty good at moving the right operations to the right places.
So basically, do we need to set 4 OpenBLAS threads on M1? The PR I linked above is where we may need to specialize for M1.
Not 4 specifically; simply don't divide by 2 in https://github.com/JuliaLang/julia/blob/e1739aa42a14b339e6589ceafe74d3ab48474e6e/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L585
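For reference, a minimal sketch of the kind of default being discussed, with the hypothetical `ncores_detected` standing in for whatever core count the stdlib detects (on M1 that is the 4 performance cores); the real code is at the line linked above:

```julia
# Hedged sketch, not the actual LinearAlgebra code: the division by 2 is
# presumably meant to discount SMT/hyperthreading, but Apple Silicon cores are
# single-threaded, so the suggestion here is to skip the division on that platform.
function default_blas_threads(ncores_detected::Int)
    if Sys.isapple() && Sys.ARCH === :aarch64
        return ncores_detected              # e.g. 4 on an M1
    else
        return max(1, ncores_detected ÷ 2)
    end
end
```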
It would be great if someone could make a PR quickly; we probably want to get this into 1.8.
lu is slower by a factor of 2 on 1.8.0-rc3/M1 Macs: results from a 2020 MacBook Pro, comparing 1.8.0-rc3 and 1.7.2.