Non-ideal default number of BLAS threads on aarch64-apple-darwin #934

Closed
ctkelley opened this issue Jul 16, 2022 · 13 comments · Fixed by JuliaLang/julia#46085
Labels
performance Must go faster

Comments

@ctkelley

ctkelley commented Jul 16, 2022

lu! is slower by a factor of 2 on 1.8.0-rc3 on M1 Macs

Results from a 2020 MacBook Pro.

On 1.8.0-rc3

julia> A=rand(8192,8192);

julia> @btime lu!($A);
  4.087 s (2 allocations: 64.05 KiB)

and on 1.7.2

julia> A=rand(8192,8192);

julia> @btime lu!($A);
  1.929 s (2 allocations: 64.05 KiB)
@ctkelley ctkelley changed the title lu! slower in 1.8.0-rc3 lu! slower in 1.8.0-rc3: M1 Mac Jul 16, 2022
@ctkelley ctkelley changed the title lu! slower in 1.8.0-rc3: M1 Mac Performance regression: lu! slower in 1.8.0-rc3 on M1 Mac Jul 16, 2022
@ctkelley ctkelley changed the title Performance regression: lu! slower in 1.8.0-rc3 on M1 Mac Performance regression: lu! 2x slower in 1.8.0-rc3 on M1 Mac Jul 16, 2022
@ctkelley
Author

Seems that it's better if I set the number of BLAS threads to 4. Wasn't this done automatically before?

julia> A1=rand(8192,8192); A2=copy(A1);

julia> @btime lu!($A1);
  4.101 s (2 allocations: 64.05 KiB)

julia> using LinearAlgebra.BLAS

julia> BLAS.set_num_threads(4)

julia> @btime lu!(A2);
  2.251 s (3 allocations: 64.08 KiB)
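For anyone reproducing this, here is a minimal sketch of inspecting the BLAS thread count and overriding it temporarily. `with_blas_threads` is a hypothetical helper written for this example, not part of LinearAlgebra; it only wraps the `BLAS.get_num_threads` / `BLAS.set_num_threads` calls already used above.

```julia
using LinearAlgebra

# Show which BLAS backend is loaded and how many threads it currently uses.
println(BLAS.get_config())
println("BLAS threads: ", BLAS.get_num_threads())

# Hypothetical helper (not a LinearAlgebra API): run `f` with `n` BLAS
# threads, then restore whatever setting was active before.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

A = rand(512, 512)
F = with_blas_threads(() -> lu(A), 4)   # factor with 4 BLAS threads
```

Restoring the previous value in `finally` keeps one experiment from leaking its thread setting into the next measurement.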

@inkydragon
Member

inkydragon commented Jul 16, 2022

Confirmed in WSL2.
It looks like the default number of threads for 1.7/1.8 is not the same.

Note: use Random.seed! so that the test matrix is identical across runs.

using Random
using LinearAlgebra
using BenchmarkTools

Random.seed!(46071)
A = rand(8192, 8192);

BLAS.get_num_threads()

sum(A)
@btime lu!(copy(A));
@benchmark lu!(B)  setup=( B=copy($A) )
sum(A)

B = copy(A);
@btime lu!(B);
  • Version 1.7.3 (2022-05-06)
julia> BLAS.get_num_threads()
6

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.189 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.931 s …   2.026 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.948 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.969 s ± 50.677 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █         █                                             █
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.93 s         Histogram: frequency by time        2.03 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(B);
  1.897 s (3 allocations: 64.08 KiB)
  • Version 1.8.0-rc3 (2022-07-13)
julia> BLAS.get_num_threads()
3

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.891 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.754 s …   2.872 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.813 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.813 s ± 83.977 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.75 s         Histogram: frequency by time        2.87 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(B);
  2.636 s (3 allocations: 64.08 KiB)

Update:

  • Version 1.8.0-rc3 (2022-07-13) + 6 BLAS threads
julia> BLAS.get_num_threads()
3
julia> BLAS.set_num_threads(6)
julia> BLAS.get_num_threads()
6

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.037 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.898 s …    2.348 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.995 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.080 s ± 237.145 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           █                                            █
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.9 s          Histogram: frequency by time         2.35 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> B = copy(A);

julia> @btime lu!(B);
  1.706 s (3 allocations: 64.08 KiB)

@ViralBShah
Member

Choosing the default number of OpenBLAS threads is a challenge. We may need a special case for M1, given its split between performance and efficiency cores.

This was the latest update to that logic: JuliaLang/julia#45412

@ViralBShah ViralBShah added the performance Must go faster label Jul 16, 2022
@giordano
Contributor

giordano commented Jul 16, 2022

On M1 I get:

% julia-17 -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│   VERSION = v"1.7.0"
└   BLAS.get_num_threads() = 8
% julia -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│   VERSION = v"1.9.0-DEV.983"
└   BLAS.get_num_threads() = 2

The "right" number of threads should be 4, not 8 or 2.

@giordano giordano changed the title Performance regression: lu! 2x slower in 1.8.0-rc3 on M1 Mac Non-ideal default number of threads on aarch64-apple-darwin Jul 16, 2022
@giordano
Contributor

giordano commented Jul 16, 2022

Also, I can confirm I get identical performance between Julia v1.7 and master when using the same number of BLAS threads, so the only issue here is the default number of threads. I renamed the issue accordingly.

@giordano giordano changed the title Non-ideal default number of threads on aarch64-apple-darwin Non-ideal default number of BLAS threads on aarch64-apple-darwin Jul 16, 2022
@ctkelley
Author

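That seems to be it. I ran this experiment on 1.8.0-rc3:

using Random, LinearAlgebra, BenchmarkTools

Random.seed!(46071)
A = rand(8192, 8192);
A0 = copy(A)
# Time lu! with 2, 4, 6, and 8 BLAS threads.
for ith = 2:2:8
    A .= A0                      # restore the unfactored matrix
    BLAS.set_num_threads(ith)
    lutime = @belapsed lu!($A);
    println("threads = $ith; time = $lutime")
end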

and got

threads = 2; time = 4.09720e+00
threads = 4; time = 2.39923e+00
threads = 6; time = 2.26150e+00
threads = 8; time = 2.09293e+00

and the same results on 1.7.2.

Is 4 the right number? Is there a guarantee that if we ask for 4 threads they run on the performance cores?
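A variant of the sweep above, sketched with only the standard library: since `lu!` overwrites its argument, this version restores the unfactored matrix before every timed call, in the same spirit as the `setup=( B=copy($A) )` form used earlier in the thread. `sweep_lu_threads` is a hypothetical helper name, and taking the best of a few `@elapsed` samples is an assumption of this sketch rather than what BenchmarkTools does internally.

```julia
using LinearAlgebra, Random

# Hypothetical helper: best-of-`samples` wall time of lu! on an n×n random
# matrix for each BLAS thread count in `thread_counts`.
function sweep_lu_threads(n::Integer; thread_counts=2:2:8, samples::Integer=3)
    Random.seed!(46071)
    A0 = rand(n, n)
    A = similar(A0)
    old = BLAS.get_num_threads()
    results = Pair{Int,Float64}[]
    try
        for nt in thread_counts
            BLAS.set_num_threads(nt)
            best = Inf
            for _ in 1:samples
                copyto!(A, A0)          # restore the unfactored matrix (untimed)
                t = @elapsed lu!(A)     # time only the factorization
                best = min(best, t)     # the minimum discards compilation and noise
            end
            push!(results, nt => best)
        end
    finally
        BLAS.set_num_threads(old)       # leave the session as we found it
    end
    return results
end

for (nt, t) in sweep_lu_threads(1024)
    println("threads = $nt; time = $t")
end
```

Call `sweep_lu_threads(8192)` to match the matrix size used in this thread; the smaller default call above just keeps the example quick.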

@ViralBShah
Member

@chriselrod may know.

@chriselrod
Contributor

1 thread per performance core is optimal, which is what Julia detects: JuliaLang/julia#44072

Note that each performance core has only a single thread, so we're not getting twice the cores from Sys.CPU_THREADS like we are on most x86 systems.
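To see how these quantities relate on a given machine, a short sketch that prints them side by side. It uses only `Sys.CPU_THREADS`, `Sys.ARCH`, `Sys.KERNEL`, `Threads.nthreads()`, and `BLAS.get_num_threads()`; the values differ by platform and Julia version, so no expected output is shown.

```julia
using LinearAlgebra

# Platform identification (e.g. aarch64 / Darwin on an M1 Mac).
println("Sys.ARCH / Sys.KERNEL  = ", Sys.ARCH, " / ", Sys.KERNEL)

# What Julia detects as available CPU threads.
println("Sys.CPU_THREADS        = ", Sys.CPU_THREADS)

# Julia's own task threads, which are independent of the BLAS thread pool.
println("Threads.nthreads()     = ", Threads.nthreads())

# The default the BLAS library was started with.
println("BLAS.get_num_threads() = ", BLAS.get_num_threads())
```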

@ViralBShah
Member

If you pick 4 cores, are you guaranteed to get scheduled on the performance cores?

@gbaraldi
Member

You aren't guaranteed anything, unfortunately. But the scheduler is pretty good at moving the right operations to the right places.

@ViralBShah
Member

So basically, do we need to set to 4 OpenBLAS threads on M1? The PR I linked above is where we may need to specialize for M1.
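A rough sketch of what such a special case could look like. This is not the actual LinearAlgebra logic: `suggested_blas_threads` and `is_apple_silicon` are hypothetical names, and the halving rule for other platforms is an assumption inferred from the numbers reported above (4 detected threads giving 2 BLAS threads on M1).

```julia
# Hypothetical platform check: Apple hardware with an aarch64 CPU.
is_apple_silicon() = Sys.isapple() && Sys.ARCH === :aarch64

# Hypothetical default (an assumption, not Julia's implementation):
# - on Apple silicon, Sys.CPU_THREADS is described above as counting one
#   thread per performance core, so use all of them;
# - elsewhere, assume two hardware threads per core and use half.
function suggested_blas_threads(cpu_threads::Integer; apple_silicon::Bool)
    return apple_silicon ? max(1, cpu_threads) : max(1, cpu_threads ÷ 2)
end

n = suggested_blas_threads(Sys.CPU_THREADS; apple_silicon=is_apple_silicon())
println("suggested BLAS threads: ", n)
```

Passing the thread count and platform flag as arguments keeps the rule testable on any machine, including ones that are not M1 Macs.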

@giordano
Contributor

@ViralBShah
Member

It would be great if someone could open a PR quickly. We probably want to get this into 1.8.
