-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use max(1, Sys.CPU_THREADS)
BLAS threads for aarch64
.
#46085
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -582,7 +582,11 @@ function __init__() | |
Base.at_disable_library_threading(() -> BLAS.set_num_threads(1)) | ||
|
||
if !haskey(ENV, "OPENBLAS_NUM_THREADS") | ||
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2)) | ||
if Sys.ARCH === :aarch64 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure this is right. It's only on CPUs where we know There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Probably it's only/mainly Intel which does hyperthreading no? You're excluding 32-bit ARM, which I also doubt does hyperthreading There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. AMD Cpus also have it (it's called SMT for them). Power I think does it too. For someone running asahi for example this is probably wrong, same for someone running the new Intel ones. But it was wrong before and i don't think we figured out a way to query performance cores on linux. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Power does 4x or 8x, depending on the chip, meaning we may want to divide by 4 or 8 for Power? Is that detectable?
On asahi, would the 4big4L read as 8 threads, and 8big2L read as 10 threads?
I think this is a rather strong statement, in that 5 doesn't seem obviously better than 10 for 8 big core + 2 little core systems.
Seems unnecessary when using julia> foo() = Sys.ARCH === :aarch64
foo (generic function with 1 method)
julia> @code_typed foo()
CodeInfo(
1 ─ return false
) => Bool but I can use
Do you think Asahi linux is more common than Raspberry Pi 4, Graviton, Ampere, A64FX? Not sure about Nvidea's products. So it's possible, but I do not know the market shares here. There are Intel CPUs without hyperthreading, and you can also turn it off in the bios for CPUs with it (this is reasonably common in HPC environments). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alder lake is really annoying because the E cores don't have SMT but the P cores do. Ideally we should always find the number of P cores and set NUM_THREADS to that. A quick test shows that, at least for BLAS, adding the E cores makes it quite a bit slower. versioninfo()
Julia Version 1.9.0-DEV.688
Commit 3eaed8b54c (2022-05-31 15:31 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × 12th Gen Intel(R) Core(TM) i5-12600K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, alderlake)
Threads: 6 on 16 virtual cores
Environment:
JULIA_NUM_THREADS = 6
JULIA_NUM_PRECOMPILE_TASKS = 6
julia> using LinearAlgebra
A1=rand(8192,8192); A2=copy(A1);
julia> @btime lu!($A1);
1.022 s (2 allocations: 64.05 KiB)
julia> BLAS.set_num_threads(10)
julia> @btime lu!($A2);
1.847 s (2 allocations: 64.05 KiB)
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about with I get the argument for mixing cores optimized for absolute performance with cores optimized for performance/area, but with it hurting benchmarks, so far that sounds more like marketing than reality. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see you also set There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I set that because in macos there were some issues with the E cores and precompiling (it made the sound crackle). With MKL and 6 cores it didn't change much, but with 10 it didn't get worse interestingly. So mkl might do something not to use them There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That seems likely. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, when I said "Intel" above I was referring to x86-64 in general, not chips made by Intel specifically. Anyway, since it was said that most ARM CPUs (and not just Aaarch64, and not just on macOS) will typically not have multithreading, then the condition should change to Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) ∈ ("aarch64", "armv6l", "armv7l") ? It remains the already mentioned problem of Asahi Linux, where we don't set correctly the total number of CPUs, but that's a more general problem of detecting the efficiency/power cores, and in any case this is only a default value of threads, which can be always changed (and people strongly caring about the precise number of threads to be used on their machine should set it manually to be safe against changes of defaults in any case). |
||
BLAS.set_num_threads(max(1, Sys.CPU_THREADS)) | ||
else | ||
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2)) | ||
end | ||
end | ||
end | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, this should look at
GOTO_NUM_THREADS
ANDOMP_NUM_THREADS
as well: https://github.com/xianyi/OpenBLAS/blob/c43ec53bdd00d9423fc609d7b7ecb35e7bf41b85/README.md#setting-the-number-of-threads-using-environment-variablesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GOTO_NUM_THREADS comes from the original GOTOBLAS library. Nobody should be using it!
For OMP, I thought that only applies if you build with OpenMP for threading - which we don't. I think we should only look at OPENBLAS_NUM_THREADS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a matter of fact OpenBLAS respects those variables. If you set them you can see with BLAS.get_num_threads that OpenBLAS changes the setting accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup it does. I was just saying that we may only want to use one of them and only document that. However, I don't have a strong opinion if we want to use the other ones - but what should we do if more than one of those environment variables are present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any of those variables is set, we shouldn't override it. See #46118 🙂