
Dynamic option for blas_set_num_threads() #213

Closed

hiccup7 opened this issue May 13, 2015 · 45 comments
Labels
performance: Must go faster
speculative: Whether the change will be implemented is speculative
upstream: The issue is with an upstream dependency, e.g. LLVM

Comments

@hiccup7

hiccup7 commented May 13, 2015

I have been using Python with MKL BLAS, which defaults to dynamically setting the number of threads used by each BLAS function. See:
https://software.intel.com/en-us/node/528546
https://software.intel.com/en-us/node/528547

In Julia, I would expect comparable speed from OpenBLAS by calling blas_set_num_threads(CPU_CORES), but I would get a cold slap in the face. I learned that OpenBLAS will automatically use one thread for small arrays, but otherwise use exactly the number of threads specified by blas_set_num_threads(). In other words, my understanding is that OpenBLAS doesn't have MKL BLAS' dynamic option.

As an example, on my Haswell CPU, where Julia reports CPU_CORES as 8, OpenBLAS' dgemv() function (from the develop branch) runs fastest with blas_set_num_threads(2). It is not practical or realistic for me to put blas_set_num_threads() before each (hidden) call to a BLAS function.
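
For reference, here is roughly how such a measurement can be made (a sketch only, using the 0.4-era names blas_set_num_threads and Base.LinAlg.BLAS.gemv!, which were later renamed; the matrix size and thread counts are just examples):

# Timing sweep: dgemv at one problem size for several thread counts.
A = rand(4000, 4000); x = rand(4000); y = similar(x)
for nt in (1, 2, 4, 8)
    blas_set_num_threads(nt)
    Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y)          # warm up
    t = @elapsed for i in 1:100
        Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y)
    end
    println("threads = $nt: ", t / 100, " s per dgemv")
end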

For an Intel CPU with hyperthreading, better OpenBLAS performance would come from using the number of physical cores instead of logical cores. For example, blas_set_num_threads(CPU_CORES >> 1). For portability, I suggest that Julia include a new constant: CPU_PHYSICAL_CORES.

I realize that there is already an effort to provide MKL BLAS as a build and shipping option for Julia (JuliaLang/julia#10969). Assuming that OpenBLAS won't go away, however, it would be helpful if Julia provided a layer of abstraction to make OpenBLAS as performant and easy to use as MKL BLAS, and to make Julia code portable between builds with either one.

Specifically, I suggest that blas_set_num_threads(-1) causes Julia to use a dynamic number of threads. When built with MKL BLAS, this would cause MKL BLAS to effectively act like MKL_DYNAMIC is True. When built with OpenBLAS, a lookup table for each BLAS function would determine the maximum number of threads to use. There would be a different lookup table for each CPU architecture, such as Intel Haswell or Intel SandyBridge. Preferably, the lookup table would be in source code so that it could be tuned by each user. I would expect a lot of pull requests by the community for several months on the lookup tables for various CPUs, but I believe they would appreciate the speedup.
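
To make the shape of the proposal concrete, here is a sketch of what one lookup table and wrapper could look like (all names here are hypothetical, not an existing API, and the size thresholds are made up):

# Hypothetical Haswell table: (problem-size limit, max threads) pairs.
const HASWELL_DGEMV_THREADS = [(500, 1), (4000, 2), (typemax(Int), 4)]

function dynamic_num_threads(table, n)
    for (limit, nthreads) in table
        n <= limit && return nthreads
    end
    return CPU_CORES
end

# With blas_set_num_threads(-1) in effect, each BLAS wrapper would
# consult its table before calling into the library:
function dyn_dgemv!(y, A, x)
    blas_set_num_threads(dynamic_num_threads(HASWELL_DGEMV_THREADS, size(A, 1)))
    Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y)
end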

@sturlamolden

The analysis here is wrong. You generally get best performance by using the number of logical cores. That is what MKL does as well.

@sturlamolden

See here what happens when we go from 4 threads (number of physical cores) to 8 threads (number of logical cores):

http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/12464039/kdtree-jan21-simd.ipynb

Why? Because if you only have half as many threads as logical cores, each thread has a 50% chance of being scheduled to an idle core.

@ihnorton added the performance and speculative labels May 13, 2015
@stevengj
Member

cc: @xianyi ... it really seems like this should be filed as an OpenBLAS issue.

@tknopp

tknopp commented May 13, 2015

Yes, I also do not see why this is a Julia issue. This would also be a very local optimization that would certainly fail in situations where thread pools are fighting with each other.

@ViralBShah added the upstream label May 14, 2015
@tkelman

tkelman commented May 14, 2015

related: JuliaLang/julia#5991

@ScottPJones

I would also be very concerned about this being done as a fixed amount, or with only local knowledge.
I have seen other issues with overloading a machine using @async. If you have hundreds of Julia processes running, and they all set the number of threads to the number of virtual processors, I think you'll get worse performance than simply using 1 thread per process.

@hiccup7
Author

hiccup7 commented May 14, 2015

While MKL BLAS only needs to optimize the number of threads for Intel CPUs, OpenBLAS also supports AMD, MIPS and ARM CPUs. It is a big task for OpenBLAS to implement a dynamic option across all the CPU architectures (and all the BLAS functions) it supports. Also, OpenBLAS has a very small team of volunteer contributors. Therefore, OpenBLAS has wisely chosen to push the responsibility for optimizing the number of threads up to the parent language.

BLAS functions dominate my speed experience of a scientific programming language. If Julia starts supporting MKL BLAS on Windows, then, for me, Julia begins to be competitive with Python, which already has free distributions with MKL BLAS. But if Julia remains OpenBLAS-only on Windows, and the Julia team points the finger at OpenBLAS for the lack of a dynamic threading option, then the Julia language becomes a non-competitive no-go for me. For Julia users without an Intel CPU, an OpenBLAS dynamic threading option is the only way I see to make Julia both speed-competitive and usable as a scientific language.

My opening post is an attempt to provide a technical solution, taking into account the large number of Julia contributors and users versus the tiny team of OpenBLAS contributors. My suggestion of using a Julia source file is designed to empower all Julia users. Compare this to a solution that is only testable by contributors with the ability to build OpenBLAS from source. In other words, I propose that we are more likely to succeed by leveraging a large number of unskilled users than by relying on a tiny team of highly skilled developers (who don't have time for this big task).

For those who point out the possibility of the dynamic option providing non-optimum performance in some cases, I would stress that this is a proposal for an option. I am in favor of still providing the present means of specifying the number of threads used by OpenBLAS.

@tkelman

tkelman commented May 14, 2015

@hiccup7 Julia is also a volunteer project with a relatively small number of contributors. The suggestions are fine, but the changes you want to see are much more likely to happen sooner if you start rolling up your sleeves and contributing code. You don't need to write pages of arguments to convince us that this would be a good idea, you need to make our lives easier by doing some of the work yourself. We'd rather review pull requests than read long issue threads that state facts we are already well aware of.

But if Julia remains OpenBLAS-only on Windows

As I've already said multiple times, it isn't. You can use any library you have access to via ccall. Until you can resolve licensing questions and help get a binary build of Julia that uses MKL to pass its tests (and as I've said elsewhere, this is an especially difficult task on Windows), you can do this today, and it would be a much more productive way to spend your time.

@tknopp

tknopp commented May 14, 2015

Furthermore, I don't see the actual issue. One can simply compile Julia with MKL and use it. Why is it so important that Julia comes distributed with MKL? We very rarely see discussions where BLAS performance has been an issue for a user.

@nalimilan
Member

@hiccup7 You seem to start with the premise that it would be easier to write this in Julia than in OpenBLAS, which I don't think is really true. Indeed, the OpenBLAS developers are a small team of highly skilled people, but implementing the kind of crude performance tuning that you describe should not require a deep understanding of the hand-optimized kernels for all architectures. Somebody able to implement this in Julia would probably be able to do it in OpenBLAS too.

@hiccup7
Author

hiccup7 commented May 15, 2015

@tkelman, using ccall to access MKL BLAS functions instead of having Julia use MKL BLAS under the hood for linear algebra functions causes Julia to no longer be a high-level language. I wouldn't take a backwards step from Python in ease of code development and readability.

@tknopp, BLAS performance dominates the language performance for DSP code like mine. Also, compare the price of $29 for MKL with the Anaconda distribution of Python to $1199 for an MKL license from Intel for C++ and FORTRAN to support Julia builds. For hobby Windows and OSX users, the choice is clear!

@nalimilan , Compare the number of persons with the skill and time to change pure-Julia source files (all Julia users have the skill to use a text editor) to those with the skill, time and disk space to build OpenBLAS from source. I am trying to improve the convenience and odds of community support.

@tkelman

tkelman commented May 15, 2015

Again, look at JuliaLang/julia#5991. Having a convenient method of collecting, and a place to store, performance data for various numbers of threads, problem sizes, and processors would be great. To my knowledge no such centralized database exists today. This is not a Julia-specific issue though; it cuts across practically all of scientific computing. And the data that gets collected will be very specific to the BLAS implementation being used, even the exact version of that implementation, since they're under constant development. The right place to make use of this data is at the BLAS implementation level. We can certainly collect it from high-level languages though, as long as the binding overhead is low, or at least well-quantified.

using ccall to access MKL BLAS functions instead of having Julia use MKL BLAS under the hood for linear algebra functions causes Julia to no longer be a high-level language.

I really don't mean to be rude here, but this is ridiculous. Take a look at linalg/blas.jl - it's not even very complicated code. You can easily make a package and wrap up any alternate BLAS implementation in a more user-friendly API so your application code doesn't have to use ccall directly. We don't currently have the infrastructure to easily plug in a package's implementation of BLAS in place of base Julia's for the very highest-level operations like A * B, that functionality needs to be designed and implemented first. But if someone works on it I'm sure it could be done. We'd welcome the help if you want to start trying. You could for example write @mkl macros to redirect linear algebra operations to call a different implementation, as one idea.
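
For instance, a minimal wrapper for one routine might look something like this (a sketch only: the library name libmkl_rt and the 32-bit LP64 integer width are assumptions, and an ILP64 build would need 64-bit integers instead):

# dgemv from an external BLAS via ccall (0.4-era & pass-by-reference syntax).
const libmkl = "libmkl_rt"   # assumed to be on the library search path

function mkl_dgemv!(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = Int32(size(A, 1)), Int32(size(A, 2))
    trans = 'N'; alpha = 1.0; beta = 0.0; inc = Int32(1)
    ccall((:dgemv_, libmkl), Void,
          (Ptr{UInt8}, Ptr{Int32}, Ptr{Int32}, Ptr{Float64}, Ptr{Float64},
           Ptr{Int32}, Ptr{Float64}, Ptr{Int32}, Ptr{Float64}, Ptr{Float64},
           Ptr{Int32}),
          &trans, &m, &n, &alpha, A, &m, x, &inc, &beta, y, &inc)
    return y
end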

I wouldn't take a backwards step from Python in ease of code development and readability.

Then don't? Or, help us improve matters. You paid $0 for Julia, remember.

@ViralBShah
Member

I would like to say that we want to do all these things, and will eventually get there. The faster way to get there is with a PR.

@tknopp

tknopp commented May 15, 2015

This is ridiculous. Why should anybody on this open source project care how much users have to pay for a commercial product from Intel? This project is not about free as in free beer.

In 99% of cases, the reason people use Julia is not the performance of A*B, but that the code which cannot be written as A*B is fast.

@tknopp

tknopp commented May 15, 2015

@ViralBShah: I think it really needs a distinction between Julia, the open source project and some Julia distribution that ships with commercial binary blobs.

@ViralBShah
Member

I believe the distinction is clear. How do you propose making it clear further?

@nalimilan
Member

@nalimilan , Compare the number of persons with the skill and time to change pure-Julia source files (all Julia users have the skill to use a text editor) to those with the skill, time and disk space to build OpenBLAS from source. I am trying to improve the convenience and odds of community support.

@hiccup7 Maybe you don't realize it, but all people building Julia from source (which is a prerequisite for making serious changes to its code base) already build OpenBLAS without doing anything special.

@tknopp

tknopp commented May 15, 2015

@ViralBShah: Well, is it really a goal of the Julia project to ship with MKL? JuliaLang/julia#10969. I would expect this to be part of an independent distribution. Look at Python: it does not ship with MKL either. Or the Linux kernel: does it come with NVIDIA graphics drivers? No. Instead there is a plugin interface that makes these things possible, and in Julia that interface is the packages. Instead of shipping with MKL, it would IMHO be better to first do the "default package" refactoring; after that it will be possible to have different distributions of Julia. You could then also "upgrade" your Julia with MKL by simply installing the MKL package.

@ScottPJones

👍 @tknopp I'd love to help get a lightweight Julia. IMO the first thing that has to go is anything GPLed... (I know that it is now an option to not include them, but that should be the default).

@ViralBShah
Member

The goal of the Julia project, IMO, is to provide the best user experience to users using Julia. Of course, we are unlikely to have the default Julia distribution depend on anything that is not open source, for many reasons that all of us are familiar with.

We can certainly have a secondary distribution of Julia+MKL, assuming someone leads the charge and makes it happen. JuliaLang/julia#10969 is about having alternate binaries, not replacing OpenBLAS with MKL.

@ViralBShah
Member

We are digressing from the original thread here. Let's keep the discussion restricted to OpenBLAS and threading. I request that all other discussion move elsewhere - there are issues and mailing list threads on all these topics.

@hiccup7
Author

hiccup7 commented May 15, 2015

With Python, my user source code is the same whether my distribution uses MKL or some other BLAS library. I would like the same feature in Julia. Given my large investment of time in past and future source code production, source code portability between platforms and BLAS libraries is part of what would attract me to move from MATLAB and Python to Julia.

The traditional division of labor is based on code boundaries with local optimization. But I observe that OpenBLAS has 3 contributors who do 95% of the commits. I am attempting out-of-the-box thinking to work around this limitation in tackling a complex problem involving multiple chip vendors. This is similar to the problem OpenCL has in trying to enter a market dominated by NVIDIA's CUDA for GPUs: OpenCL has official support (but little effort) from all GPU manufacturers, yet its performance and market share remain significantly worse.

Julia could solve the BLAS optimization problem by using vendor-tuned BLAS libraries as an option alongside OpenBLAS. In other words, tell ARM and AMD that Julia already has a free license from Intel for MKL BLAS, and ask them if they want to be left behind in the competition.

Another way is to approach ARM and AMD about contributing to OpenBLAS. This is like how the whole industry is cooperating on LibreOffice to compete against the current market leader, Microsoft Office. And they are succeeding. ARM and AMD could each put a full-time engineer on OpenBLAS to achieve the thread tuning and, of course, new vendor-specific kernels for their employer.

Julia team members may not be aware of corporate involvement in current open-source projects. Continuum Analytics, who puts out the Anaconda distribution of Python, has partnered with Intel, Microsoft and Red Hat. Of course there is Google's involvement in Linux. And the list goes on. If Julia becomes faster than Python, it will attract corporate attention. Julia can leverage these corporations' desire not to be left behind in the competition. This will get the corporations to donate software and add contributors to Julia and/or OpenBLAS. After all, these corporations are the ones who will monetize the results.

@tkelman

tkelman commented May 15, 2015

@hiccup7 Yes. Please help fix it. Lines of code are more valuable than paragraphs of issue comments.

@ScottPJones

@hiccup7 Yep, fixing things is the only way to get some respect around here! 😀

@timholy
Member

timholy commented May 16, 2015

As I'm sure @ScottPJones knows (given the smiley) but worth making clear to newcomers: it's less nefarious than that. It's just that everyone who is contributing is already busy with their own priorities. So writing volumes about what should be done is usually just a waste of time, typically not because people disagree but simply because there's no one available to implement the agenda.

@ScottPJones

Yes, once you start actually fixing problems, instead of complaining, then everybody is incredibly helpful, even to an old PITA like me! 😀

@tkelman

tkelman commented May 16, 2015

Seriously. We love contributors who know what they're talking about; it's important that we try to get more of them. I apologize if I'm coming across as hostile in any way, that's not my intention, I'm just recommending a more productive way to direct your energies.

edit: we're happy to provide direction and point out areas where we could use help, and if you are unsure about an idea and would like to get feedback about it we can definitely do so. So far we agree with what you're saying but don't have the labor/time to do much about it right away without help. If you want to create a BlasTune.jl package to start experimenting with ideas here, we'd all be in support of that.
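
For example, a first experiment in such a package could simply sweep thread counts per problem size and record the winner (a sketch with hypothetical names, using the 0.4-era blas_set_num_threads API):

# Find the fastest dgemv thread count for each problem size.
function tune_dgemv(sizes, thread_counts)
    best = Dict{Int,Int}()
    for n in sizes
        A = rand(n, n); x = rand(n); y = similar(x)
        times = Float64[]
        for nt in thread_counts
            blas_set_num_threads(nt)
            Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y)   # warm up
            push!(times, @elapsed Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 0.0, y))
        end
        best[n] = thread_counts[indmin(times)]
    end
    return best
end

# e.g. tune_dgemv((100, 1000, 4000), (1, 2, 4, 8))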

@ViralBShah
Member

+1 to @tkelman 's comment above.

@ViralBShah
Member

@hiccup7 You have already contributed to Yeppp.jl - so you know the drill to a large extent. This one is a bit more daunting, but I fully suspect that you can do it pretty quickly, and we will all be happy to help and will be grateful for the contribution. :-)

@hiccup7
Author

hiccup7 commented May 18, 2015

This discussion has been helpful for me to understand what to look for in an HPC language for DSP. The OpenBLAS and Julia teams are my heroes for creating open-source software to compassionately break users free from the shackles of corporations, but I see that multi-chip-vendor dynamic threading control is too big a task for (hobbyist) me or these volunteer organizations to take on in the near future. This clarified for me that I should focus on languages that offer integration with vendor-tuned BLAS libraries. Since I am only aware of Intel offering a vendor-tuned BLAS library, I will continue buying machines with Intel CPUs and using languages built with MKL BLAS.

Julia has a beautiful syntax that I like better than Python's. Julia's syntax has the potential to support faster execution and code creation than Python. Once Julia makes the following changes, I plan to switch from Python/Anaconda to Julia:

  1. Shipping with MKL BLAS
  2. LLVM 3.6 for Intel Haswell SIMD support
  3. Working multi-threading within inner loops
  4. Source-level debugger comparable to Python's Spyder

I am thankful to the Julia team for teaching me about secondary methods to speed up code execution, such as a macro preprocessor, multiple dispatch and pre-allocation of output arrays. What is within my reach as a hobbyist developer is to use these concepts in Python. My main area of interest is applying DSP rather than creating another language for implementing DSP.
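
As an example of the pre-allocation concept (a minimal sketch using the 0.4-era in-place name A_mul_B!, which was later renamed LinearAlgebra.mul!):

A = rand(1000, 1000)
x = rand(1000)
y = Array(Float64, 1000)   # output buffer allocated once
for i in 1:10000
    A_mul_B!(y, A, x)      # writes into y; no per-call allocation
end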

@timholy
Member

timholy commented May 18, 2015

You can build Julia with MKL yourself, it's just not shipping that way.

@tknopp

tknopp commented May 18, 2015

@timholy: He wants to avoid paying for MKL, which could be circumvented if an organization bought a license. And honestly, I am not sure we need discussions like "if you don't support this I'll stay with Python" here. Why should anybody switch if he/she is happy with Python?

@timholy
Member

timholy commented May 18, 2015

Thanks for clarifying.

@tkelman

tkelman commented May 18, 2015

You can build Julia with MKL yourself, it's just not shipping that way.

I don't think the build system will handle this properly on Windows until someone starts trying it and patching away at all of the gfortran assumptions. MKL and ifort are not gfortran-compatible on Windows the way they are on Unix platforms.

@tkelman

tkelman commented May 18, 2015

I am only aware of Intel offering a vendor-tuned BLAS library

Pretty much every chip vendor does this. AMD has ACML, IBM has ESSL, Nvidia has NVBLAS, etc etc. Intel's MKL just happens to be much better-known and more widely used, and has more development resources put behind it than most of the others. I'm pretty sure OpenBLAS is faster than ACML and ESSL these days.

@milktrader

You can also start using LLVM 3.6

julia> versioninfo()
Julia Version 0.4.0-dev+4335
Commit 2c9633e* (2015-04-18 15:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.6.0

@tkelman

tkelman commented May 18, 2015

Subject to JuliaLang/julia#11083 and JuliaLang/julia#11085 on Windows, and a rather time-consuming source build process. If anyone ever wants custom unofficial nightly binaries with some unusual configuration, ask and we can probably make that happen.

@ViralBShah
Member

@tknopp I don't think switching is the right word. People will add Julia to the list of tools they use to get things done. @hiccup7 Julia does not yet provide what the Anaconda Python distribution offers. Hopefully we will have a solution for this in the future - whether it is a separate Julia+MKL distribution, or dynamic tuning of threads in OpenBLAS, or in Julia. We all know this is needed.

I do agree that this is not the place to discuss "I won't use Julia unless...". Those words come across as a bit negative and are not encouraging. More general discussion needs to be on the mailing list, and issues should really be about solutions and code.

@hiccup7
Author

hiccup7 commented May 19, 2015

The reason I gave a short list of reasons why Anaconda Python provides faster execution and code development for me than Julia was to provide specific constructive feedback. I didn't want to mislead the Julia team into thinking that dynamic BLAS threading control was the only thing holding me back from using Julia. I am sensitive to how specific feedback makes the goal seem attainable and encourages the Julia team (whom I do support). I also ask the Julia team to be sensitive to Python users who go through a lot of effort to test Julia and find it slower, despite the deceptive benchmarks and the "best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing" claim on the home page.

@ihnorton
Member

Thanks for the feedback. There seem to be existing issues covering all of the things mentioned in this thread, so closing in favor of those (#5991 for dynamic BLAS threads, several MKL issues, the debugger issue, the "fixed in LLVM 3.6" tag, etc.)

@tkelman

tkelman commented May 19, 2015

best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing

I believe the words "open source" are also included somewhere near that sentence...

@tkelman

tkelman commented May 19, 2015

sensitive to Python users who go through a lot of effort to test Julia and find it slower despite the deceptive benchmarks

If it was really a lot of effort to test things on an evaluation basis (download, install, try?), and there are actionable things we can do to mitigate whatever (as-yet-unspecified) trouble you had, please let us know.

On the benchmarks, no deception has ever been intended - if your application is entirely dominated by BLAS calls and Python distribution X (or R distribution Y, or Matlab, etc.) with MKL is faster for you than Julia with OpenBLAS, no one here would tell you that picking Julia over Python would be your best choice. The intent of the benchmarks is to show that custom code implemented in Julia, using loops and recursion etc., can be fast, which is not the case in Python.

@ViralBShah
Member

I added "open source" to "best of breed" on the website. There is no deception intended - and there are benchmarks for BLAS matrix multiply as well. Yes, you can always download proprietary BLAS libraries and get higher performance, either in Julia or in Python. Currently the effort to use MKL is higher with Julia, and we hope to resolve that over time. Your feedback has been valuable and has been heard!

@lionpeloux

Dear all,

I might not be at the right place to post my question, and I apologize for that... but at this point, this seems to be the only discussion I've ever found on the net about accessing MKL from Julia with ccall.

I've been struggling to do that on my mac and I've posted for help here : https://groups.google.com/forum/#!topic/julia-users/pairCLeym0g

If any one of you could post a small example or give a clue on this thread, it would be very useful (for me of course, but also for the many others who will probably attempt to do the same).

Any news about Julia being shipped with MKL would be appreciated too.

Thanks again,
Lionel

@KristofferC
Member

I gave an answer on the mailing list that might help.

@KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
This issue was closed.