Dynamic option for blas_set_num_threads() #213
The analysis here is wrong. You generally get best performance by using the number of logical cores. That is what MKL does as well.
See here what happens when we go from 4 threads (number of physical cores) to 8 threads (number of logical cores): http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/12464039/kdtree-jan21-simd.ipynb Why? Because if you only have half as many threads as logical cores, each thread has a 50% chance of being scheduled to an idle core.
cc: @xianyi ... it really seems like this should be filed as an OpenBLAS issue.
Yes, I also do not see why this is a Julia issue. This would also be a very local optimization that would certainly fail in situations where thread pools are fighting with each other.
Related: JuliaLang/julia#5991
I also would be very concerned about this being done as a fixed amount, or with only local knowledge.
While MKL BLAS only needs to optimize the number of threads for Intel CPUs, OpenBLAS also supports AMD, MIPS and ARM CPUs. It is a big task for OpenBLAS to implement a dynamic option with all the CPU architectures (and all the BLAS functions) it supports. Also, OpenBLAS has a very small team of volunteer contributors. Therefore, OpenBLAS has wisely chosen to push the responsibility of optimizing the number of threads up to the parent language.

BLAS functions dominate my speed experience of a scientific programming language. If Julia starts supporting MKL BLAS on Windows, then, for me, Julia begins to be competitive with Python, which already has free distributions with MKL BLAS. But if Julia remains OpenBLAS-only on Windows, and the Julia team points the finger at OpenBLAS for the lack of a dynamic threading option, then the Julia language becomes a non-competitive no-go for me. For Julia users without an Intel CPU, an OpenBLAS dynamic threading option is the only way I see to make the Julia scientific language both speed-competitive and usable.

My opening post is an attempt to provide a technical solution, taking into account the large number of Julia contributors and users vs. the tiny team of OpenBLAS contributors. My suggestion of using a Julia source file is designed to empower all Julia users. Compare this to a solution that is only testable by contributors with the ability to build OpenBLAS from source. In other words, I propose that we are more likely to get success by leveraging a large number of unskilled users than by relying on a tiny team of highly-skilled developers (who don't have time for this big task). For those who point out the possibility of the dynamic option providing non-optimum performance in some cases, I would stress that this is a proposal for an option. I am in favor of still providing the present means of specifying the number of threads used by OpenBLAS.
@hiccup7 Julia is also a volunteer project with a relatively small number of contributors. The suggestions are fine, but the changes you want to see are much more likely to happen sooner if you start rolling up your sleeves and contributing code. You don't need to write pages of arguments to convince us that this would be a good idea, you need to make our lives easier by doing some of the work yourself. We'd rather review pull requests than read long issue threads that state facts we are already well aware of.
As I've already said multiple times, it isn't. You can use any library you have access to via `ccall`.
Furthermore, I don't see the actual issue. One can simply compile Julia with MKL and use it. Why is it so important that Julia comes distributed with MKL? We very rarely see discussions where BLAS performance has been an issue for a user.
@hiccup7 You seem to start with the premise that it would be easier to write this in Julia than in OpenBLAS, which I don't think is really true. Indeed, OpenBLAS developers are a small team of highly skilled people, but implementing the kind of crude performance tuning that you describe should not require a deep understanding of the hand-optimized kernels for all architectures. Somebody able to implement this in Julia would probably be able to do this in OpenBLAS too.
@tkelman , @tknopp : BLAS performance dominates the language performance for DSP code like mine. Also, compare the price of $29 for MKL with the Anaconda distribution of Python to $1199 for an MKL license from Intel for C++ and Fortran to support Julia builds. For hobby Windows and OSX users, the choice is clear! @nalimilan : compare the number of persons with the skill and time to change pure-Julia source files (all Julia users have the skill to use a text editor) to those with the skill, time and disk space to build OpenBLAS from source. I am trying to improve the convenience and odds of community support.
Again, look at JuliaLang/julia#5991. Having a convenient method of collecting, and a place to store, performance data for various numbers of threads, problem sizes, and processors would be great. To my knowledge no such centralized database exists today. This is not a Julia-specific issue though; it cuts across practically all of scientific computing. And the data that gets collected will be very specific to the BLAS implementation being used, even the exact version of that implementation, since they're under constant development. The right place to make use of this data is at the BLAS implementation level. We can certainly collect it from high-level languages though, as long as the binding overhead is low, or at least well-quantified.
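A minimal sketch of that kind of data collection, in 2015-era Julia (`blas_set_num_threads`, `A_mul_B!`, and `@elapsed` were in Base at the time; the size and thread sweeps below are arbitrary placeholders, not a benchmarking methodology):

```julia
# Sweep matrix sizes and thread counts for dgemm-style timing data.
# Sweep ranges are arbitrary; real tuning would need many repetitions.
results = Dict()
for n in (128, 512, 1024, 2048)
    A = rand(n, n); B = rand(n, n); C = zeros(n, n)
    for t in (1, 2, 4, 8)
        blas_set_num_threads(t)
        A_mul_B!(C, A, B)                      # warm-up call
        results[(n, t)] = @elapsed A_mul_B!(C, A, B)
        println("n=$n threads=$t time=$(results[(n, t)]) s")
    end
end
```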
I really don't mean to be rude here, but this is ridiculous. Take a look at
Then don't? Or, help us improve matters. You paid $0 for Julia, remember.
I would like to say that we want to do all these things, and will eventually get there. The faster way to get there is with a PR.
This is ridiculous. Why should anybody in this open source project care how much users have to pay for a commercial product from Intel? This project is not about MKL. The reason people use Julia is, in 99% of cases, not the performance of BLAS.
@ViralBShah: I think it really needs a distinction between Julia, the open source project, and some Julia distribution that ships with commercial binary blobs.
I believe the distinction is clear. How do you propose making it clearer?
@hiccup7 Maybe you don't realize it, but all people building Julia from source (which is a prerequisite for making serious changes to its code base) already build OpenBLAS without doing anything special.
@ViralBShah: Well, is it really a goal of the Julia project to ship with MKL? See JuliaLang/julia#10969. I would expect this to be part of an independent distribution. Look at Python: it also does not ship with MKL. Or the Linux kernel: does it come with NVIDIA graphics drivers? No. Instead there is a plugin interface for making it possible to do these things. And in Julia these are the packages. Instead of shipping with MKL, it would IMHO be better to first do the "default package" refactoring; after that it will be possible to have different distributions of Julia. But you can then also "upgrade" your Julia with MKL by simply installing the MKL package.
👍 @tknopp I'd love to help get a light-weight Julia. IMO the first thing that has to go is anything GPLed... (I know that it is now an option to not include them, but that should be the default).
The goal of the Julia project, IMO, is to provide the best user experience to users using Julia. Of course, we are unlikely to have the default Julia distribution depend on anything that is not open source, for many reasons that all of us are familiar with. We can certainly have a secondary distribution of Julia+MKL, assuming someone leads the charge and makes it happen. JuliaLang/julia#10969 is about having alternate binaries, not replacing OpenBLAS with MKL.
We are digressing from the original thread here. Let's keep the discussion restricted to OpenBLAS and threading. I request that all other discussion move elsewhere; there are issues and mailing list threads on all these topics.
With Python, my user source code is the same whether my distribution uses MKL or some other BLAS library. With Julia, I would like the same feature. Given my large investment of time in past and future source code production, source code portability between platforms and BLAS libraries is part of what would attract me to move from MATLAB and Python to Julia.

The traditional division of labor is based on code boundaries with local optimization. But I observe that OpenBLAS has 3 contributors who do 95% of the commits. I am attempting some out-of-the-box thinking to work around this limitation in tackling a complex problem involving multiple chip vendors. This is similar to the problem OpenCL has in trying to enter a market dominated by NVIDIA's CUDA for GPUs. OpenCL has official support (but little effort) from all GPU manufacturers, yet its performance and market share remain significantly worse.

Julia could solve the BLAS optimization problem by using vendor-tuned BLAS libraries as an option to OpenBLAS. In other words, tell ARM and AMD that Julia already has a free license from Intel for MKL BLAS, and ask them if they want to be left behind in the competition. Another way is to approach ARM and AMD about contributing to OpenBLAS. It's like how the whole industry is cooperating on LibreOffice to compete against the current market leader, Microsoft Office. And they are succeeding. ARM and AMD could each put a full-time engineer on OpenBLAS to achieve the thread tuning and, of course, new vendor-specific kernels for their employer.

Julia team members may not be aware of corporate involvement in current open-source projects. Continuum Analytics, who puts out the Anaconda distribution of Python, has partnered with Intel, Microsoft and Red Hat. Of course there is Google's involvement in Linux. And the list goes on. If Julia wants to be faster than Python, it will attract corporate attention. Julia can leverage these corporations' desire not to be left behind in the competition. This will get the corporations to donate software and add contributors to Julia and/or OpenBLAS. After all, these corporations are the ones who will monetize the results.
@hiccup7 Yes. Please help fix it. Lines of code are more valuable than paragraphs of issue comments.
@hiccup7 Yep, fixing things is the only way to get some respect around here! 😀
As I'm sure @ScottPJones knows (given the smiley) but worth making clear to newcomers: it's less nefarious than that. It's just that everyone who is contributing is already busy with their own priorities. So writing volumes about what should be done is usually just a waste of time, typically not because people disagree but simply because there's no one available to implement the agenda. |
Yes, once you start actually fixing problems, instead of complaining, then everybody is incredibly helpful, even to an old PITA like me! 😀
Seriously. We love contributors who know what they're talking about; it's important that we try to get more of them. I apologize if I'm coming across as hostile in any way, that's not my intention; I'm just recommending a more productive way to direct your energies. edit: we're happy to provide direction and point out areas where we could use help, and if you are unsure about an idea and would like to get feedback about it we can definitely do so. So far we agree with what you're saying but don't have the labor/time to do much about it right away without help. If you want to create a BlasTune.jl package to start experimenting with ideas here, we'd all be in support of that.
+1 to @tkelman 's comment above.
@hiccup7 You have already contributed to Yeppp.jl, so you know the drill to a large extent. This one is a bit more daunting, but I suspect that you can do it pretty quickly, and we will all be happy to help and will be grateful for the contribution. :-)
This discussion has been helpful for me to understand what to look for in an HPC language for DSP. The OpenBLAS and Julia teams are my heroes for creating open-source software to compassionately break users free from the shackles of corporations, but I see that multi-chip-vendor dynamic threading control is too big of a task for (hobbyist) me or these volunteer organizations to take on in the near future. This clarified that I should focus on languages that offer integration with vendor-tuned BLAS libraries. Since I am only aware of Intel offering a vendor-tuned BLAS library, I will continue buying machines with Intel CPUs and using languages built with MKL BLAS. Julia has a beautiful syntax that I like better than Python's. Julia's syntax has the potential to support faster execution and code creation than Python. Once Julia makes the following changes, I plan to switch from Python/Anaconda to Julia for an improvement:
I am thankful to the Julia team for teaching me about secondary methods to speed up code execution, such as a macro preprocessor, multiple dispatch and pre-allocation of output arrays. What is in my reach as a hobbyist developer is to use these concepts in Python. My main area of interest is in applying DSP rather than creating another language for implementing DSP.
You can build Julia with MKL yourself; it's just not shipped that way.
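For reference, in the 2015-era source tree this was done with a `Make.user` file; the variable below existed in Julia's build system at the time, but exact options and the MKL environment-script path are assumptions that vary by Julia and MKL version:

```make
# Make.user in the Julia source tree -- sketch, 2015-era build options
USE_INTEL_MKL = 1        # link BLAS/LAPACK against MKL instead of OpenBLAS
# USE_INTEL_LIBM = 1     # optionally use Intel's libm as well
```

with MKL's environment sourced first, e.g. `source /opt/intel/mkl/bin/mklvars.sh intel64`, before running `make`.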
@timholy: He wants to avoid paying for MKL, which could be circumvented if an organization buys a license. And honestly, I am not sure we need discussions like "if you don't support this I stay with Python" here. Why should anybody switch if he/she is happy with Python?
Thanks for clarifying.
I don't think the build system will handle this properly on Windows until someone starts trying it and patching away at all of the gfortran assumptions. MKL and ifort are not gfortran-compatible on Windows the way they are on Unix platforms.
Pretty much every chip vendor does this. AMD has ACML, IBM has ESSL, Nvidia has NVBLAS, etc. Intel's MKL just happens to be much better-known and more widely used, and has more development resources put behind it than most of the others. I'm pretty sure OpenBLAS is faster than ACML and ESSL these days.
You can also start using LLVM 3.6
Subject to JuliaLang/julia#11083 and JuliaLang/julia#11085 on Windows, and a rather time-consuming source build process. If anyone ever wants custom unofficial nightly binaries with some unusual configuration, ask and we can probably make that happen.
@tknopp I don't think switching is the right word. People will add Julia to the list of tools they use to get things done. @hiccup7 Julia does not yet provide what the Anaconda Python distribution offers. Hopefully we will have a solution for this in the future, whether it is a separate Julia+MKL distribution, or dynamic tuning of threads in OpenBLAS, or in Julia. We all know this is needed. I do agree that this is not the place to discuss "I won't use Julia unless...". Those words come across as a bit negative and are not encouraging. More general discussion needs to be on the mailing list, and issues should really be about solutions and code.
The reason I gave a short list of reasons why Anaconda Python provides faster execution and code development for me than Julia was to provide specific, constructive feedback. I didn't want to mislead the Julia team into thinking that dynamic BLAS threading control was the only thing holding me back from using Julia. I am sensitive to how specific feedback makes the goal seem attainable and encourages the Julia team (whom I do support). I also ask the Julia team to be sensitive to Python users who go through a lot of effort to test Julia and find it slower despite the deceptive benchmarks and the "best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing" claim on the home page.
Thanks for the feedback. There seem to be existing issues covering all of the things mentioned in this thread, so closing in favor of those (#5991 for dynamic BLAS threads, several MKL issues, the debugger issue, and others).
I believe the words "open source" are also included somewhere near that sentence...
If it was really a lot of effort to test things on an evaluation basis (download, install, try?), and there are actionable things we can do to mitigate whatever (yet unspecified) trouble you had, please let us know. On the benchmarks, no deception has ever been intended: if your application is entirely dominated by BLAS calls and Python distribution X (or R distribution Y, or Matlab, etc.) with MKL is faster for you than Julia with OpenBLAS, no one here would tell you that picking Julia over Python would be your best choice. The intent of the benchmarks is to show that custom code implemented in Julia, using loops and recursion etc., can be fast, which is not the case in Python.
Dear all, I might not be at the right place to post my question and I apologize for that... but at this point, it seems to be the only discussion I've ever found on the net about accessing MKL from Julia with ccall. I've been struggling to do that on my Mac and I've posted for help here: https://groups.google.com/forum/#!topic/julia-users/pairCLeym0g If any one of you can post a small example or give a clue on this thread, it would be very useful (for me of course, but for the many others who will probably attempt to do that). Any news about Julia being shipped with MKL would be appreciated too. Thanks again,
I gave an answer on the mailing list that might help.
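For later readers, a minimal sketch of such a `ccall`, assuming MKL's single-dynamic-library interface (`libmkl_rt`) is findable by the loader and the LP64 integer interface is in use; the library name and path vary by platform and MKL version:

```julia
# Dot product through MKL's CBLAS interface via ccall.
# Assumes libmkl_rt is on the dynamic loader's search path
# (e.g. DYLD_LIBRARY_PATH on OS X) and LP64 integers (Cint).
n = 1000
x = rand(n)
y = rand(n)
d = ccall((:cblas_ddot, "libmkl_rt"), Float64,
          (Cint, Ptr{Float64}, Cint, Ptr{Float64}, Cint),
          n, x, 1, y, 1)
println("MKL cblas_ddot: $d, Julia dot: $(dot(x, y))")
```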
I have been using Python with MKL BLAS, which defaults to dynamically setting the number of threads used by each BLAS function. See:
https://software.intel.com/en-us/node/528546
https://software.intel.com/en-us/node/528547

In Julia, I would expect comparable speed from OpenBLAS by calling `blas_set_num_threads(CPU_CORES)`, but I would get a cold slap in the face. I learned that OpenBLAS will automatically use one thread for small arrays, but otherwise uses exactly the number of threads specified by `blas_set_num_threads()`. In other words, my understanding is that OpenBLAS doesn't have MKL BLAS' dynamic option.

As an example, on my Haswell CPU, where Julia reports `CPU_CORES` as 8, OpenBLAS' `dgemv()` function (from the `develop` branch) runs fastest with `blas_set_num_threads(2)`. It is not practical or realistic for me to put `blas_set_num_threads()` before each (hidden) call to a BLAS function.

For an Intel CPU with hyperthreading, better OpenBLAS performance would occur by using the number of physical cores instead of logical cores, for example `blas_set_num_threads(CPU_CORES >> 1)`. For portability, I suggest Julia include a new constant: `CPU_PHYSICAL_CORES`.

I realize that there is already an effort to provide MKL BLAS as a build and shipping option for Julia (JuliaLang/julia#10969). Assuming that OpenBLAS won't go away, however, it would be helpful if Julia provided a layer of abstraction to make OpenBLAS as performant and easy to use as MKL BLAS, and to make Julia code portable between builds with either one.

Specifically, I suggest that `blas_set_num_threads(-1)` cause Julia to use a dynamic number of threads. When built with MKL BLAS, this would cause MKL BLAS to effectively act like `MKL_DYNAMIC` is `True`. When built with OpenBLAS, a lookup table for each BLAS function would determine the maximum number of threads to use. There would be a different lookup table for each CPU architecture, such as Intel Haswell or Intel Sandy Bridge. Preferably, the lookup table would be in source code so that it could be tuned by each user. I would expect a lot of pull requests by the community for several months on the lookup tables for various CPUs, but I believe they would appreciate the speedup.