mold concurrency issues with ninja #117

Mold works great when compiling a single target binary, but when we do a complete repo-wide build (ninja with no target), things start to grind to a halt toward the end of the build when concurrent linking is occurring. It seems like it might be best to turn off concurrent linking when using ninja, as ninja is already trying to use the available cores for its own parallelism. I have tried disabling mold concurrency with -Wl,--no-threads, but I don't think that is correct. Do you have any guidance for ninja+mold?

Comments
Thank you for raising an important point. mold is highly parallelized, and that performance characteristic is different from what an ordinary build system expects. Eventually we need to improve ninja so that it adjusts its expectations about the linker's CPU usage. For now, did you try ninja's […]? For the record, I'll leave random thoughts here on what the problem is and what we should do:
This might be of interest: https://ninja-build.org/manual.html#ref_pool The usage example shows how to limit the number of parallel jobs when linking, because linkers themselves are already parallel (and memory hungry).
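As a sketch, a ninja file using such a pool might look like the following; the pool name, the depth of 2, and the link command are all arbitrary examples, not anything mold- or ninja-specific:

```ninja
# Allow at most two link jobs to run at once, regardless of -j.
pool link_pool
  depth = 2

rule link
  command = c++ $in -o $out
  pool = link_pool
```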
CMake already supports assigning link commands to a pool: https://cmake.org/cmake/help/latest/prop_tgt/JOB_POOL_LINK.html and https://cmake.org/cmake/help/latest/prop_gbl/JOB_POOLS.html
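The CMake side of the same idea might look like this (the pool name "link_jobs" and the depth of 2 are just illustrative):

```cmake
# Define a job pool and route all link steps through it (Ninja generator only).
set_property(GLOBAL PROPERTY JOB_POOLS link_jobs=2)
set(CMAKE_JOB_POOL_LINK link_jobs)
```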
A fairly common thing to do is to support the make jobserver protocol. It is simply a semaphore implemented with a Unix pipe: https://www.gnu.org/software/make/manual/html_node/Job-Slots.html Unfortunately ninja does not support this, but other tools, such as Rust's cargo, do (and GNU make, of course).
There was a PR to implement the jobserver protocol in ninja, but it has not been merged. See ninja-build/ninja#1140 and ninja-build/ninja#1139
It's also worth mentioning that gcc/ld support the make jobserver for LTO.
I think one of the problems with the jobserver is that a process that requests a new job slot can block indefinitely until the request is satisfied. Requesting a job slot is just writing a byte to a fifo, so if mold wants to reserve 16 jobs for itself, for example, it blocks until 16 bytes are fully written to the fifo. I think that would leave lots of resources unused during compilation.
Minor note: iirc, requesting a job is reading a byte; you return the job by writing it back again.
I guess the idea would be to adjust your thread pool dynamically when the jobserver allows you to get a token, not to wait for all the tokens to be available at once.
@glandium Can you do that? I mean, is there any way to know how many bytes you can write to a fifo without writing to it?
@rui314 Unless -j is more than PIPE_BUF, I don't see how writing can block. Besides, to get a job you read, so you can use select or poll to continuously request job slots and return them with write when you are done. I guess to do it you still need to make the read fd non-blocking. https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html#POSIX-Jobserver
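To make the mechanics concrete, here is a minimal sketch of a jobserver client, assuming the classic fd-pair form of --jobserver-auth in MAKEFLAGS; a real implementation would also handle the older --jobserver-fds spelling, the fifo form, and errors:

```cpp
// Minimal sketch of a GNU make jobserver client (fd-pair form).
#include <cstdio>    // sscanf
#include <cstdlib>   // getenv
#include <cstring>   // strstr
#include <unistd.h>  // read, write

struct Jobserver {
  int read_fd = -1, write_fd = -1;

  // Parse "--jobserver-auth=R,W" out of $MAKEFLAGS.
  bool init() {
    const char *flags = getenv("MAKEFLAGS");
    if (!flags) return false;
    const char *p = strstr(flags, "--jobserver-auth=");
    if (!p) return false;
    return sscanf(p, "--jobserver-auth=%d,%d", &read_fd, &write_fd) == 2;
  }

  // Acquire one extra job slot by reading a token byte; this blocks
  // until a slot is free. (Each process also owns one implicit slot.)
  bool acquire(char *token) {
    return read(read_fd, token, 1) == 1;
  }

  // Release a slot by writing the same token byte back.
  void release(char token) {
    write(write_fd, &token, 1);
  }
};
```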
@andrewchambers I think you are right (and I confused writing with reading again and again...). Let me think about whether we should support the jobserver protocol or not. It needs to be considered in the big picture, and I'm not confident the jobserver can solve the problem entirely, as the CPU is not the only resource that limits compilation speed. Memory is sometimes the stricter constraint.
@rui314 I wonder: does mold need all threads to start at the same time because they communicate with each other? Otherwise, I guess each thread should wait for a token from the make jobserver pipe separately.

In my experience, the only thing that has worked really well for us was back when we used the make jobserver for everything in our big build. We have a top-level Makefile orchestrating the whole build/packaging for different platforms etc. Then we introduced CMake and ninja, which do not cooperate with make. After this we have parallel ninja processes running on the same machine, called from the top-level Makefile. We try to share the CPU threads between the ninja jobs and not overcommit too much, but it does not work that well (both loading too little and too much, depending on ccache performance on a particular commit/platform). However, the LTO linking jobs actually help even things out, because ld.bfd creates that temporary makefile that makes the LTO linking obey the jobserver properly.

Also, in my experience the make load-average flag is superior to the ninja load-average flag. Make tries to estimate the load average when handling jobs, but for ninja the CPU load graph looked like a comb with spikes, as there is a delay in calculating the load average. With ccache the compilation jobs can be very quick and the kernel cannot keep up, I guess. This may have changed nowadays, IDK.
We might be able to add threads to the thread pool when a new thread becomes available via the jobserver pipe, but I don't think that would make things much better.

Let's say we have four linker processes. The worst scenario is that the four processes run simultaneously, and each process takes 4x more time to finish than it would have on an idle machine. While they run, they consume 4x memory, which is the biggest problem. So, what if we change the linker so that each process gradually adds more threads from the thread pool? Since it is after all CPU-constrained, the total execution time wouldn't change much (besides the overhead of frequent context switches), and the total memory consumption would be the same too.

What we really want is to run fewer linker processes simultaneously. If the linker itself scales to many cores, there's not much point in running several of them simultaneously in the first place. If we run the linkers one by one, it still takes 4x time in total, but the peak memory usage stays flat instead of 4x.

So, as you can see, the problem is that the make jobserver's concept doesn't fit mold very well. The jobserver assumes that each process is single-threaded and tries to optimize CPU usage in that problem space. mold broke that assumption. That's the fundamental problem, I think.
@rui314 If I understand you correctly, I agree in most cases. But I think our scenario is slightly different. We have no control over how many linker processes are running at the same time, as we have several ninja instances for different platforms running at the same time, orchestrated from a parent Makefile. They also share the CPU with a number of other packaging scripts, test run scripts, 3pp builds, and other things. In these scripts we try to use Makefiles to parallelize whenever it is needed, and it works very well to keep the CPU under control as long as everything is using make. Sometimes we also have multiple of these builds on the same machine, which is where the load-average flag comes into play.

In order to make the ninja pools work properly, I guess you will always need to have a single ninja instance? This can be quite hard to accomplish for bigger build systems with a lot of moving parts. In our case we can have call chains like "Makefile -> ninja * 5 -> test run script -> Makefile -> test suite * 40" or "Makefile -> ninja * 7 -> g++ -> ld.bfd -> generated Makefile -> linker thread * 30", for example. The make jobserver really helps keep the CPU (and indirectly memory) under control during the different build phases. My point is, unless I have understood ninja pools wrong, the make jobserver is vastly superior to pools for our use case. It just works with recursive makefiles mixed with scripts. There are some ninja forks with support for the make jobserver, and we will probably want to go there. People are forking ninja to get this feature.

The method of dynamically allocating threads when you can read a token could fit the make model. In practice, I guess that is how it works in the bfd/gold case with the generated temporary LTO makefiles, which will not start a new LTO linking "thread-process" unless make can get a token first. In both cases it will keep the token until that linking thread is done, I guess.
Invoking ninja from make seems more like an anti-pattern to me.
Agreed. Ninja files are usually generated by CMake (or some other build-file generator), so you may be able to generate a Makefile instead and use make throughout your project.
I agree that make calling ninja can be seen as an anti-pattern. It would be nice to not have to tune threading for ninja. But in real life it is often not that simple to change...

When migrating to CMake we started with Makefiles, but we had to switch to ninja, as this almost halved the build time. The reason is a combination of ccache and how CMake generated Makefiles, which would call the compiler twice: first to get the dependencies and then to compile. This caused a system load as high as 4/5 of the CPU due to the excessive file handling when you have a good ccache hit rate (mostly only preprocessing). The generated ninja files would only call the compiler once and get the dependencies as a side effect. (It is possible that CMake can generate smarter (GNU) Makefiles nowadays, but since then we have started to use the nice ninja dependency tree for queries and such, and are kind of stuck.)

There are other scenarios where we don't have control over the build system, for example when calling 3pp build systems and legacy parts. It would be a lot of work trying to squeeze everything into a single CMake-generated ninja file and then tune pools. We haven't really had any big RAM issues, so we haven't had any reason to restrict linking, but we do have 384G on our build machines.

I'm just trying to argue that the make jobserver is, for us, the ad hoc standard that makes it possible to maintain a big, highly parallelized build system with lots of moving parts, some of which are out of our control. Make jobserver support is one of the first things I look for in new tools using multithreading. For a linker, I also look at whether it supports gcc LTO. For our scenario it looks like mold would fit very nicely!!! We rely heavily on ccache, so linking is most often the build bottleneck. But this is also the reason why we would not want to restrict linking that much; when ccache has a good hit rate, we want to use the whole machine for linking. For example, during a build of the test platform we link ~1500 so-libs and ~2200 executables (excluding 3pps), so linking performance is very important :)
I would like to point out that GNU make has recently added a promising named-pipe jobserver. Please see more in the following Ninja post: ninja-build/ninja#1139 (comment).
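With the named-pipe style, MAKEFLAGS carries a fifo path instead of inherited fd numbers, so a client opens the fifo itself. A minimal sketch of just that detection, assuming GNU make 4.4's --jobserver-auth=fifo:PATH form:

```cpp
// Sketch: find the named-pipe jobserver advertised in MAKEFLAGS, e.g.
// MAKEFLAGS="-j8 --jobserver-auth=fifo:/tmp/GMfifo12345".
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>

// Returns a read/write fd for the jobserver fifo, or -1 if absent.
int open_jobserver_fifo() {
  const char *flags = getenv("MAKEFLAGS");
  if (!flags) return -1;
  const char *p = strstr(flags, "--jobserver-auth=fifo:");
  if (!p) return -1;
  char path[4096];
  if (sscanf(p, "--jobserver-auth=fifo:%4095s", path) != 1) return -1;
  return open(path, O_RDWR);  // tokens are read from and written back to this fd
}
```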
I've just implemented an experimental patch, but it has an unexpected implication: mold can run GCC in LTO mode, which then can't read any jobserver token, and we end up with a serial phase. That said, one […]
@marxin: I'm not sure I follow. Do you mean that jobserver tokens will be allocated during phases when mold does not need/use the threads? Quickly looking at the patch, it looks like it grabs as many threads as possible from the job token server. I guess this approach can starve other processes in the build. I would hope that you could cap the threading somehow, preferably also by reading "-j" and "-l" from $MAKEFLAGS. But to mimic make, I guess each started thread should probably grab a token with a blocking read from the token pipe, in practice hanging until a token is available, and then put it back when the thread exits. But this is without knowing how threading works in mold; I guess it could lock up everything if threads do not start until they get a token. It could possibly be even more fine-grained if a thread read a token when it needed to actually do some processing and wrote it back when done. But maybe the threading in mold is not some kind of message loop, so this model may not fit at all.
Yes, at least in my prototype implementation.
Yes, my implementation is naive and uses a greedy approach :/ Note […]
It depends. It's likely better not to wait, but to start processing "jobs" in a single thread and ask for another token once there's another opportunity for threading.
Yep, it should likely return a token when it's not needed. But I think the token manipulation should be integrated into the underlying TBB library, which knows more about thread spawning and can dynamically increase/decrease parallelism during a run of the mold linker. Anyway, it's not as trivial as I thought.
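As a sketch of what such an integration might look like: tbb::global_control is a real oneTBB knob that caps parallelism while the object is alive, but the token plumbing around it here is hypothetical, not anything from the patch:

```cpp
// Sketch: cap TBB parallelism at the number of jobserver tokens we hold.
#include <memory>
#include <tbb/global_control.h>

class TokenScaledPool {
  std::unique_ptr<tbb::global_control> limit_;
  int tokens_ = 1;  // every process owns one implicit job slot

public:
  TokenScaledPool() { apply(); }

  void add_token()    { ++tokens_; apply(); }
  void return_token() { if (tokens_ > 1) { --tokens_; apply(); } }

private:
  void apply() {
    limit_.reset();  // drop the old cap first; the lowest active cap wins
    limit_ = std::make_unique<tbb::global_control>(
        tbb::global_control::max_allowed_parallelism, tokens_);
  }
};
```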
If you plan to take this greedy approach, you could probably write a helper program that simply takes jobserver slots and then executes another command (mold in this case) with the right flag for how many jobs it was able to grab.
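A hedged sketch of such a wrapper, assuming the fd-pair form of --jobserver-auth; the --thread-count= flag spelling is an assumption and should be checked against the linker's man page:

```cpp
// Sketch: grab whatever jobserver tokens are immediately available, run the
// wrapped command with a matching thread count, then return the tokens.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <linker> [args...]\n", argv[0]);
    return 1;
  }

  int rfd = -1, wfd = -1;
  std::string tokens;  // token bytes we borrowed and must return
  const char *flags = getenv("MAKEFLAGS");
  const char *p = flags ? strstr(flags, "--jobserver-auth=") : nullptr;
  if (p && sscanf(p, "--jobserver-auth=%d,%d", &rfd, &wfd) == 2) {
    int old = fcntl(rfd, F_GETFL);
    fcntl(rfd, F_SETFL, old | O_NONBLOCK);  // take only what is free right now
    char tok;
    while (read(rfd, &tok, 1) == 1)
      tokens.push_back(tok);
    fcntl(rfd, F_SETFL, old);  // the pipe is shared with make; restore flags
  }

  // One implicit slot plus each borrowed token. The flag below is a
  // hypothetical spelling; adjust for the actual linker option.
  std::string threads = "--thread-count=" + std::to_string(tokens.size() + 1);

  std::vector<char *> args(argv + 1, argv + argc);
  args.push_back(const_cast<char *>(threads.c_str()));
  args.push_back(nullptr);

  pid_t pid = fork();
  if (pid == 0) {
    execvp(args[0], args.data());
    _exit(127);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  if (wfd >= 0 && !tokens.empty())
    write(wfd, tokens.data(), tokens.size());  // give the slots back
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```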
@andrewchambers: I agree, such a program could be useful in itself. However, in order to make it really useful for efficiently limiting total load during a build while not starving other jobs, a more dynamic approach is needed. But that could prove to be a bit too complex, and maybe it does not fit the mold threading model at all. (We made an attempt to use ninja pools for linking in another case where we only have one ninja instance. It was hard to tune and was always slower than the "free threading" approach. The linking jobs vary a lot in time/size, so it feels clunky to tune as well.)
I'm glad to see someone else has already suggested the GNU jobserver here. We run into this problem in Gentoo because our package manager is source-based. To get a predictable level of CPU usage, users generally want to configure the number of simultaneous CPU-intensive jobs that can run at any one time, regardless of the nature of those jobs (compile, link, whatever). Without coordination, you wind up in a situation where every tool wants to fire off […]. The GNU jobserver is the closest thing to a standard way to coordinate between the different processes, and hopefully ninja will finally adopt it soon.
I'm thinking of something easier to implement and understand. That is, what if we just allow only one instance of mold at a time? There's not much point in running multiple mold instances on a single machine simultaneously. So, if mold is invoked while another mold instance is running, we can just make it wait until the first process finishes.
How do you suppose that is to work when containers and multiple users are involved? That is, what mechanism will be used to measure "uniqueness"? And why is […]
I consider this more a matter of predictability than anything else, and even one instance of mold is unpredictable at the moment. If I have 8 cores, I might (for example) configure my package manager to launch at most two simultaneous jobs via […]. That may not be the optimal use of my resources, but it matches most closely my intent when I set e.g. […]. I can of course force mold to use only one thread, but that's overly pessimistic: if at any point linking is all that's happening, I want mold to be able to use as many threads as there are job slots available.
Note that there are also things that […]
Keep in mind that there's no point in running multiple mold instances simultaneously unless your machine has a really large number of cores, like 64 or 128. If you share 16 cores between two mold instances, they will take 8 cores each, and each of them takes 2x more time than it would have taken with 16 cores. The peak memory usage is 2x because you are running two processes simultaneously. If you run the two processes serially, you can halve the peak memory usage. In general, multi-threaded, scalable programs don't need to be invoked in parallel with others to shorten the overall latency.
That's exactly the issue, but it's not just multiple mold processes that are a problem. If one mold process is using all of my available CPU, it's also increasingly pointless to launch further compilation processes. The jobserver protocol lets gcc, mold, etc. all pull from the same pool so that no more threads are used than would be useful (or than have been specified by the user; maybe I want to build two programs in two terminals and allocate to each of them half of my cores).
So there are two types of processes: single-threaded and multi-threaded. The jobserver is designed only with the former in mind and doesn't work well to coordinate the latter. For example, we don't want to run mold with 3 threads when the jobserver has three available cores and later invoke another mold instance with a single thread when one core becomes available. We want to run the first mold instance with all available cores instead, so that it finishes as quickly as possible. This can be achieved by reserving as many cores as available in the mold startup routine, preventing other mold processes from running, and forgetting about the jobserver.
So, if N cores are available, we want to run up to N single-threaded processes and zero or one multi-threaded process (which is not affected by how large N is). That should achieve the maximum CPU utilization while minimizing the peak memory usage.
The number of physical cores is a red herring, I think. We want to run as many jobs as the user asks us to. If he has two terminals open and invokes one build system in each terminal with […]. That may be "suboptimal," but only in one specific sense. For us (and in the original issue report), predictability is more important than maximal resource usage at all times.
I think that's what should happen :) If a core (job slot) becomes available and a linking job is queued up next, we should start it. When the other cores become available, there will generally be other jobs waiting to eat them up. The machine may wind up idle at some point, but if the user asks for (say) two jobs, we shouldn't try to outsmart him.
I think that the notion of "jobs" is actually a red herring. Build tools have long been single-threaded; make runs multiple processes concurrently as a workaround for that. If each tool could saturate the machine's computing power, and I/O weren't a bottleneck, make wouldn't have to run them concurrently. Doing so would even hurt performance, because the working set size increases with the number of concurrent processes. So there are really two types of processes: the traditional, single-threaded kind and the modern, multi-threaded kind. The latter scales by itself and doesn't need the workaround.
I disagree. If you apply the notion of "jobs" to multi-threaded, scalable processes, I understand how you would reach that conclusion. But the notion of jobs isn't directly applicable to such processes.
We don't want to saturate the CPU! I cannot stress that enough. We want to do what the user asks. That may saturate the CPU or not.
Everyone is aware of the difference. Quoting https://www.gnu.org/software/make/manual/html_node/Job-Slots.html: […]
You're focused on squeezing the maximum performance out of the build. That's the right default, and it is why people want to use mold in the first place. But what we're asking for is something different: if the user explicitly asks (with […]
Saturating the CPU is fine. Rather, we want to always saturate the CPU. Unlike memory, whose overallocation causes thrashing, having many threads waiting for CPU cycles is desirable, because otherwise we would just waste CPU cycles running idle loops. If you want to make some processes use less CPU so that more CPU is available to other groups of processes, you should use […]
No, it's not.
If the user says the CPU should not be saturated, then the CPU should not be saturated.
(1) may make sense, but (2) and (3) do not. If a maximum CPU utilization is to be enforced as a hard constraint, it really needs to be enforced by some kernel-level mechanism, such as simply downclocking the CPU.
There is also […]
Hi, I've recently started using […]. What's the current best practice for using […]?
Put all […]
In the above commit, I added an experimental feature to limit the number of active mold processes to 1. Setting the environment variable MOLD_JOBS to 1 makes mold wait until no other mold instance is running before it proceeds. Can you guys try this environment variable to see if it's useful?
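A minimal sketch of how such a single-instance gate could work; the lock-file path and the MOLD_JOBS handling here are illustrative, not mold's actual implementation:

```cpp
// Sketch: serialize linker instances with an advisory file lock. flock()
// blocks until the previous holder exits; the kernel releases the lock
// automatically when the process terminates.
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/file.h>

void wait_for_exclusive_slot() {
  const char *env = getenv("MOLD_JOBS");
  if (!env || std::string(env) != "1")
    return;  // feature disabled

  // Prefer the per-user runtime dir; fall back to /tmp (see the
  // discussion below about why /tmp is problematic).
  const char *dir = getenv("XDG_RUNTIME_DIR");
  std::string path = std::string(dir ? dir : "/tmp") + "/mold.lock";

  int fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
  if (fd < 0)
    return;  // can't lock; proceed rather than fail the link
  flock(fd, LOCK_EX);  // blocks until we are the only instance
  // Intentionally keep fd open: the lock is dropped when we exit.
}
```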
I'm not sure if I'm using it right, but I built […] ld: 45s […] So, slightly faster with the most recent commit (but that might be unrelated to the MOLD_JOBS part), but not a huge difference. For reference, here's a zip file of the three traces (viewable on https://ui.perfetto.dev/) for the three builds above: […] Anecdotally, a coworker of mine is seeing about a 50% slowdown using […]
That change is not supposed to improve build speed; it decreases peak memory usage by sacrificing a little parallelism. So if your system has enough RAM, it could even have a slightly negative impact on performance. However, if your build is under memory pressure, the situation might be different.
I recently tried to build LLVM on Asahi Linux on my Mac Mini, which is equipped with an 8-core processor and 16 GiB of RAM. It's a relatively powerful machine but has limited memory. On that machine, I found that […] Has anyone else had a similar experience?
Urgh. This kind of file belongs in […]
Then a user on the same system could do a mild DoS attack against you by creating your lockfile in /tmp before you do and making it unreadable.
For the lockfile in […]
@NobodyXu Thanks for the info! I wasn't aware that […]
Now that we have the […]