Try adding local spawn threads by default to parallelize the fork/exec process. #4532
Conversation
@ggouaillardet While working on this, I discovered that the multi-threaded fork/exec is broken. It appears there is a race condition between the odls threads, so I'm guessing they are touching something that is global. Would you have time to take a look?
OK. Will do from Monday.
It always occurs - just set the num threads to 2 and launch 8 local procs so both threads are actively trying to spawn.
@rhc54 I found a race condition and pushed a commit that fixes it by mutex-protecting a function. At first glance, it seems this function could be invoked before the callback is invoked by a thread.
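For readers unfamiliar with the pattern, here is a minimal sketch of what "mutex-protecting a function" means, using plain pthreads; the function and lock names are placeholders, not the actual ORTE symbols touched by the commit.

#include <pthread.h>

/* Placeholder names only - not the actual ORTE routine or lock. The point is
 * simply to serialize a routine that the odls spawn threads would otherwise
 * enter concurrently while it touches process-global state. */
static pthread_mutex_t spawn_setup_lock = PTHREAD_MUTEX_INITIALIZER;

static int setup_child_launch(void)
{
    int rc = 0;

    pthread_mutex_lock(&spawn_setup_lock);
    /* ... work on process-global state (environment, cwd, ...) ... */
    pthread_mutex_unlock(&spawn_setup_lock);

    return rc;
}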
Ick - what that tells us is that the chdir being executed in that function is causing problems across the threads. If there is only one app_context, or only one working directory for all the app_contexts, then we could just do the chdir once and move back to the daemon's wdir after everything is launched. However, that would be too limiting. I think the only real solution here is to do the chdir, assign all the local procs for a given app_context to the spawning threads, and then wait until they have all been spawned before moving on to the next app_context. Once all have been spawned, the daemon chdirs back to its original wdir. I'll try to work on that as time permits.
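A rough sketch of that sequencing, using hypothetical stand-ins (app_desc_t, spawn_procs_for_app, wait_for_app_spawns) rather than the real ORTE types and calls: since chdir() changes the working directory of the entire daemon process, every proc of one app_context has to be spawned before the next chdir is issued.

#include <limits.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real ORTE structures and launch routines. */
typedef struct {
    const char *wdir;                    /* working directory for this app_context */
} app_desc_t;

static void spawn_procs_for_app(app_desc_t *app) { (void)app; /* hand procs to the spawn threads */ }
static void wait_for_app_spawns(app_desc_t *app) { (void)app; /* block until all have been forked */ }

static int launch_all_apps(app_desc_t *apps, int num_apps)
{
    char daemon_wdir[PATH_MAX];
    int i;

    if (NULL == getcwd(daemon_wdir, sizeof(daemon_wdir))) {
        return -1;                       /* cannot record where to return to */
    }
    for (i = 0; i < num_apps; i++) {
        if (0 != chdir(apps[i].wdir)) {
            return -1;
        }
        spawn_procs_for_app(&apps[i]);   /* assign this app's local procs to the threads */
        wait_for_app_spawns(&apps[i]);   /* do not touch the cwd until all are spawned */
    }
    return chdir(daemon_wdir);           /* restore the daemon's original wdir */
}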
I do not think we are talking about the same thing here :-( My commit protects the [...] you are referring to. I ran various tests and was unable to evidence any issue involving chdir.
Got it - sorry for the confusion. I was able to clean it up so we avoid the mutex. There is a hang in comm_spawn that has nothing to do with this change, but I'm going to hold off until after the developer's telecon to see if anyone has heartburn over the change.
Per today's telecon, the idea of defaulting to using multiple spawn threads is acceptable. However, a change was requested where we start the threads upon receipt of the launch msg and base the number of threads on the actual number of procs to be locally spawned. We also agreed to set a threshold of 32 procs for now, settable by param, until we get more data. We will continue to cap the number of threads at 8, again settable by param pending more data.
On the call today, I reported some positive performance runs with the thread-based launch option. I re-ran those numbers using today's master branch - one set of runs using mpirun and one using prun against the DVM. It looks like after 2 threads the benefit in the [...] case tapers off.
A colleague mentioned posix_spawn, which solves some of the problems related to fork/exec.
I'm not sure I grok that documentation. It states that posix_spawn is meant to provide a way of doing "fork" where "fork" isn't available - and then states that the function starts out by calling "fork". I can only assume they mean it starts by "doing the equivalent of fork", but it sure isn't clear. Regardless, there is no explanation as to why this would be more efficient. The implication, since it specifically mentions its use on embedded systems lacking an MMU, is that it somehow avoids doing the page copy - is that your understanding?
Maybe better documentation is the POSIX standard itself. My understanding of this function is that it provides a combined "fork+exec" call, in which neither memory pages nor file descriptors are inherited. It can be implemented as a fork, followed by closing all the fds, plus an exec, but on systems where special support is available this function can be highly optimized (twitter link).
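For reference, a self-contained example of the combined fork+exec interface being discussed, using only standard POSIX calls and unrelated to the ORTE odls code:

#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Spawn /bin/echo via posix_spawn and wait for it to finish. */
int main(void)
{
    pid_t pid;
    int status;
    char *child_argv[] = { "echo", "spawned via posix_spawn", NULL };

    /* posix_spawn returns an error number directly rather than setting errno */
    int rc = posix_spawn(&pid, "/bin/echo", NULL, NULL, child_argv, environ);
    if (0 != rc) {
        fprintf(stderr, "posix_spawn failed: %s\n", strerror(rc));
        return EXIT_FAILURE;
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return EXIT_FAILURE;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;
}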
@jjhursey I agree that something fishy is going on in these graphs. Since you are using only one node, the host daemon is exactly the same code (the DVM master is just mpirun). The only difference is that the DVM timings don't include the time required to setup the HNP itself since that was already done prior to executing "prun". I assume these graphs are just with "time mpirun" and "time prun", yes? Perhaps our method of measurement is misleading us.
For these runs I force it not to use the local node:
For these I am reporting the minimum wall clock time of 5 consecutive runs.
@jjhursey Okay, that makes sense as the time difference would be explained by the time required to start and wireup the daemon for mpirun (and not for the DVM). Hard to understand the difference in launch time curvature, though, unless somehow the DVM daemons aren't spinning up the threads? Might be worth checking with a simple print statement.
Silly me. I was passing the [...] incorrectly. I wonder if the increasing tail at the end is the impact of the processes all exiting together, and at that point there is only 1 thread so it serializes. We are talking about tenths of a second on this system though, so I don't know how much effort we want to put into getting that line flatter. But if there are folks launching many hundreds of processes per node, maybe it becomes more pronounced.
w00t! That looks much better. Seems like the "sweet spot" is at about 4 threads for both cases at full scale, which is fewer than I would have expected. I'd have to think about how you'd do the reaper as that isn't as obvious to me. I'll also take a gander at this posix_spawn thing - maybe provide that as an experimental option.
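One way a "reaper" could look, as a hedged sketch only: a loop around waitpid() that harvests any terminated local children and hands their status to a callback. record_child_exit is a hypothetical hook, not an ORTE function.

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Harvest any children that have already terminated, without blocking. */
static void reap_children(void (*record_child_exit)(pid_t, int))
{
    pid_t pid;
    int status;

    for (;;) {
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            record_child_exit(pid, status);   /* a child terminated - record it */
            continue;
        }
        if (0 == pid) {
            break;                            /* children exist but none have exited yet */
        }
        if (EINTR == errno) {
            continue;                         /* interrupted - retry */
        }
        break;                                /* ECHILD (no children left) or real error */
    }
}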
I updated this PR per the telecon:
* Change the determination of the number of spawn threads to be done on the basis of the number of local procs in the first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.
* Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.
* Add MCA params to set:
  * max number of spawn threads (default: 4)
  * a specific number of spawn threads (default: -1, indicating no set number)
  * cutoff - the minimum number of local procs before using spawn threads (default: 32)
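A rough sketch of how the thread-count policy above could be expressed; the parameter names and the scaling heuristic are illustrative assumptions, not the PR's actual MCA variable names or code.

/* Decide how many spawn threads to use for a given local launch.
 * Defaults described in the PR: max_threads = 4, forced_threads = -1,
 * cutoff = 32 local procs. */
static int compute_num_spawn_threads(int num_local_procs,
                                     int max_threads,
                                     int forced_threads,
                                     int cutoff)
{
    int n;

    if (0 < forced_threads) {
        return forced_threads;               /* explicit override wins */
    }
    if (num_local_procs < cutoff) {
        return 0;                            /* few procs - just spawn inline */
    }
    /* Scale with the local proc count, capped at the maximum; one thread
     * per 'cutoff' procs is merely a plausible heuristic. */
    n = num_local_procs / cutoff;
    if (n < 1) {
        n = 1;
    }
    if (n > max_threads) {
        n = max_threads;
    }
    return n;
}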
…exec process to speed up the launch on large SMPs. Harvest the threads after initial spawn to minimize any impact on running jobs.

Change the determination of #spawn threads to be done on basis of #local procs in first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.

Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.

Add MCA params to set:
* max number of spawn threads (default: 4)
* set a specific number of spawn threads (default: -1, indicating no set number)
* cutoff - minimum number of local procs before using spawn threads (default: 32)

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
@rhc54 FYI, I pushed a fix for [...]
orte/mca/odls/base/odls_base_frame.c
static int orte_odls_base_close(void)
{
    int i;
    orte_proc_t *proc;
    opal_list_item_t *item;

    opal_output(0, "CLOSE");
Is this debug code that should be removed?
Yes, thanks - I didn't see you had made a change and was wondering why it suddenly wasn't failing! 😄
Signed-off-by: Ralph Castain <[email protected]>
Clean up the threads once initial launch is complete - let comm_spawn launches proceed at a slower pace.
Signed-off-by: Ralph Castain <[email protected]>