
Try adding local spawn threads by default to parallelize the fork/exec process. #4532

Merged Nov 30, 2017 (3 commits)

Conversation

@rhc54 (Contributor) commented Nov 25, 2017

Clean up the threads once initial launch is complete - let comm_spawn launches proceed at a slower pace.

Signed-off-by: Ralph Castain [email protected]

@rhc54 (Contributor, Author) commented Nov 25, 2017

@ggouaillardet While working on this, I discovered that the multi-threaded fork/exec is broken. It appears there is a race condition between the odls threads, so I'm guessing they are touching something that is global.

Would you have time to take a look?

@ggouaillardet (Contributor) commented

OK, will do from Monday.
Just to be clear, does the issue occur only with this PR?
Or does it also occur on master when the number of threads is forced to a non-zero (and non-one?) value?

@rhc54 (Contributor, Author) commented Nov 25, 2017

It always occurs - just set the num threads to 2 and launch 8 local procs so both threads are actively trying to spawn.
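
For reference, a concrete invocation of this reproducer, using the odls_base_num_threads MCA parameter that is quoted later in the thread:

# Force two odls spawn threads and eight local procs so both threads fork/exec concurrently
mpirun --mca odls_base_num_threads 2 -np 8 /bin/true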

@ggouaillardet (Contributor) commented

@rhc54 I found a race condition and pushed a commit that fixes it by mutex-protecting a function.

At first glance, it seems this function could instead be invoked before the callback is invoked by a thread, or it could even be invoked only once per app; either option would eliminate the need for the mutex.
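
As an illustration of the serialization pattern described here, a minimal, self-contained sketch using plain pthreads rather than the actual ORTE/OPAL types (the real fix wraps orte_util_check_context_app(); the names below are placeholders only):

#include <pthread.h>

/* Placeholder for the routine that raced across the odls spawn threads;
 * the real code resolves the executable path and updates context->app. */
static int check_context_app(const char *app)
{
    (void)app;   /* ... look up the path, update shared state ... */
    return 0;
}

static pthread_mutex_t check_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialized wrapper: only one spawn thread at a time may run the check. */
static int check_context_app_locked(const char *app)
{
    pthread_mutex_lock(&check_lock);
    int rc = check_context_app(app);
    pthread_mutex_unlock(&check_lock);
    return rc;
}

int main(void)
{
    return check_context_app_locked("/bin/true");
}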

@rhc54 (Contributor, Author) commented Nov 27, 2017

Ick - what that tells us is that the chdir being executed in that function is causing problems across the threads. If there is only one app_context, or only one working directory for all the app_contexts, then we could just do the chdir once and move back to the daemon's wdir after everything is launched. However, that would be too limiting.

I think the only real solution here is to do the chdir, assign all the local procs for a given app_context to the spawning threads, and then wait until they have all been spawned before moving on to the next app_context. Once all have been spawned, the daemon chdirs back to its original wdir.

I'll try to work on that as time permits.

@ggouaillardet (Contributor) commented

I do not think we are talking about the same thing here :-(

My commit protects orte_util_check_context_app() so there is no longer a race condition accessing/updating context->app (fwiw, the race condition can cause opal_path_findv() to return NULL, even with something as simple as mpirun --mca odls_base_num_threads 2 -np 8 true).

You are referring to the chdir() that is invoked in orte_util_check_context_cwd(), and that is invoked in the "serial" part (i.e. not in an odls progress thread).

I ran various tests and was unable to reproduce any issue involving chdir().

@rhc54 (Contributor, Author) commented Nov 27, 2017

Got it - sorry for the confusion. I was able to clean it up so we avoid the mutex. There is a hang in comm_spawn that has nothing to do with this change, but I'm going to hold off until after the developers' telecon to see if anyone has heartburn over the change.

@rhc54 (Contributor, Author) commented Nov 28, 2017

Per today's telecon, the idea of defaulting to using multiple spawn threads is acceptable. However, a change was requested: start the threads upon receipt of the launch msg and base the number of threads on the actual number of procs to be locally spawned. We also agreed to set a threshold of 32 procs for now, settable by param, until we get more data. We will continue to cap the number of threads at 8, again settable by param pending more data.

@jjhursey (Member) commented

On the call today, I reported some positive performance runs with the thread-based launch option. I re-ran those numbers using today's master branch - one set using mpirun-based launch, and one using the DVM with prun launching. Attached are the graphs.

It looks like after 2 threads the benefit in the mpirun-based launch diminishes. Just supplying the -mca odls_base_num_threads 1 option already seems to help. There was no benefit to the prun launch, which is interesting.

ppn-scaling-dvm.pdf
ppn-scaling.pdf

@bosilca (Member) commented Nov 28, 2017

A colleague mentioned posix_spawn, which solves some of the problems related to fork/exec.

@rhc54 (Contributor, Author) commented Nov 28, 2017

I'm not sure I grok that documentation. It states that posix_spawn is meant to provide a way of doing "fork" where "fork" isn't available - and then states that the function starts out by calling "fork". I can only assume they mean it starts by "doing the equivalent of fork", but it sure isn't clear.

Regardless, there is no explanation as to why this would be more efficient. The implication, since it specifically mentions its use on embedded systems lacking an MMU, is that it somehow avoids doing the page copy - is that your understanding?

@bosilca (Member) commented Nov 28, 2017

Maybe better documentation is the POSIX standard itself. My understanding is that this function provides a combined "fork+exec" call, in which neither memory pages nor file descriptors are inherited. It can be implemented as a fork, followed by closing all the fds, plus an exec, but on systems where special support is available this function can be highly optimized (twitter link)
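
For reference, a minimal, self-contained example of the combined fork+exec interface being discussed - just the POSIX call itself, not a sketch of how Open MPI would use it:

#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *child_argv[] = { "/bin/true", NULL };

    /* posix_spawn() combines fork+exec; NULL file actions and attributes
     * request the default behavior. It returns an error number on failure. */
    int rc = posix_spawn(&pid, "/bin/true", NULL, NULL, child_argv, environ);
    if (0 != rc) {
        fprintf(stderr, "posix_spawn failed: %d\n", rc);
        return EXIT_FAILURE;
    }

    /* Reap the child just as one would after a plain fork/exec. */
    int status;
    waitpid(pid, &status, 0);
    return EXIT_SUCCESS;
}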

@rhc54 (Contributor, Author) commented Nov 29, 2017

@jjhursey I agree that something fishy is going on in these graphs. Since you are using only one node, the host daemon is exactly the same code (the DVM master is just mpirun). The only difference is that the DVM timings don't include the time required to set up the HNP itself, since that was already done prior to executing "prun".

I assume these graphs are just with "time mpirun" and "time prun", yes? Perhaps our method of measurement is misleading us.

@jjhursey (Member) commented

For these runs I force it not to use the local node:

# Where ${X} is the NP along the x axis
mpirun -np ${X} -npernode 160 --nolocal --hostfile hostfile-mpi.txt /bin/true

# Since --nolocal doesn't work with prun, force a specific host 
# that is the same as would have been chosen by mpirun above
prun --system-server-only -np ${X} -npernode 160  --host nodeA /bin/true

For these I am reporting the minimum wall clock time of 5 consecutive runs.

@rhc54 (Contributor, Author) commented Nov 29, 2017

@jjhursey Okay, that makes sense, as the time difference would be explained by the time required to start and wire up the daemon for mpirun (and not for the DVM). Hard to understand the difference in launch time curvature, though, unless somehow the DVM daemons aren't spinning up the threads? Might be worth checking with a simple print statement.

@jjhursey (Member) commented

Silly me. I was passing the -mca odls_base_num_threads parameter to prun, not orte-dvm - that's why it wasn't showing any gains. Attached is a new graph from this morning that looks a bit better.

ppn-scaling-dvm.pdf

I wonder if the increasing tail at the end is the impact of all the processes exiting together; at that point there is only 1 thread, so it serializes. We are talking about tenths of a second on this system, though, so I don't know how much effort we want to put into getting that line flatter. But if there are folks launching many hundreds of processes per node, maybe it becomes more pronounced.

@rhc54 (Contributor, Author) commented Nov 29, 2017

w00t! That looks much better. Seems like the "sweet spot" is at about 4 threads for both cases at full scale, which is fewer than I would have expected. I'd have to think about how you'd do the reaper as that isn't as obvious to me. I'll also take a gander at this posix_spawn thing - maybe provide that as an experimental option.

@rhc54 (Contributor, Author) commented Nov 30, 2017

I updated this PR per the telecon:

Change the determination of the number of spawn threads to be based on the number of local procs in the first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.

Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.

Add MCA params to set (a usage sketch follows the list below):

  • max number of spawn threads (default: 4, based on @jjhursey data)
  • set a specific number of spawn threads (default: -1, indicating no set number)
  • cutoff - minimum number of local procs before using spawn threads (default: 32)
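
A usage sketch only: odls_base_num_threads is the parameter quoted earlier in this thread, while the names for the thread cap and the proc-count cutoff below are hypothetical placeholders, not necessarily the names this PR adds.

# Force a specific number of spawn threads (parameter quoted earlier in this thread)
mpirun --mca odls_base_num_threads 4 -np 256 ./my_app

# Hypothetical parameter names for the cap and the cutoff described above;
# the actual MCA param names added by this PR may differ
mpirun --mca odls_base_max_threads 8 --mca odls_base_spawn_thread_cutoff 32 -np 256 ./my_app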

Commit: …exec process to speed up the launch on large SMPs. Harvest the threads after initial spawn to minimize any impact on running jobs.

Change the determination of #spawn threads to be done on basis of #local procs in first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.

Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.

Add MCA params to set:
  * max number of spawn threads (default: 4)
  * set a specific number of spawn threads (default: -1, indicating no set number)
  * cutoff - minimum number of local procs before using spawn threads (default: 32)

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet (Contributor) commented

@rhc54 FYI, I pushed a fix for plm/alps. Could you please double check it?

static int orte_odls_base_close(void)
{
    int i;
    orte_proc_t *proc;
    opal_list_item_t *item;

    opal_output(0, "CLOSE");
A Contributor commented on this diff:

Is this debug code that should be removed?

@rhc54 (Contributor, Author) replied:

Yes, thanks - I didn't see you had made a change and was wondering why it suddenly wasn't failing! 😄

Signed-off-by: Ralph Castain <[email protected]>
@rhc54 rhc54 merged commit 0fcc996 into open-mpi:master Nov 30, 2017
@rhc54 rhc54 deleted the topic/odls branch November 30, 2017 14:38