
Try adding local spawn threads by default to parallelize the fork/exec process. #4532

Merged Nov 30, 2017 (3 commits)

Conversation

@rhc54 (Contributor) commented Nov 25, 2017

Clean up the threads once initial launch is complete - let comm_spawn launches proceed at a slower pace.

Signed-off-by: Ralph Castain [email protected]

@rhc54 (Contributor, Author) commented Nov 25, 2017

@ggouaillardet While working on this, I discovered that the multi-threaded fork/exec is broken. It appears there is a race condition between the odls threads, so I'm guessing they are touching something that is global.

Would you have time to take a look?

@ggouaillardet (Contributor) commented

OK, will do from Monday.
Just to be clear, does the issue occur only with this PR?
Or does it also occur on master when the number of threads is forced to a non-zero (and non-one?) value?

@rhc54 (Contributor, Author) commented Nov 25, 2017

It always occurs - just set the num threads to 2 and launch 8 local procs so both threads are actively trying to spawn.
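
For reference, a concrete invocation of this reproducer, using the odls_base_num_threads MCA parameter that is quoted later in the thread:

# Force two odls spawn threads and eight local procs so both threads fork/exec concurrently
mpirun --mca odls_base_num_threads 2 -np 8 /bin/true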

@ggouaillardet (Contributor) commented

@rhc54 I found a race condition and pushed a commit that fixes it by mutex-protecting a function.

At first glance, it seems this function could instead be invoked before the callback is invoked by a thread, or it could even be invoked only once per app; either option would eliminate the need for the mutex.
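
As an illustration of the serialization pattern described here, a minimal, self-contained sketch using plain pthreads rather than the actual ORTE/OPAL types (the real fix wraps orte_util_check_context_app(); the names below are placeholders only):

#include <pthread.h>

/* Placeholder for the routine that raced across the odls spawn threads;
 * the real code resolves the executable path and updates context->app. */
static int check_context_app(const char *app)
{
    (void)app;   /* ... look up the path, update shared state ... */
    return 0;
}

static pthread_mutex_t check_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialized wrapper: only one spawn thread at a time may run the check. */
static int check_context_app_locked(const char *app)
{
    pthread_mutex_lock(&check_lock);
    int rc = check_context_app(app);
    pthread_mutex_unlock(&check_lock);
    return rc;
}

int main(void)
{
    return check_context_app_locked("/bin/true");
}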

@rhc54 (Contributor, Author) commented Nov 27, 2017

Ick - what that tells us is that the chdir being executed in that function is causing problems across the threads. If there is only one app_context, or only one working directory for all the app_contexts, then we could just do the chdir once and move back to the daemon's wdir after everything is launched. However, that would be too limiting.

I think the only real solution here is to do the chdir, assign all the local procs for a given app_context to the spawning threads, and then wait until they have all been spawned before moving on to the next app_context. Once all have been spawned, the daemon chdirs back to its original wdir.

I'll try to work on that as time permits.

@ggouaillardet (Contributor) commented

I do not think we are talking about the same thing here :-(

My commit protects orte_util_check_context_app() so there is no longer a race condition accessing/updating context->app (fwiw, the race condition can cause opal_path_findv() to return NULL, even with something as simple as mpirun --mca odls_base_num_threads 2 -np 8 true).

You are referring to the chdir() that is invoked in orte_util_check_context_cwd(), and that is invoked in the "serial" part (i.e. not in an odls progress thread).

I ran various tests and was unable to reproduce any issue involving chdir().

@rhc54 (Contributor, Author) commented Nov 27, 2017

Got it - sorry for the confusion. I was able to clean it up so we avoid the mutex. There is a hang in comm_spawn that has nothing to do with this change, but I'm going to hold off until after the developers' telecon to see if anyone has heartburn over the change.

@rhc54 (Contributor, Author) commented Nov 28, 2017

Per today's telecon, the idea of defaulting to using multiple spawn threads is acceptable. However, a change was requested: start the threads upon receipt of the launch msg and base the number of threads on the actual number of procs to be locally spawned. We also agreed to set a threshold of 32 procs for now, settable by param, until we get more data. We will continue to cap the number of threads at 8, again settable by param pending more data.

@jjhursey (Member) commented

On the call today, I reported some positive performance runs with the thread-based launch option. I re-ran those numbers using today's master branch - one set using mpirun-based launch, and one using the DVM with prun launching. Attached are the graphs.

It looks like after 2 threads the benefit in the mpirun-based launch diminishes. Just supplying the -mca odls_base_num_threads 1 option already seems to help. There was no benefit to the prun launch, which is interesting.

ppn-scaling-dvm.pdf
ppn-scaling.pdf

@bosilca (Member) commented Nov 28, 2017

A colleague mentioned posix_spawn, which solves some of the problems related to fork/exec.

@rhc54 (Contributor, Author) commented Nov 28, 2017

I'm not sure I grok that documentation. It states that posix_spawn is meant to provide a way of doing "fork" where "fork" isn't available - and then states that the function starts out by calling "fork". I can only assume they mean it starts by "doing the equivalent of fork", but it sure isn't clear.

Regardless, there is no explanation as to why this would be more efficient. The implication, since it specifically mentions its use on embedded systems lacking an MMU, is that it somehow avoids doing the page copy - is that your understanding?

@bosilca (Member) commented Nov 28, 2017

Maybe better documentation is the POSIX standard itself. My understanding is that this function provides a combined "fork+exec" call, in which neither memory pages nor file descriptors are inherited. It can be implemented as a fork, followed by closing all the fds, plus an exec, but on systems where special support is available this function can be highly optimized (twitter link)
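
For reference, a minimal, self-contained example of the combined fork+exec interface being discussed - just the POSIX call itself, not a sketch of how Open MPI would use it:

#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *child_argv[] = { "/bin/true", NULL };

    /* posix_spawn() combines fork+exec; NULL file actions and attributes
     * request the default behavior. It returns an error number on failure. */
    int rc = posix_spawn(&pid, "/bin/true", NULL, NULL, child_argv, environ);
    if (0 != rc) {
        fprintf(stderr, "posix_spawn failed: %d\n", rc);
        return EXIT_FAILURE;
    }

    /* Reap the child just as one would after a plain fork/exec. */
    int status;
    waitpid(pid, &status, 0);
    return EXIT_SUCCESS;
}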

@rhc54 (Contributor, Author) commented Nov 29, 2017

@jjhursey I agree that something fishy is going on in these graphs. Since you are using only one node, the host daemon is exactly the same code (the DVM master is just mpirun). The only difference is that the DVM timings don't include the time required to set up the HNP itself, since that was already done prior to executing "prun".

I assume these graphs are just with "time mpirun" and "time prun", yes? Perhaps our method of measurement is misleading us.

@jjhursey (Member) commented

For these runs I force it not to use the local node:

# Where ${X} is the NP along the x axis
mpirun -np ${X} -npernode 160 --nolocal --hostfile hostfile-mpi.txt /bin/true

# Since --nolocal doesn't work with prun, force a specific host 
# that is the same as would have been chosen by mpirun above
prun --system-server-only -np ${X} -npernode 160  --host nodeA /bin/true

For these I am reporting the minimum wall clock time of 5 consecutive runs.

@rhc54 (Contributor, Author) commented Nov 29, 2017

@jjhursey Okay, that makes sense, as the time difference would be explained by the time required to start and wire up the daemon for mpirun (and not for the DVM). Hard to understand the difference in launch time curvature, though, unless somehow the DVM daemons aren't spinning up the threads? Might be worth checking with a simple print statement.

@jjhursey (Member) commented

Silly me. I was passing the -mca odls_base_num_threads parameter to prun, not orte-dvm - that's why it wasn't showing any gains. Attached is a new graph from this morning that looks a bit better.

ppn-scaling-dvm.pdf

I wonder if the increasing tail at the end is the impact of all the processes exiting together; at that point there is only 1 thread, so it serializes. We are talking about tenths of a second on this system, though, so I don't know how much effort we want to put into getting that line flatter. But if there are folks launching many hundreds of processes per node, maybe it becomes more pronounced.

@rhc54 (Contributor, Author) commented Nov 29, 2017

w00t! That looks much better. Seems like the "sweet spot" is at about 4 threads for both cases at full scale, which is fewer than I would have expected. I'd have to think about how you'd do the reaper as that isn't as obvious to me. I'll also take a gander at this posix_spawn thing - maybe provide that as an experimental option.

@rhc54 (Contributor, Author) commented Nov 30, 2017

I updated this PR per the telecon:

Change the determination of the number of spawn threads to be based on the number of local procs in the first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.

Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.

Add MCA params to set (a usage sketch follows the list below):

  • max number of spawn threads (default: 4, based on @jjhursey data)
  • set a specific number of spawn threads (default: -1, indicating no set number)
  • cutoff - minimum number of local procs before using spawn threads (default: 32)
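
A usage sketch only: odls_base_num_threads is the parameter quoted earlier in this thread, while the names for the thread cap and the proc-count cutoff below are hypothetical placeholders, not necessarily the names this PR adds.

# Force a specific number of spawn threads (parameter quoted earlier in this thread)
mpirun --mca odls_base_num_threads 4 -np 256 ./my_app

# Hypothetical parameter names for the cap and the cutoff described above;
# the actual MCA param names added by this PR may differ
mpirun --mca odls_base_max_threads 8 --mca odls_base_spawn_thread_cutoff 32 -np 256 ./my_app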

Commit: …exec process to speed up the launch on large SMPs. Harvest the threads after initial spawn to minimize any impact on running jobs.

Change the determination of #spawn threads to be done on basis of #local procs in first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.

Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.

Add MCA params to set:
  * max number of spawn threads (default: 4)
  * set a specific number of spawn threads (default: -1, indicating no set number)
  * cutoff - minimum number of local procs before using spawn threads (default: 32)

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet (Contributor) commented

@rhc54 FYI, I pushed a fix for plm/alps. Could you please double check it?

static int orte_odls_base_close(void)
{
    int i;
    orte_proc_t *proc;
    opal_list_item_t *item;

    opal_output(0, "CLOSE");
A Contributor commented on this diff:

Is this debug code that should be removed?

@rhc54 (Contributor, Author) replied:

Yes, thanks - I didn't see you had made a change and was wondering why it suddenly wasn't failing! 😄

Signed-off-by: Ralph Castain <[email protected]>
@rhc54 rhc54 merged commit 0fcc996 into open-mpi:master Nov 30, 2017
@rhc54 rhc54 deleted the topic/odls branch November 30, 2017 14:38