MPI backend examples launch processes independently in each pod #89
Comments
@jwwandy, greetings. For now, as far as I know, the -n option in mpirun specifies the number of copies of the process to run, not the number of containers/pods.
@Akado2009 Nice to hear from you. Exactly as you mentioned, the -n option in mpirun specifies the number of copies of the process to run, and those copies should be scheduled across the available MPI nodes (however many there are). What confuses me is that the current examples seem to have no mechanism to discover pods so that Open MPI can treat them as a single MPI cluster and launch processes across pods. Instead, each pod functions as an independent MPI cluster launching its own group of processes. If the examples are only meant to launch multiple processes independently on each pod, rather than do distributed training across pods, then the current example works fine. However, I assume part of the point of Kubeflow is distributed training across multiple pods, which for now I can only achieve by adding an SSH key after the pods are created, as described in the Open MPI documentation: https://www.open-mpi.org/faq/?category=rsh
@jwwandy Sorry for the late response, I was busy with work. But yeah, you're right, this example treats each pod as a separate Open MPI cluster. I was thinking about making an upgraded version of this example that treats your k8s cluster as one Open MPI cluster, so your job would be truly distributed.
@Akado2009 Thanks for making it clear. My current workaround is quite dirty: a shell script plus the downward API to set up all the SSH configuration after the pods are created. I think this could (and should) be done by the controller.
I hope these short steps help someone.
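Roughly, a sketch of one way to do that manual SSH setup (not the exact script from the comment above; pod names are placeholders, and it assumes openssh-server and the Open MPI binaries are already installed in the image):

```sh
# Generate one shared key pair locally, then push it into every pod so that
# mpirun's ssh launcher can reach all of them without a password.
ssh-keygen -t rsa -N "" -f ./id_rsa

for pod in mnist-master-0 mnist-worker-0 mnist-worker-1 mnist-worker-2; do
  kubectl exec "$pod" -- mkdir -p /root/.ssh
  kubectl cp ./id_rsa     "$pod":/root/.ssh/id_rsa
  kubectl cp ./id_rsa.pub "$pod":/root/.ssh/authorized_keys
  # Start the SSH daemon inside the pod (requires openssh-server in the image).
  kubectl exec "$pod" -- /usr/sbin/sshd
done
```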
@jwwandy Yes, I agree that it should be done by the controller.
Any news about this issue?
Can mpi-operator solve your issue? What is your use case?
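(For reference, a minimal sketch of submitting an MPIJob to mpi-operator; the API version and field names vary between mpi-operator releases, and the image, name, and command here are placeholders, not part of this repository:)

```sh
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: pytorch-ddp-mnist
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example/pytorch-dist-mnist:latest   # placeholder image
            command: ["mpirun", "-n", "4", "--allow-run-as-root",
                      "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: example/pytorch-dist-mnist:latest   # placeholder image
EOF
```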
/area operator
https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml
When launching the MPI backend job example above with
ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]
in the Dockerfile, I expected distributed training that launches 1 process on each pod (4 in total, with 1 master and 3 workers). However, it seems that it launched 4 processes on each pod, each training independently.
Is there anything I misunderstood about this example?
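For reference, what I expected is behavior like the following (hostnames and the hostfile path are made up for illustration, and it assumes passwordless SSH between the pods), where a hostfile makes mpirun place one rank per pod instead of 4 local copies:

```sh
# Hypothetical hostfile: one line per pod, one slot each.
cat > /etc/mpi/hostfile <<'EOF'
mnist-master-0 slots=1
mnist-worker-0 slots=1
mnist-worker-1 slots=1
mnist-worker-2 slots=1
EOF

# With --hostfile, mpirun -n 4 places one rank on each listed host
# instead of launching 4 copies locally in the same pod.
mpirun -n 4 --hostfile /etc/mpi/hostfile --allow-run-as-root \
  python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
```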