MPI backend examples launch processes independently in each pod #89
Comments
@jwwandy, greetings. For now, as far as I know, the -n option in mpirun specifies the number of copies of the process to run, not the number of containers/pods.
@Akado2009 Nice to hear from you. Exactly as you mentioned, the -n option in mpirun specifies the number of copies of the process to run, and those copies should be scheduled across the available MPI nodes (however many there are). What confuses me is that the current examples seem to have no mechanism to discover pods so that Open MPI can treat them as a single MPI cluster and launch processes across pods. Instead, each pod functions as an independent MPI cluster launching its own group of processes. If the examples are only meant to launch multiple processes independently on each pod, rather than do distributed training across pods, then the current example works fine. However, I assume part of the point of Kubeflow is distributed training across multiple pods, which for now I can only achieve by adding an SSH key after the pods are created, as described in the Open MPI documentation: https://www.open-mpi.org/faq/?category=rsh
@jwwandy Sorry for the late response, I was busy with work. But yeah, you're right, this example treats each pod as a separate Open MPI cluster. I was thinking about making an upgraded version of this example that treats your k8s cluster as one Open MPI cluster, so your job would be truly distributed.
@Akado2009 Thanks for making it clear. My current workaround is quite dirty: a shell script plus the downward API to set up all the SSH configuration after the pods are created. I think this could (and should) be done by the controller.
I hope these short steps help someone.
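Roughly, a sketch of one way to do that manual SSH setup (not the exact script from the comment above; pod names are placeholders, and it assumes openssh-server and the Open MPI binaries are already installed in the image):

```sh
# Generate one shared key pair locally, then push it into every pod so that
# mpirun's ssh launcher can reach all of them without a password.
ssh-keygen -t rsa -N "" -f ./id_rsa

for pod in mnist-master-0 mnist-worker-0 mnist-worker-1 mnist-worker-2; do
  kubectl exec "$pod" -- mkdir -p /root/.ssh
  kubectl cp ./id_rsa     "$pod":/root/.ssh/id_rsa
  kubectl cp ./id_rsa.pub "$pod":/root/.ssh/authorized_keys
  # Start the SSH daemon inside the pod (requires openssh-server in the image).
  kubectl exec "$pod" -- /usr/sbin/sshd
done
```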
@jwwandy Yes, I agree that it should be done by the controller.
Any news about this issue?
Can mpi-operator solve your issue? What is your use case?
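(For reference, a minimal sketch of submitting an MPIJob to mpi-operator; the API version and field names vary between mpi-operator releases, and the image, name, and command here are placeholders, not part of this repository:)

```sh
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: pytorch-ddp-mnist
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example/pytorch-dist-mnist:latest   # placeholder image
            command: ["mpirun", "-n", "4", "--allow-run-as-root",
                      "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: example/pytorch-dist-mnist:latest   # placeholder image
EOF
```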
/area operator
https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml
When launching the MPI backend job example above with
ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]
in the Dockerfile, I expected distributed training that launches 1 process on each pod (4 in total, with 1 master and 3 workers). However, it seems that it launched 4 processes on each pod, each training independently.
Is there anything I misunderstood about this example?
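For reference, what I expected is behavior like the following (hostnames and the hostfile path are made up for illustration, and it assumes passwordless SSH between the pods), where a hostfile makes mpirun place one rank per pod instead of 4 local copies:

```sh
# Hypothetical hostfile: one line per pod, one slot each.
cat > /etc/mpi/hostfile <<'EOF'
mnist-master-0 slots=1
mnist-worker-0 slots=1
mnist-worker-1 slots=1
mnist-worker-2 slots=1
EOF

# With --hostfile, mpirun -n 4 places one rank on each listed host
# instead of launching 4 copies locally in the same pod.
mpirun -n 4 --hostfile /etc/mpi/hostfile --allow-run-as-root \
  python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
```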