
[WIP] Multinode Kubeflow example using the MPI Operator. #587

Closed
wants to merge 4 commits

Conversation

supertetelman
Collaborator

This is a self-contained example that can be run on Kubeflow.

A user can download this notebook, run it in any Jupyter container through Kubeflow or on their own Kubernetes cluster, and follow the steps.

It creates an initial pipeline that will download/parse data and a second pipeline that will create an MPI Job.

Everything is pulled from NGC, and the data volumes are created dynamically.

I still need to test that the training completes properly in a few configurations (and add some details around timing), and some additional details around cleanup may be needed. I wanted to open this PR a little early to share it with a few people for content review.
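
For content reviewers without the notebook open, here is a minimal sketch of the two pipelines as they might look with the kfp SDK; the image name, script, volume size, and the trimmed MPIJob spec below are illustrative assumptions, not the notebook's actual contents:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="data-prep", description="Download and parse the dataset")
def data_prep_pipeline():
    # Dynamically create a data volume (name and size are assumptions).
    vop = dsl.VolumeOp(
        name="create-data-pvc",
        resource_name="data-pvc",
        size="100Gi",
    )
    dsl.ContainerOp(
        name="download-and-parse",
        image="nvcr.io/nvidia/pytorch:20.06-py3",                  # assumed NGC image
        command=["bash", "-c", "./download_and_parse.sh /data"],   # hypothetical script
        pvolumes={"/data": vop.volume},
    )


@dsl.pipeline(name="mpi-train", description="Launch the multi-node MPIJob")
def mpi_train_pipeline():
    # Heavily trimmed MPIJob; the real spec mounts the code/data PVCs into the
    # launcher and worker replicas.
    mpijob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "MPIJob",
        "metadata": {"generateName": "mpi-"},
        "spec": {"slotsPerWorker": 1, "mpiReplicaSpecs": {}},
    }
    dsl.ResourceOp(name="create-mpijob", k8s_resource=mpijob, action="create")
```

dsl.ResourceOp is shown here as one way to submit the MPIJob from a pipeline step; whether the notebook uses ResourceOp or an MPIJob client directly is not confirmed here.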

@supertetelman
Collaborator Author

It looks like this is failing when I try to scale it out. Something about how this example mounts the data volume across all workers is causing an error. I'll have to spend some time debugging this before we can merge this PR.

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    <unknown>            default-scheduler  Successfully assigned kubeflow/mpi-d9d9dc80-worker-0 to gpu01
  Warning  FailedMount  7m (x2 over 9m14s)   kubelet, gpu01     Unable to attach or mount volumes: unmounted volumes=[code-pvc data-pvc], unattached volumes=[code-pvc data-pvc mpi-job-config default-token-22hh6]: timed out waiting for the condition
  Warning  FailedMount  5m4s (x11 over 11m)  kubelet, gpu01     MountVolume.SetUp failed for volume "pvc-0e1f0fed-c765-4edd-8cf6-94b3170449ab" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-0e1f0fed-c765-4edd-8cf6-94b3170449ab for pod kubeflow/mpi-d9d9dc80-worker-0. Volume is already attached by pod kubeflow/mpi-d9d9dc80-launcher-6sxfv. Status Pending
  Warning  FailedMount  4m42s                kubelet, gpu01     Unable to attach or mount volumes: unmounted volumes=[code-pvc data-pvc], unattached volumes=[mpi-job-config default-token-22hh6 code-pvc data-pvc]: timed out waiting for the condition
  Warning  FailedMount  60s (x13 over 11m)   kubelet, gpu01     MountVolume.SetUp failed for volume "pvc-0ff81c3f-28fd-4c8b-85eb-bcb003eb02e8" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-0ff81c3f-28fd-4c8b-85eb-bcb003eb02e8 for pod kubeflow/mpi-d9d9dc80-worker-0. Volume is already attached by pod kubeflow/mpi-d9d9dc80-launcher-6sxfv. Status Pending
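
The "Volume is already attached by pod ... launcher" message suggests these Rook-backed PVCs are effectively single-attach, so the launcher's mount blocks the workers. One quick way to check the claims' access modes and storage class from the notebook, a sketch using the official kubernetes Python client (the kubeflow namespace is assumed):

```python
from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()
for pvc in v1.list_namespaced_persistent_volume_claim("kubeflow").items:
    # Volumes mounted by the launcher and every worker generally need ReadWriteMany;
    # a ReadWriteOnce claim can only be attached to a single node at a time.
    print(pvc.metadata.name, pvc.spec.access_modes, pvc.spec.storage_class_name)
```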

michael-balint self-assigned this Aug 6, 2020
@supertetelman
Collaborator Author

Temporarily closing this PR until I can fix this example.

@supertetelman
Collaborator Author

Looks like better support was added for PVs in distributed training jobs: kubeflow/common#19

@supertetelman
Collaborator Author

supertetelman commented Nov 19, 2020

All the previous issues have been resolved, and with the inclusion of the nfs-client-provisioner, shared storage is no longer an issue.
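
For reference, a sketch of how a pipeline step can request a shared volume from the new provisioner; the storage class name "nfs-client" and the size are assumptions and depend on how the provisioner was deployed:

```python
from kfp import dsl

# Request a ReadWriteMany volume from the NFS provisioner so the MPI launcher
# and all workers can mount the same PVC at once.
vop = dsl.VolumeOp(
    name="create-shared-pvc",
    resource_name="shared-data",
    size="100Gi",
    storage_class="nfs-client",   # assumed storage class name
    modes=dsl.VOLUME_MODE_RWM,    # ReadWriteMany
)
```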

In order for this demo to be seamless, I will need to address the issues described in the links below. Essentially, pipelines only support namespaced objects by default on GCP, and this functionality is not in the on-prem installation by default. So some work will need to be done to enable this configuration: it may be a change to the Kubeflow install packages, some additional roles/bindings created in K8s, some slightly different Kubeflow APIs, or likely a combination of those things.

kubeflow/pipelines#4746

https://docs.google.com/document/d/1Ws4X1oNlaczhESNuEanZxbF-cnSfO78B1rBHWOkIAzo/edit#heading=h.ug06an51cdc8

I also moved the work to supertetelman:kubeflow-mpi.
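
For anyone trying to reproduce the namespaced-pipelines issue, the call that needs the extra permissions is submitting a run from the notebook into the user's profile namespace; a hedged sketch (host/auth details and the namespace name are assumptions, and on-prem installs currently need the extra roles/bindings mentioned above for this to work):

```python
import kfp

# In a multi-user install, runs are created in the user's profile namespace, and
# the notebook's ServiceAccount needs permission to talk to ml-pipeline there.
client = kfp.Client()  # in-cluster; host/cookies may be required depending on the install
client.create_run_from_pipeline_func(
    mpi_train_pipeline,            # pipeline function from the notebook
    arguments={},
    experiment_name="mpi-demo",    # assumed experiment name
    namespace="my-profile",        # assumed profile namespace
)
```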
