
[WIP] Multinode Kubeflow example using the MPI Operator. #587

Closed
wants to merge 4 commits

Conversation

supertetelman
Collaborator

This is a self-contained example that can be run on Kubeflow.

A user can download this notebook, run it in any Jupyter container through Kubeflow or on their own Kubernetes cluster, and follow the steps.

It creates an initial pipeline that will download/parse data and a second pipeline that will create an MPI Job.

Everything is pulled from NGC, and the data volumes are created dynamically.

I still need to test that the training completes properly in a few configurations (and add some details around timing), and some additional details around cleanup may be needed. I wanted to open this PR a little early to share it with a few people for content review.
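
For content reviewers without the notebook open, here is a minimal sketch of the two pipelines as they might look with the kfp SDK; the image name, script, volume size, and the trimmed MPIJob spec below are illustrative assumptions, not the notebook's actual contents:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="data-prep", description="Download and parse the dataset")
def data_prep_pipeline():
    # Dynamically create a data volume (name and size are assumptions).
    vop = dsl.VolumeOp(
        name="create-data-pvc",
        resource_name="data-pvc",
        size="100Gi",
    )
    dsl.ContainerOp(
        name="download-and-parse",
        image="nvcr.io/nvidia/pytorch:20.06-py3",                  # assumed NGC image
        command=["bash", "-c", "./download_and_parse.sh /data"],   # hypothetical script
        pvolumes={"/data": vop.volume},
    )


@dsl.pipeline(name="mpi-train", description="Launch the multi-node MPIJob")
def mpi_train_pipeline():
    # Heavily trimmed MPIJob; the real spec mounts the code/data PVCs into the
    # launcher and worker replicas.
    mpijob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "MPIJob",
        "metadata": {"generateName": "mpi-"},
        "spec": {"slotsPerWorker": 1, "mpiReplicaSpecs": {}},
    }
    dsl.ResourceOp(name="create-mpijob", k8s_resource=mpijob, action="create")
```

dsl.ResourceOp is shown here as one way to submit the MPIJob from a pipeline step; whether the notebook uses ResourceOp or an MPIJob client directly is not confirmed here.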

@supertetelman
Collaborator Author

It looks like this is failing when I try to scale it out. Something about how this example mounts the data volume across all workers is causing an error. I'll have to spend some time debugging this before we can merge this PR.

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    <unknown>            default-scheduler  Successfully assigned kubeflow/mpi-d9d9dc80-worker-0 to gpu01
  Warning  FailedMount  7m (x2 over 9m14s)   kubelet, gpu01     Unable to attach or mount volumes: unmounted volumes=[code-pvc data-pvc], unattached volumes=[code-pvc data-pvc mpi-job-config default-token-22hh6]: timed out waiting for the condition
  Warning  FailedMount  5m4s (x11 over 11m)  kubelet, gpu01     MountVolume.SetUp failed for volume "pvc-0e1f0fed-c765-4edd-8cf6-94b3170449ab" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-0e1f0fed-c765-4edd-8cf6-94b3170449ab for pod kubeflow/mpi-d9d9dc80-worker-0. Volume is already attached by pod kubeflow/mpi-d9d9dc80-launcher-6sxfv. Status Pending
  Warning  FailedMount  4m42s                kubelet, gpu01     Unable to attach or mount volumes: unmounted volumes=[code-pvc data-pvc], unattached volumes=[mpi-job-config default-token-22hh6 code-pvc data-pvc]: timed out waiting for the condition
  Warning  FailedMount  60s (x13 over 11m)   kubelet, gpu01     MountVolume.SetUp failed for volume "pvc-0ff81c3f-28fd-4c8b-85eb-bcb003eb02e8" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-0ff81c3f-28fd-4c8b-85eb-bcb003eb02e8 for pod kubeflow/mpi-d9d9dc80-worker-0. Volume is already attached by pod kubeflow/mpi-d9d9dc80-launcher-6sxfv. Status Pending
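
The "Volume is already attached by pod ... launcher" message suggests these Rook-backed PVCs are effectively single-attach, so the launcher's mount blocks the workers. One quick way to check the claims' access modes and storage class from the notebook, a sketch using the official kubernetes Python client (the kubeflow namespace is assumed):

```python
from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()
for pvc in v1.list_namespaced_persistent_volume_claim("kubeflow").items:
    # Volumes mounted by the launcher and every worker generally need ReadWriteMany;
    # a ReadWriteOnce claim can only be attached to a single node at a time.
    print(pvc.metadata.name, pvc.spec.access_modes, pvc.spec.storage_class_name)
```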

michael-balint self-assigned this Aug 6, 2020
@supertetelman
Collaborator Author

Temporarily closing this PR until I can fix this example.

@supertetelman
Collaborator Author

Looks like better support was added for PVs in distributed training jobs: kubeflow/common#19

@supertetelman
Collaborator Author

supertetelman commented Nov 19, 2020

All the previous issues have been resolved, and with the inclusion of the nfs-client-provisioner, shared storage is no longer an issue.
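
For reference, a sketch of how a pipeline step can request a shared volume from the new provisioner; the storage class name "nfs-client" and the size are assumptions and depend on how the provisioner was deployed:

```python
from kfp import dsl

# Request a ReadWriteMany volume from the NFS provisioner so the MPI launcher
# and all workers can mount the same PVC at once.
vop = dsl.VolumeOp(
    name="create-shared-pvc",
    resource_name="shared-data",
    size="100Gi",
    storage_class="nfs-client",   # assumed storage class name
    modes=dsl.VOLUME_MODE_RWM,    # ReadWriteMany
)
```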

In order for this demo to be seamless, I will need to address the issues described in the links below. Essentially, pipelines only support namespaced objects by default on GCP, and this functionality is not in the on-prem installation by default. So some work will need to be done to enable this configuration: it may be a change to the Kubeflow install packages, some additional roles/bindings created in K8s, some slightly different Kubeflow APIs, or likely a combination of those things.

kubeflow/pipelines#4746

https://docs.google.com/document/d/1Ws4X1oNlaczhESNuEanZxbF-cnSfO78B1rBHWOkIAzo/edit#heading=h.ug06an51cdc8

I also moved the work to supertetelman:kubeflow-mpi.
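
For anyone trying to reproduce the namespaced-pipelines issue, the call that needs the extra permissions is submitting a run from the notebook into the user's profile namespace; a hedged sketch (host/auth details and the namespace name are assumptions, and on-prem installs currently need the extra roles/bindings mentioned above for this to work):

```python
import kfp

# In a multi-user install, runs are created in the user's profile namespace, and
# the notebook's ServiceAccount needs permission to talk to ml-pipeline there.
client = kfp.Client()  # in-cluster; host/cookies may be required depending on the install
client.create_run_from_pipeline_func(
    mpi_train_pipeline,            # pipeline function from the notebook
    arguments={},
    experiment_name="mpi-demo",    # assumed experiment name
    namespace="my-profile",        # assumed profile namespace
)
```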
