[WIP] Multinode kubeflow example using the MPI Operator. #587
This is a self-contained example that can be run on Kubeflow.
A user can download this notebook, run it in any Jupyter container through Kubeflow or on their own Kubernetes cluster, and follow the steps.
It creates an initial pipeline that downloads and parses the data, and a second pipeline that creates an MPI Job.
Everything is pulled from NGC, and the data volumes are created dynamically.
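For reviewers unfamiliar with the MPI Operator, here is a rough sketch of the kind of MPIJob manifest the second pipeline would end up submitting. It is not the notebook's actual code. The `build_mpijob` helper, the job name, the image tag placeholder, the `train.py` command, and the `kubeflow.org/v1` API version are all assumptions for illustration and should be matched to the operator version installed on the cluster.

```python
import json


def build_mpijob(name, image, workers, gpus_per_worker, command):
    """Return an MPIJob manifest as a plain dict (hypothetical helper)."""

    def replica(replicas, container):
        # Each replica spec wraps a pod template with a single container.
        return {
            "replicas": replicas,
            "template": {"spec": {"containers": [container]}},
        }

    # The launcher runs the MPI command; the workers only need the image and GPUs.
    launcher = {"name": "launcher", "image": image, "command": command}
    worker = {
        "name": "worker",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: depends on the installed operator
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": gpus_per_worker,
            "mpiReplicaSpecs": {
                "Launcher": replica(1, launcher),
                "Worker": replica(workers, worker),
            },
        },
    }


manifest = build_mpijob(
    name="multinode-example",                # hypothetical job name
    image="nvcr.io/nvidia/tensorflow:<tag>",  # placeholder: pick a real NGC tag
    workers=2,
    gpus_per_worker=1,
    command=["mpirun", "-np", "2", "python", "train.py"],  # hypothetical script
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, or passed to a pipeline step that creates Kubernetes resources.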
I still need to test that the training completes properly in a few configurations (and add some details around timing). Some additional details around cleanup may also be needed. I wanted to open this PR a little early to share it with a few people for content review.