MPICH support #562
Conversation
Thank you for creating this PR!
@terrytangyuan Can you approve CI?
Looks like CI failed.
@sheevy Can you address the error in CI?
Do you have any hints where to look? I couldn't see anything in the logs which would point in the right direction.
CI says we need to regenerate manifests. Did you run …?
I have not. Thanks for the hint, I will check that tomorrow.
Also, you can reproduce the error with Lines 89 to 91 in fda0532.
```diff
@@ -35,7 +35,8 @@ var (
 	validMPIImplementations = sets.NewString(
 		string(kubeflow.MPIImplementationOpenMPI),
-		string(kubeflow.MPIImplementationIntel))
+		string(kubeflow.MPIImplementationIntel),
+		string(kubeflow.MPIImplementationMPICH))
```
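For context, a rough sketch of how a set like this is typically consumed during validation; the function name and field path below are my assumptions, not necessarily the exact code in this repository:

```go
// Hypothetical sketch: reject any MPIJob whose mpiImplementation is not in the
// validMPIImplementations set. Assumes the package's usual imports (the kubeflow
// v2beta1 types and k8s.io/apimachinery/pkg/util/validation/field).
func validateMPIImplementation(spec *kubeflow.MPIJobSpec, path *field.Path) field.ErrorList {
	var errs field.ErrorList
	if !validMPIImplementations.Has(string(spec.MPIImplementation)) {
		errs = append(errs, field.NotSupported(path.Child("mpiImplementation"),
			spec.MPIImplementation, validMPIImplementations.List()))
	}
	return errs
}
```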
I'm still waiting for this.
Should be in now. I tried to refactor into a common function so that both the Intel and MPICH tests just call it. As part of that refactoring, the "valid with worker" test uses a different restart policy than it had before. However, I'm fine with that, as it's not what's being tested.
Thanks. Let me check.
Also, can you update the PR title with …?
Overall LGTM except for a unit test for validation.
/assign @alculquicondor
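For illustration only, a sketch of the shape such a unit test could take; it exercises the package-level validMPIImplementations set directly, and the real test would more likely go through the exported validation entry point. Names and structure here are placeholders, not the test added in this PR:

```go
// Hypothetical table-driven check that MPICH is accepted and unknown values
// are rejected.
func TestMPIImplementationIsSupported(t *testing.T) {
	cases := map[string]struct {
		impl      kubeflow.MPIImplementation
		wantValid bool
	}{
		"OpenMPI is accepted": {impl: kubeflow.MPIImplementationOpenMPI, wantValid: true},
		"Intel is accepted":   {impl: kubeflow.MPIImplementationIntel, wantValid: true},
		"MPICH is accepted":   {impl: kubeflow.MPIImplementationMPICH, wantValid: true},
		"unknown is rejected": {impl: kubeflow.MPIImplementation("FooMPI"), wantValid: false},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := validMPIImplementations.Has(string(tc.impl)); got != tc.wantValid {
				t.Errorf("validMPIImplementations.Has(%q) = %v, want %v", tc.impl, got, tc.wantValid)
			}
		})
	}
}
```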
Thanks @tenzen-y. I will check on Monday.
I think I've addressed all the comments and suggestions. I'm happy for it to get re-tested. Please let me know if you have any further feedback.
Hey @terrytangyuan, can you approve another run? Fixes for all suggestions are in.
```go
func GenerateValidJob(mpiImplementation kubeflow.MPIImplementation, hasWorkers bool) kubeflow.MPIJob {
```
Generally, we prefer not to generate test data programmatically in a unit test. If we do introduce programmatic test data, I prefer to use method chains.
@alculquicondor WDYT?
Sure, I will adjust the code once there's a recommendation.
My goal was to reduce repetition and to generate a valid test job for both Intel and MPICH using the same code. Is it possible to achieve that using method chains? Or shall I just have a copy for each MPI implementation?
I imagined things like kueue: https://github.com/kubernetes-sigs/kueue/blob/bd3caa823f0318d636ac39241938da987b31a9bf/pkg/util/testing/wrappers.go
However, wrappers may be overkill for the mpi-operator.
Let me know what @alculquicondor thinks.
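To make the suggestion concrete, a minimal sketch of what a method-chain wrapper could look like for MPIJob fixtures; all names here are hypothetical, and this is not code from kueue or from this PR:

```go
// Hypothetical builder-style wrapper for constructing MPIJob test fixtures,
// in the spirit of kueue's testing wrappers (assumes the kubeflow v2beta1
// types and metav1 are imported).
type MPIJobWrapper struct{ kubeflow.MPIJob }

// MakeMPIJob starts a wrapper with only metadata filled in.
func MakeMPIJob(name, ns string) *MPIJobWrapper {
	return &MPIJobWrapper{kubeflow.MPIJob{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
	}}
}

// Implementation sets spec.mpiImplementation and returns the wrapper for chaining.
func (w *MPIJobWrapper) Implementation(impl kubeflow.MPIImplementation) *MPIJobWrapper {
	w.Spec.MPIImplementation = impl
	return w
}

// SlotsPerWorker sets spec.slotsPerWorker and returns the wrapper for chaining.
func (w *MPIJobWrapper) SlotsPerWorker(s int32) *MPIJobWrapper {
	w.Spec.SlotsPerWorker = &s
	return w
}

// Obj returns the built MPIJob.
func (w *MPIJobWrapper) Obj() *kubeflow.MPIJob {
	return &w.MPIJob
}
```

A test would then build fixtures like MakeMPIJob("valid", "default").Implementation(kubeflow.MPIImplementationMPICH).SlotsPerWorker(2).Obj().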
-1 on this function.
https://enterprisecraftsmanship.com/posts/dry-damp-unit-tests/
We also don't need wrappers for so few cases. Just be expressive.
+1 with @alculquicondor
Sure
build/base/entrypoint.sh
Was this necessary?
In OpenMPI, we don't need this file, because OpenMPI is failure tolerant and implements retries when a host is not found.
IntelMPI just fails. What happens with MPICH?
I recall having tested it last year, during the previous PR, and it was required. This makes sense, since IntelMPI is based on MPICH.
```diff
@@ -0,0 +1,7 @@
+FROM debian:bullseye as builder
```
Ideally, we should be using a newer version, but given that mpioperator/base still uses bullseye, this makes sense.
If you have the chance, I would welcome a PR to update the base image everywhere.
Noted. Happy to do it in a separate PR.
test/e2e/mpi_job_test.go
```diff
@@ -217,6 +217,54 @@ var _ = ginkgo.Describe("MPIJob", func() {
 	})
+
+	ginkgo.Context("with MPICH Implementation", func() {
+		ginkgo.When("running as root", func() {
```
Can we also have another test for running as user?
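A rough sketch of what that could look like, mirroring the root-user context; the helper names (createJobAndWaitForCompletion, expectConditionToBeTrue) and the exact field layout are assumptions rather than verified code from the test file:

```go
// Hypothetical e2e case running the MPICH job as a non-root user.
ginkgo.When("running as non-root user", func() {
	ginkgo.BeforeEach(func() {
		// Run launcher and workers with a non-zero UID so code paths that
		// assume root (e.g. launcher/ssh setup) are exercised.
		runAsUser := int64(1000)
		for _, rt := range []kubeflow.MPIReplicaType{kubeflow.MPIReplicaTypeLauncher, kubeflow.MPIReplicaTypeWorker} {
			mpiJob.Spec.MPIReplicaSpecs[rt].Template.Spec.SecurityContext = &corev1.PodSecurityContext{
				RunAsUser: &runAsUser,
			}
		}
	})

	ginkgo.It("should succeed", func() {
		mpiJob := createJobAndWaitForCompletion(mpiJob)
		expectConditionToBeTrue(mpiJob, kubeflow.JobSucceeded)
	})
})
```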
/lgtm
/approve
Thanks for adding the test for Intel as well!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
This is meant to be a continuation of #478. The original PR was stale, and master moved a lot, so it was easier to just create a new PR. I hope that is ok.
These are meant to be the same changes as in #478, but rebased on top of the current master. The main problem with the previous PR was that SlotsPerWorker used an environment variable to control the number of slots, but unfortunately no such variable exists in the case of MPICH. The suggested solution was to add the number of slots per worker to the hostfile. This PR does not implement that, because it was already done in #523.
I hope that's the correct understanding.
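For context, the hostfile-based approach from #523 amounts to encoding the slot count directly in each hostfile line instead of relying on an environment variable. A small illustrative sketch (the exact format strings are my assumption of each launcher's common convention, not copied from the operator):

```go
package main

import "fmt"

// hostfileLine shows how slots can be encoded per worker host. OpenMPI reads
// "host slots=N", while Hydra-based launchers (MPICH, Intel MPI) read "host:N".
func hostfileLine(host string, slots int, openMPI bool) string {
	if openMPI {
		return fmt.Sprintf("%s slots=%d", host, slots)
	}
	return fmt.Sprintf("%s:%d", host, slots)
}

func main() {
	// Hypothetical worker host name for illustration.
	fmt.Println(hostfileLine("mpich-test-worker-0.mpich-test.default.svc", 2, false))
}
```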