MPICH support #562

Merged: 16 commits merged into kubeflow:master on Jun 16, 2023
Conversation

@sheevy (Contributor) commented Jun 8, 2023

This is meant to be a continuation of #478. The original PR went stale and master moved a lot, so it was easier to just create a new PR. I hope that is ok.

These are meant to be the same changes as in #478, but rebased on top of the current master. The main problem with the previous PR was that SlotsPerWorker used an environment variable to control the number of slots, but unfortunately no such variable exists for MPICH. The suggested solution was to add the number of slots per worker to the hostfile. This PR does not implement that, because it was already done in #523.

I hope that's the correct understanding.
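For context, "slots per worker in the hostfile" means entries of roughly the following shape. The hostnames and counts below are illustrative only, not taken from this PR; MPICH's Hydra launcher reads host:slots entries, whereas OpenMPI uses a slots= keyword in its hostfile.

mpich-worker-0.mpich-worker-svc:2
mpich-worker-1.mpich-worker-svc:2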

google-oss-prow bot requested review from tenzen-y and zw0610 June 8, 2023 17:29
@tenzen-y (Member) commented Jun 8, 2023

Thank you for creating this PR!
I will review this tomorrow.

@tenzen-y mentioned this pull request Jun 8, 2023
@tenzen-y (Member) commented Jun 8, 2023

@terrytangyuan Can you approve CI?

@terrytangyuan (Member) commented:

Looks like it failed.

@tenzen-y (Member) commented Jun 8, 2023

@sheevy Can you address the error in CI?

@sheevy (Contributor, author) commented Jun 8, 2023

Do you have any hints on where to look? I couldn't see anything in the logs that would point me in the right direction.

@tenzen-y (Member) commented Jun 8, 2023

> Do you have any hints on where to look? I couldn't see anything in the logs that would point me in the right direction.

CI says we need to regenerate manifests. Did you run make generate locally?

@sheevy (Contributor, author) commented Jun 8, 2023

I have not. Thanks for the hint, I will check that tomorrow.

@tenzen-y (Member) commented Jun 8, 2023

Also, you can reproduce the error with make verify-generate locally.

mpi-operator/Makefile

Lines 89 to 91 in fda0532

.PHONY: verify-generate
verify-generate: generate
	git --no-pager diff --exit-code manifests/base deploy sdk pkg/apis pkg/client
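In other words, assuming the repository's standard local toolchain, the CI failure is fixed by regenerating the files, committing the result, and re-running the check:

make generate          # regenerate the manifests, deploy, sdk, pkg/apis and pkg/client code
make verify-generate   # passes only if the regenerated files match what is committed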

Resolved review threads: pkg/controller/mpi_job_controller.go, build/base/mpich-entrypoint.sh, Makefile, examples/v2beta1/pi/mpich.Dockerfile
@@ -35,7 +35,8 @@ var (
 	validMPIImplementations = sets.NewString(
 		string(kubeflow.MPIImplementationOpenMPI),
-		string(kubeflow.MPIImplementationIntel))
+		string(kubeflow.MPIImplementationIntel),
+		string(kubeflow.MPIImplementationMPICH))
Member commented:

I'm still waiting for this.

Contributor (author) commented:

Should be in now. I tried to refactor into a common function so that both the Intel and MPICH tests just call it. As part of that refactoring, the "valid with worker" test uses a different restart policy than it had before. However, I'm fine with that, as it's not what's being tested.

Member commented:

Thanks. Let me check.
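For context, validMPIImplementations is a string set from k8s.io/apimachinery/pkg/util/sets, and validation rejects any MPIJob whose mpiImplementation is not in the set. A minimal sketch of that membership check follows; the function and parameter names are illustrative, not copied from the repository.

// Sketch only: shows how a sets.String of allowed values is typically consulted
// during validation. "field" is k8s.io/apimachinery/pkg/util/validation/field.
func validateImplementation(impl kubeflow.MPIImplementation, fldPath *field.Path) field.ErrorList {
	var errs field.ErrorList
	if !validMPIImplementations.Has(string(impl)) {
		errs = append(errs, field.NotSupported(fldPath, impl, validMPIImplementations.List()))
	}
	return errs
}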

Resolved review threads: test/e2e/mpi_job_test.go, pkg/controller/mpi_job_controller.go
@tenzen-y mentioned this pull request Jun 9, 2023
@google-oss-prow google-oss-prow bot added size/L and removed size/XXL labels Jun 9, 2023
@tenzen-y (Member) commented Jun 9, 2023

Also, can you update the PR title to "MPICH support"?

@tenzen-y (Member) left a comment:

Overall LGTM except for a unit test for validation.

/assign @alculquicondor

@alculquicondor (Collaborator) commented:

Thanks @tenzen-y. I will check on Monday.

@sheevy changed the title from "MPICH support, take 2" to "MPICH support" Jun 9, 2023
@sheevy (Contributor, author) commented Jun 9, 2023

I think I've addressed all the comments and suggestions, and I'm happy for it to be re-tested. Please let me know if you have any further feedback.

@sheevy (Contributor, author) commented Jun 13, 2023

Hey @terrytangyuan, can you approve another CI run? Fixes for all the suggestions are in.

Comment on lines 29 to 30
func GenerateValidJob(mpiImplementation kubeflow.MPIImplementation, hasWorkers bool) kubeflow.MPIJob {

Member commented:

Generally, we prefer not to generate test data programmatically in a unit test. If we do introduce programmable test data, I'd prefer method chains.

@alculquicondor WDYT?

Contributor (author) commented:

Sure, I will adjust the code once there's a recommendation.
My goal was to reduce repetition and generate a valid test job for both Intel and MPICH from the same code. Is that possible to achieve with method chains? Or shall I just have a copy for each MPI implementation?

Member commented:

I was imagining something like kueue's testing wrappers: https://github.com/kubernetes-sigs/kueue/blob/bd3caa823f0318d636ac39241938da987b31a9bf/pkg/util/testing/wrappers.go

However, wrappers may be overkill for the mpi-operator.
Let me know what @alculquicondor thinks.
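For readers unfamiliar with the pattern, a method-chain (wrapper) test helper looks roughly like the sketch below. The type and methods are illustrative only, in the style of kueue's wrappers; they are not code from this PR or from kueue, and they assume kubeflow refers to the imported v2beta1 API package, as elsewhere in this conversation.

// Illustrative sketch: each method mutates the wrapped object and returns the
// wrapper, so a test case reads as a single chain.
type MPIJobWrapper struct{ kubeflow.MPIJob }

func MakeMPIJob(name string) *MPIJobWrapper {
	w := &MPIJobWrapper{}
	w.Name = name
	return w
}

func (w *MPIJobWrapper) Implementation(impl kubeflow.MPIImplementation) *MPIJobWrapper {
	w.Spec.MPIImplementation = impl
	return w
}

func (w *MPIJobWrapper) Obj() *kubeflow.MPIJob {
	return &w.MPIJob
}

A test case would then read, for example:

job := MakeMPIJob("valid-mpich").Implementation(kubeflow.MPIImplementationMPICH).Obj()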

Collaborator commented:

-1 on this function.

https://enterprisecraftsmanship.com/posts/dry-damp-unit-tests/

We also don't need wrappers for so few cases. Just be expressive.

Member commented:

+1 with @alculquicondor

@terrytangyuan (Member) commented:

Sure

Collaborator commented:

Was this necessary?

In OpenMPI we don't need this file, because OpenMPI is failure-tolerant and retries when a host is not found.

IntelMPI just fails. What happens with MPICH?

Contributor (author) commented:

I recall testing it last year, during the previous PR, and it was required. That makes sense, since IntelMPI is based on MPICH.

@@ -0,0 +1,7 @@
FROM debian:bullseye as builder
Collaborator commented:

Ideally, we should be using a newer version, but given that mpioperator/base still uses bullseye, this makes sense.

If you have the chance, I would welcome a PR to update the base image everywhere.

Contributor (author) commented:

Noted. Happy to do it in a separate PR.

Resolved review thread: examples/v2beta1/pi/mpich.Dockerfile

Resolved review threads: pkg/controller/mpi_job_controller.go (two threads)
@@ -217,6 +217,54 @@ var _ = ginkgo.Describe("MPIJob", func() {
 	})
 
+	ginkgo.Context("with MPICH Implementation", func() {
+		ginkgo.When("running as root", func() {
Collaborator commented:

Can we also have another test for running as a non-root user?
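A rough sketch of what that extra case could look like inside the same Describe block follows. The helper functions, the mpiJob variable from the surrounding context, and the non-root image variable are all hypothetical, not taken from this PR; the sketch only mirrors the structure of the "running as root" case.

// Hypothetical shape only; helpers and the image variable are illustrative.
ginkgo.When("running as non-root", func() {
	ginkgo.BeforeEach(func() {
		// Assumed: swap in images built for a non-root user on launcher and workers.
		mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template.Spec.Containers[0].Image = mpichNonRootImage // hypothetical
		mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker].Template.Spec.Containers[0].Image = mpichNonRootImage   // hypothetical
	})

	ginkgo.It("should succeed", func() {
		mpiJob := createJobAndWaitForCompletion(mpiJob)        // hypothetical helper
		expectConditionToBeTrue(mpiJob, kubeflow.JobSucceeded) // hypothetical helper
	})
})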

@alculquicondor (Collaborator) left a comment:

/lgtm
/approve

Thanks for adding the test for Intel as well!

google-oss-prow bot added the lgtm label Jun 16, 2023
@google-oss-prow (bot) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 21f326d into kubeflow:master Jun 16, 2023