MPICH support #562

Merged: 16 commits merged into kubeflow:master on Jun 16, 2023
Conversation

@sheevy (Contributor) commented Jun 8, 2023

This is meant to be a continuation of #478. The original PR went stale and master moved a lot, so it was easier to just create a new PR. I hope that is ok.

These are meant to be the same changes as in #478, but rebased on top of the current master. The main problem with the previous PR was that SlotsPerWorker used an environment variable to control the number of slots, but unfortunately no such variable exists for MPICH. The suggested solution was to add the number of slots per worker to the hostfile. This PR does not implement that, because it was already done in #523.

I hope that's the correct understanding.
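For context, "slots per worker in the hostfile" means entries of roughly the following shape. The hostnames and counts below are illustrative only, not taken from this PR; MPICH's Hydra launcher reads host:slots entries, whereas OpenMPI uses a slots= keyword in its hostfile.

mpich-worker-0.mpich-worker-svc:2
mpich-worker-1.mpich-worker-svc:2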

google-oss-prow bot requested review from tenzen-y and zw0610 June 8, 2023 17:29
@tenzen-y (Member) commented Jun 8, 2023

Thank you for creating this PR!
I will review this tomorrow.

@tenzen-y mentioned this pull request Jun 8, 2023
@tenzen-y (Member) commented Jun 8, 2023

@terrytangyuan Can you approve CI?

@terrytangyuan (Member) commented:

Looks like it failed.

@tenzen-y (Member) commented Jun 8, 2023

@sheevy Can you address the error in CI?

@sheevy (Contributor, author) commented Jun 8, 2023

Do you have any hints on where to look? I couldn't see anything in the logs that would point me in the right direction.

@tenzen-y (Member) commented Jun 8, 2023

> Do you have any hints on where to look? I couldn't see anything in the logs that would point me in the right direction.

CI says we need to regenerate manifests. Did you run make generate locally?

@sheevy (Contributor, author) commented Jun 8, 2023

I have not. Thanks for the hint, I will check that tomorrow.

@tenzen-y (Member) commented Jun 8, 2023

Also, you can reproduce the error with make verify-generate locally.

mpi-operator/Makefile

Lines 89 to 91 in fda0532

.PHONY: verify-generate
verify-generate: generate
	git --no-pager diff --exit-code manifests/base deploy sdk pkg/apis pkg/client
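In other words, assuming the repository's standard local toolchain, the CI failure is fixed by regenerating the files, committing the result, and re-running the check:

make generate          # regenerate the manifests, deploy, sdk, pkg/apis and pkg/client code
make verify-generate   # passes only if the regenerated files match what is committed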

Resolved review threads: pkg/controller/mpi_job_controller.go, build/base/mpich-entrypoint.sh, Makefile, examples/v2beta1/pi/mpich.Dockerfile
@@ -35,7 +35,8 @@ var (
 	validMPIImplementations = sets.NewString(
 		string(kubeflow.MPIImplementationOpenMPI),
-		string(kubeflow.MPIImplementationIntel))
+		string(kubeflow.MPIImplementationIntel),
+		string(kubeflow.MPIImplementationMPICH))
Member commented:

I'm still waiting for this.

Contributor (author) commented:

Should be in now. I tried to refactor into a common function so that both the Intel and MPICH tests just call it. As part of that refactoring, the "valid with worker" test uses a different restart policy than it had before. However, I'm fine with that, as it's not what's being tested.

Member commented:

Thanks. Let me check.
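For context, validMPIImplementations is a string set from k8s.io/apimachinery/pkg/util/sets, and validation rejects any MPIJob whose mpiImplementation is not in the set. A minimal sketch of that membership check follows; the function and parameter names are illustrative, not copied from the repository.

// Sketch only: shows how a sets.String of allowed values is typically consulted
// during validation. "field" is k8s.io/apimachinery/pkg/util/validation/field.
func validateImplementation(impl kubeflow.MPIImplementation, fldPath *field.Path) field.ErrorList {
	var errs field.ErrorList
	if !validMPIImplementations.Has(string(impl)) {
		errs = append(errs, field.NotSupported(fldPath, impl, validMPIImplementations.List()))
	}
	return errs
}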

Resolved review threads: test/e2e/mpi_job_test.go, pkg/controller/mpi_job_controller.go
@tenzen-y mentioned this pull request Jun 9, 2023
@google-oss-prow google-oss-prow bot added size/L and removed size/XXL labels Jun 9, 2023
@tenzen-y (Member) commented Jun 9, 2023

Also, can you update the PR title to "MPICH support"?

@tenzen-y (Member) left a comment:

Overall LGTM except for a unit test for validation.

/assign @alculquicondor

@alculquicondor (Collaborator) commented:

Thanks @tenzen-y. I will check on Monday.

@sheevy changed the title from "MPICH support, take 2" to "MPICH support" Jun 9, 2023
@sheevy (Contributor, author) commented Jun 9, 2023

I think I've addressed all the comments and suggestions, and I'm happy for it to be re-tested. Please let me know if you have any further feedback.

@sheevy (Contributor, author) commented Jun 13, 2023

Hey @terrytangyuan, can you approve another CI run? Fixes for all the suggestions are in.

Comment on lines 29 to 30
func GenerateValidJob(mpiImplementation kubeflow.MPIImplementation, hasWorkers bool) kubeflow.MPIJob {

Member commented:

Generally, we prefer not to generate test data programmatically in a unit test. If we do introduce programmable test data, I'd prefer method chains.

@alculquicondor WDYT?

Contributor (author) commented:

Sure, I will adjust the code once there's a recommendation.
My goal was to reduce repetition and generate a valid test job for both Intel and MPICH from the same code. Is that possible to achieve with method chains? Or shall I just have a copy for each MPI implementation?

Member commented:

I was imagining something like kueue's testing wrappers: https://github.com/kubernetes-sigs/kueue/blob/bd3caa823f0318d636ac39241938da987b31a9bf/pkg/util/testing/wrappers.go

However, wrappers may be overkill for the mpi-operator.
Let me know what @alculquicondor thinks.
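For readers unfamiliar with the pattern, a method-chain (wrapper) test helper looks roughly like the sketch below. The type and methods are illustrative only, in the style of kueue's wrappers; they are not code from this PR or from kueue, and they assume kubeflow refers to the imported v2beta1 API package, as elsewhere in this conversation.

// Illustrative sketch: each method mutates the wrapped object and returns the
// wrapper, so a test case reads as a single chain.
type MPIJobWrapper struct{ kubeflow.MPIJob }

func MakeMPIJob(name string) *MPIJobWrapper {
	w := &MPIJobWrapper{}
	w.Name = name
	return w
}

func (w *MPIJobWrapper) Implementation(impl kubeflow.MPIImplementation) *MPIJobWrapper {
	w.Spec.MPIImplementation = impl
	return w
}

func (w *MPIJobWrapper) Obj() *kubeflow.MPIJob {
	return &w.MPIJob
}

A test case would then read, for example:

job := MakeMPIJob("valid-mpich").Implementation(kubeflow.MPIImplementationMPICH).Obj()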

Collaborator commented:

-1 on this function.

https://enterprisecraftsmanship.com/posts/dry-damp-unit-tests/

We also don't need wrappers for so few cases. Just be expressive.

Member commented:

+1 with @alculquicondor

@terrytangyuan (Member) commented:

Sure

Collaborator commented:

Was this necessary?

In OpenMPI we don't need this file, because OpenMPI is failure-tolerant and retries when a host is not found.

IntelMPI just fails. What happens with MPICH?

Contributor (author) commented:

I recall testing it last year, during the previous PR, and it was required. That makes sense, since IntelMPI is based on MPICH.

@@ -0,0 +1,7 @@
FROM debian:bullseye as builder
Collaborator commented:

Ideally, we should be using a newer version, but given that mpioperator/base still uses bullseye, this makes sense.

If you have the chance, I would welcome a PR to update the base image everywhere.

Contributor (author) commented:

Noted. Happy to do it in a separate PR.

Resolved review thread: examples/v2beta1/pi/mpich.Dockerfile

Resolved review threads: pkg/controller/mpi_job_controller.go (two threads)
@@ -217,6 +217,54 @@ var _ = ginkgo.Describe("MPIJob", func() {
 	})
 
+	ginkgo.Context("with MPICH Implementation", func() {
+		ginkgo.When("running as root", func() {
Collaborator commented:

Can we also have another test for running as a non-root user?
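A rough sketch of what that extra case could look like inside the same Describe block follows. The helper functions, the mpiJob variable from the surrounding context, and the non-root image variable are all hypothetical, not taken from this PR; the sketch only mirrors the structure of the "running as root" case.

// Hypothetical shape only; helpers and the image variable are illustrative.
ginkgo.When("running as non-root", func() {
	ginkgo.BeforeEach(func() {
		// Assumed: swap in images built for a non-root user on launcher and workers.
		mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template.Spec.Containers[0].Image = mpichNonRootImage // hypothetical
		mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker].Template.Spec.Containers[0].Image = mpichNonRootImage   // hypothetical
	})

	ginkgo.It("should succeed", func() {
		mpiJob := createJobAndWaitForCompletion(mpiJob)        // hypothetical helper
		expectConditionToBeTrue(mpiJob, kubeflow.JobSucceeded) // hypothetical helper
	})
})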

@alculquicondor (Collaborator) left a comment:

/lgtm
/approve

Thanks for adding the test for Intel as well!

google-oss-prow bot added the lgtm label Jun 16, 2023
@google-oss-prow (bot) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 21f326d into kubeflow:master Jun 16, 2023