Run workers first and wait for them #484

Merged: 21 commits, Jun 26, 2023

Conversation

xhejtman
Contributor

Signed-off-by: Lukas Hejtmanek [email protected]

This patch makes the launcher wait for the workers. The code works; what is missing is a way to make it configurable. Perhaps add an option to the CRD at the same level as sshAuthMountPath, e.g., waitForWorkers: true|false?

@alculquicondor
Collaborator

yes, please add a configuration option.

Maybe the boolean is fine. Another proposal could be something along these lines:

launcherStartup: OnCreation|AfterWorkersReady

But the boolean might be more readable.

@alculquicondor
Collaborator

Fix lint errors.
Other than that @ahg-g said he could have a look

@xhejtman
Contributor Author

Any hint on how to fix them? They seem to be in a file that I didn't touch at all.

@alculquicondor
Collaborator

oh, it just looks like the linter is broken

@terrytangyuan do you know how to upgrade it?

@terrytangyuan
Member

oh, it just looks like the linter is broken

@terrytangyuan do you know how to upgrade it?

Should be able to upgrade here: https://github.com/kubeflow/mpi-operator/blob/master/.github/workflows/golangci-lint.yml#L19

@alculquicondor
Collaborator

@xhejtman I upgraded the linter #485
Can you rebase?

@ahg-g ahg-g left a comment

Can you add tests please?

Comment on lines 920 to 922
var (
launcherPodsCnt int
)

launcherPodsCnt := 0

func (c *MPIJobController) countReadyWorkerPods(workers []*corev1.Pod) (int) {
ready := 0
for _, pod := range workers {
for _, c := range pod.Status.Conditions {

you can replace this loop with IsStatusConditionTrue(pod.Status.Conditions, corev1.PodReady) from pkg "k8s.io/apimachinery/pkg/api/meta"

Contributor Author

It does not seem to work:

cannot convert pod.Status.Conditions (type []"k8s.io/api/core/v1".PodCondition) to type []"k8s.io/apimachinery/pkg/apis/meta/v1".Condition
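
The error above arises because the suggested helper takes a slice of metav1.Condition, while a Pod carries a slice of PodCondition. One way around it is a small hand-written readiness check. The following is a minimal sketch only: the types are local stand-ins that mirror the shape of the corresponding k8s.io/api/core/v1 types so the snippet compiles on its own, and isPodReady is a hypothetical helper name, not something taken from this PR.

```go
package main

// Local stand-ins (assumptions) mirroring the shape of corev1.PodCondition,
// corev1.PodStatus and corev1.Pod. Real code would import k8s.io/api/core/v1.
type PodConditionType string
type ConditionStatus string

const (
	PodReady      PodConditionType = "Ready"
	ConditionTrue ConditionStatus  = "True"
)

type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus
}

type PodStatus struct {
	Conditions []PodCondition
}

type Pod struct {
	Status PodStatus
}

// isPodReady (hypothetical helper) reports whether the pod has a Ready
// condition whose status is True.
func isPodReady(pod *Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == PodReady {
			return cond.Status == ConditionTrue
		}
	}
	return false
}

// countReadyWorkerPods returns how many of the given worker pods are Ready.
func countReadyWorkerPods(workers []*Pod) int {
	ready := 0
	for _, pod := range workers {
		if isPodReady(pod) {
			ready++
		}
	}
	return ready
}
```

Splitting the per-pod check out of the counting loop keeps the inner loop variable from shadowing the controller receiver c, as it does in the snippet under review.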

return fmt.Errorf("creating launcher Pod: %w", err)
}
} else {
klog.Infof("Waiting for workers to start...")

use klog.V(4) and indicate the job name/namespace pls

@@ -55,6 +55,9 @@ type MPIJobSpec struct {
// +kubebuilder:default:="/root/.ssh"
SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"`

// Spawn launcher after all workers are in Ready state if true

The comment needs to start with the parameter name WaitForWorkers; something like:

Suggested change
- // Spawn launcher after all workers are in Ready state if true
+ // WaitForWorkers if true, the launcher is created only after all workers are in Ready state

return fmt.Errorf("creating launcher Pod: %w", err)
}
} else {
klog.Infof("Waiting for workers to start...")

what event will trigger reconciling this job to create the launcher later?

Contributor Author

I think this is triggered by pod events (running/ready), but I am not quite sure. I did test it, though, and reconciliation is triggered.
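
The thread does not settle which event requeues the job. As a rough model of the behaviour described in the reply, assume the controller watches Pods and, on every Pod add or update, enqueues the key of the owning MPIJob so that the next reconcile can re-check worker readiness. The sketch below uses stand-in types and hypothetical names (OwnerRef, queue, handlePodEvent); it is not verified against the mpi-operator source.

```go
package main

// OwnerRef and Pod are simplified stand-ins (assumptions) for the owner
// reference and Pod objects a real controller would receive from an informer.
type OwnerRef struct {
	Kind string
	Name string
}

type Pod struct {
	Namespace string
	Owner     *OwnerRef
}

// queue is a minimal stand-in for a controller work queue. Like a real work
// queue, it holds each pending key at most once.
type queue struct {
	keys []string
	seen map[string]bool
}

func newQueue() *queue {
	return &queue{seen: map[string]bool{}}
}

// Add enqueues key unless it is already pending.
func (q *queue) Add(key string) {
	if q.seen[key] {
		return
	}
	q.seen[key] = true
	q.keys = append(q.keys, key)
}

// handlePodEvent (hypothetical) runs on every Pod add or update. If the pod
// is owned by an MPIJob, the job's namespace/name key is enqueued so that the
// job is reconciled again and the launcher can be created once all workers
// are Ready.
func handlePodEvent(q *queue, pod *Pod) {
	if pod.Owner == nil || pod.Owner.Kind != "MPIJob" {
		return
	}
	q.Add(pod.Namespace + "/" + pod.Owner.Name)
}
```

Under this model, a worker turning Ready is just another Pod update, so no dedicated "all workers ready" event is needed.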

@xhejtman
Contributor Author

Can we proceed here?

@tenzen-y
Member

Can we proceed here?

@xhejtman, can you rebase this PR? We moved all the v2 code to the top of the repository.

@alculquicondor
Collaborator

I guess this will have to wait for the next release

@tenzen-y
Member

tenzen-y commented Apr 4, 2023

I guess this will have to wait for the next release

I agree. If @xhejtman makes no progress, I will take over this PR.

@xhejtman
Contributor Author

xhejtman commented Apr 4, 2023

Sorry for the delay. I can rebase it. If you already know what to change, please let me know. There were some proposed changes, but I think they were not possible, or I am not skilled enough to do them in Go ;)

@alculquicondor
Collaborator

The thing is that we are trying to cut the release today and I'd rather not delay it further. If you still have time to follow up on the rebase, you can certainly do so. But we wouldn't merge until a couple of days from now.

@xhejtman xhejtman force-pushed the wait branch 2 times, most recently from df327cb to f975ef5, April 4, 2023 21:31
@xhejtman
Contributor Author

xhejtman commented Apr 4, 2023

I did rebase but some tests are failing. Can you advise how to fix it?

@xhejtman
Contributor Author

xhejtman commented Apr 4, 2023

I cleaned up the mess of merges. However, the tests really are failing.

@xhejtman xhejtman reopened this Apr 4, 2023
@google-oss-prow google-oss-prow bot added size/M and removed size/XS labels Apr 4, 2023
xhejtman added 2 commits June 22, 2023 22:50
Signed-off-by: Lukas Hejtmanek <[email protected]>
Signed-off-by: Lukas Hejtmanek <[email protected]>
@alculquicondor
Collaborator

rebase instead of merge

Signed-off-by: Lukas Hejtmanek <[email protected]>
Member

@tenzen-y tenzen-y left a comment

@xhejtman Thanks for the updates! I left a comment for a nit.

pkg/apis/kubeflow/v2beta1/types.go (outdated, resolved)
Signed-off-by: Lukas Hejtmanek <[email protected]>
Member

@tenzen-y tenzen-y left a comment

@xhejtman Thanks!
/lgtm
/assign @alculquicondor

@@ -154,6 +167,9 @@ type MPIJobSpec struct {
// +kubebuilder:default:="/root/.ssh"
SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"`

// launcherCreationPolicy if WaitForWorkersReady, the launcher is created only after all workers are in Ready state. Defaults to AtStartup.
Collaborator

Add:

// +kubebuilder:validation:Enum:AtStartup;WaitForWorkersReady
// +kubebuilder:default:=AtStartup

Contributor Author

just below line 170?

Collaborator

yes
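
To make the outcome of this thread concrete, here is a minimal sketch of how the field, its markers, and a gate on launcher creation could fit together. The constant names, the trimmed MPIJobSpec, and shouldCreateLauncher are illustrative assumptions, not the merged code. The enum marker uses the `Enum=` spelling from the kubebuilder marker documentation, which differs from the `Enum:` spelling in the review comment.

```go
package main

// LauncherCreationPolicy selects when the launcher is created.
type LauncherCreationPolicy string

const (
	// Constant names are hypothetical; the values come from the review thread.
	LauncherCreationPolicyAtStartup           LauncherCreationPolicy = "AtStartup"
	LauncherCreationPolicyWaitForWorkersReady LauncherCreationPolicy = "WaitForWorkersReady"
)

// MPIJobSpec is a trimmed stand-in holding only the fields visible in the
// diff context above; the real type has more fields.
type MPIJobSpec struct {
	// +kubebuilder:default:="/root/.ssh"
	SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"`

	// LauncherCreationPolicy if WaitForWorkersReady, the launcher is created
	// only after all workers are in Ready state. Defaults to AtStartup.
	// +kubebuilder:validation:Enum=AtStartup;WaitForWorkersReady
	// +kubebuilder:default:=AtStartup
	LauncherCreationPolicy LauncherCreationPolicy `json:"launcherCreationPolicy,omitempty"`
}

// shouldCreateLauncher (hypothetical helper) decides whether the launcher may
// be created now. An empty policy is treated like AtStartup, matching the
// documented default.
func shouldCreateLauncher(policy LauncherCreationPolicy, readyWorkers, wantWorkers int) bool {
	if policy == LauncherCreationPolicyWaitForWorkersReady {
		return readyWorkers >= wantWorkers
	}
	return true
}
```

An enum leaves room for further policies later, which a boolean such as waitForWorkers would not.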

Signed-off-by: Lukas Hejtmanek <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Jun 26, 2023
@alculquicondor
Collaborator

you need to regenerate the CRDs

Signed-off-by: Lukas Hejtmanek <[email protected]>
@alculquicondor
Collaborator

Thanks, can you squash?

@xhejtman
Contributor Author

Thanks, can you squash?

I don't know what that is.

@tenzen-y
Member

Thanks, can you squash?

I don't know what that is.

That means rebasing and squashing all commits into one.
https://git-scm.com/docs/user-manual#interactive-rebase

@xhejtman
Contributor Author

Well, couldn't it be done on your side? I am not very skilled with git, just basic commit/push.

@tenzen-y
Member

Well, couldn't it be done on your side? I am not very skilled with git, just basic commit/push.

I can do that. But if so, I will become the main committer and you are the co-author. Are you ok?

@xhejtman
Contributor Author

Well, couldn't it be done on your side? I am not very skilled with git, just basic commit/push.

I can do that. But if so, I will become the main committer and you are the co-author. Are you ok?

No problem at all, I don't need any credit here ;) Thanks!

@tenzen-y
Member

Well, couldn't it be done on your side? I am not very skilled with git, just basic commit/push.

I can do that. But if so, I will become the main committer and you are the co-author. Are you ok?

No problem at all, I don't need any credit here ;) Thanks!

OK. Let me try.

@alculquicondor
Collaborator

I think I can force the squash. Let me try

@alculquicondor
Collaborator

nvm

The bot does squash by default.

For example: fda0532

@alculquicondor
Collaborator

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Collaborator

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 26, 2023
@tenzen-y
Member

nvm

The bot does squash by default.

For example: fda0532

Ah, right.
@xhejtman Thanks for your contribution!
/lgtm

@google-oss-prow google-oss-prow bot merged commit f8d815c into kubeflow:master Jun 26, 2023
5 participants