-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run workers first and wait for them #484
Conversation
yes, please add a configuration option. Maybe the boolean is fine. Another proposal could be something along these lines: launcherStartup: OnCreation|AfterWorkersReady But the boolean might be more readable. |
Fix lint errors. |
Any hit how to fix them? It seems to be in file that I didn't touch at all.. |
oh, it just looks like the linter is broken @terrytangyuan do you know how to upgrade it? |
Should be able to upgrade here: https://github.com/kubeflow/mpi-operator/blob/master/.github/workflows/golangci-lint.yml#L19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests please?
var ( | ||
launcherPodsCnt int | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
launcherPodsCnt := 0
func (c *MPIJobController) countReadyWorkerPods(workers []*corev1.Pod) (int) { | ||
ready := 0 | ||
for _, pod := range workers { | ||
for _, c := range pod.Status.Conditions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can replace this loop with IsStatusConditionTrue(pod.Status.Conditions, corev1.PodReady)
from pkg "k8s.io/apimachinery/pkg/api/meta"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does not seem to:
cannot convert pod.Status.Conditions (type []"k8s.io/api/core/v1".PodCondition) to type []"k8s.io/apimachinery/pkg/apis/meta/v1".Condition
return fmt.Errorf("creating launcher Pod: %w", err) | ||
} | ||
} else { | ||
klog.Infof("Waiting for workers to start...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use klog.V(4) and indicate the job name/namespace pls
@@ -55,6 +55,9 @@ type MPIJobSpec struct { | |||
// +kubebuilder:default:="/root/.ssh" | |||
SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"` | |||
|
|||
// Spawn launcher after all workers are in Ready state if true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment needs to start with the parameter name WaitForWorkers
; something like:
// Spawn launcher after all workers are in Ready state if true | |
// WaitForWorkers if true, the launcher is created only after all workers are in Ready state |
return fmt.Errorf("creating launcher Pod: %w", err) | ||
} | ||
} else { | ||
klog.Infof("Waiting for workers to start...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what event will trigger reconciling this job to create the launcher later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is triggered by pod events (running/ready), but not quite sure, however, tested and it is triggered.
Can we proceed here? |
@xhejtman, Can you rebase this PR? We moved all codes for v2 to the top of the repository. |
I guess this will have to wait for the next release |
I agree. if @xhejtman has no progress, I will take over this PR. |
Sorry for delay. I can rebase it, if you you already know what to change, please let me know, there were some proposed changes but I think there were not possible or I am not skilled enough to do them in go ;) |
The thing is that we are trying to cut the release today and I rather not delay it further. If you still have time to follow up on the rebase, you can certainly do so. But we wouldn't merge until a couple of days from now. |
df327cb
to
f975ef5
Compare
I did rebase but some tests are failing. Can you advise how to fix it? |
Signed-off-by: Lukas Hejtmanek <[email protected]>
I cleaned the mess of merges. However, tests are really failing. |
Signed-off-by: Lukas Hejtmanek <[email protected]>
Signed-off-by: Lukas Hejtmanek <[email protected]>
rebase instead of merge |
Signed-off-by: Lukas Hejtmanek <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhejtman Thanks for the updates! I left a comment for a nit.
Signed-off-by: Lukas Hejtmanek <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhejtman Thanks!
/lgtm
/assign @alculquicondor
@@ -154,6 +167,9 @@ type MPIJobSpec struct { | |||
// +kubebuilder:default:="/root/.ssh" | |||
SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"` | |||
|
|||
// launcherCreationPolicy if WaitForWorkersReady, the launcher is created only after all workers are in Ready state. Defaults to AtStartup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add:
// +kubebuilder:validation:Enum:AtStartup;WaitForWorkersReady
// +kubebuilder:default:=AtStartup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just below line 170?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Signed-off-by: Lukas Hejtmanek <[email protected]>
you need to regenerate the CRDs |
Signed-off-by: Lukas Hejtmanek <[email protected]>
Thanks, can you squash? |
don't know what is it. |
That means rebasing and squashing all commits into one. |
Well, couldn't it be done at your side? I am not skilled with git much, just basic commit/push |
I can do that. But if so, I will become the main committer and you are the co-author. Are you ok? |
No problem at all, I don't need any credit here ;) Thanks! |
OK. Let me try. |
I think I can force the squash. Let me try |
nvm The bot does squash by default. For example: fda0532 |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
Signed-off-by: Lukas Hejtmanek [email protected]
this is patch so that launcher waits for workers. Code works. Missing part is how to let it be configurable. Perhaps add an option into CRD at the same level as
sshAuthMountPath
? E.g.,waitForWorkers: true|false
?