-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] [RayJobs] Use finalizers to implement stopping a job upon cluster deletion #735
[Feature] [RayJobs] Use finalizers to implement stopping a job upon cluster deletion #735
Conversation
@Basasuya Do you mind helping me review this PR? Thank you! |
return err | ||
} | ||
// StopJob only returns an error when JobStatus is not in terminated states (STOPPED / SUCCEEDED / FAILED) | ||
if (jobInfo.JobStatus == rayv1alpha1.JobStatusPending) || (jobInfo.JobStatus == rayv1alpha1.JobStatusRunning) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we might change the states over time, I would suggest
- Defining the terminated states in a constant somewhere.
- Have this condition check if we're not terminated, using that constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I will update it tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I decided not to replace isJobSucceedOrFailed
and isJobPendingOrRunning
with IsJobTerminal
in this PR because current RayJob controller does not handle jobs with STOPPED
statuses explicitly. I will open an issue to track the progress.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 239 to 253 in 792af2a
// isJobSucceedOrFailed indicates whether the job comes into end status. | |
func isJobSucceedOrFailed(status rayv1alpha1.JobStatus) bool { | |
if status == rayv1alpha1.JobStatusSucceeded || status == rayv1alpha1.JobStatusFailed { | |
return true | |
} | |
return false | |
} | |
// isJobPendingOrRunning indicates whether the job is running. | |
func isJobPendingOrRunning(status rayv1alpha1.JobStatus) bool { | |
if status == rayv1alpha1.JobStatusPending || status == rayv1alpha1.JobStatusRunning { | |
return true | |
} | |
return false | |
} |
nits Also, cc @architkulkarni. |
overall looks good to me. /cc @Basasuya please have a check |
Seems one tricky problem is the stop job failed and finalizer has been removed. at least once won't guarantee the job stopped. In this case we need to reply on ray core's exact once semantics |
@kevin85421 and I talked about this. The idea is to be initially defensive, to avoid triggering a situation where there's an issue stopping the job and the user has to remove the finalizer manually. The "at least one attempt" semantics in this PR are better than the current behavior, which doesn't handle missed delete events. |
Once we're more confident in this, we can have it try to guarantee stopped jobs by retrying until the job is successfully stopped and only delete finalizer on success. |
Hi, I think for ray service #647, @sihanwang41 also put out the request that we also need the similar mechanism and we may reuse some codes from here. would it possible for put these two finalizer features into the 0.4.0 release? |
@kevin85421 @Jeffwan it looks good to me 👍 |
This PR is approved by 3 reviewers. Merge it. |
…luster deletion (ray-project#735) See ray-project#629 to get more context. The behavior of this PR is almost the same as ray-project#629. The only difference is that this PR promises that operator will try to stop the job at least once. In ray-project#629, if the RayJob is deleted when the operator is down, the operator will not try to stop the job.
Why are these changes needed?
See #629 to get more context. The behavior of this PR is almost the same as #629. The only difference is that this PR promises that operator will try to stop the job at least once. In #629, if the RayJob is deleted when the operator is down, the operator will not try to stop the job.
Finalizer reference
Need to discuss
kubectl patch rayjobs.ray.io/rayjob-sample-2 --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
Related issue number
Closes #676
#629
Checks
We should add tests for RayJob when the integration tests are stable enough. Use #664 to track the progress.
Manual tests
Step1: Install KubeRay operator
Step2: Install a RayCluster
YAML
Step3: Submit a RayJob with
clusterSelector
YAML
Step4: Use Dashboard or curl command to check the job status.
curl localhost:8265/api/jobs/rayjob-sample-2-j2hnn
(RESTful API doc)Step5: Use
kubectl describe rayjobs.ray.io rayjob-sample-2
to check the finalizer "ray.io/rayjob-finalizer".Example
Step6: Delete the RayJob
kubectl delete rayjobs.ray.io rayjob-sample-2
. The finalizer in Step5 will be removed by the operator. Hence, RayJob can be removed successfully.Step7: Check operator's log
Others (Feel free to ignore this part)
curl -X POST -H 'Content-Type: application/json' localhost:8265/api/jobs/02000000/stop
STOPPED
status,{"stopped": false}
will be returned.