[Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running #1583
Conversation
The review comments below refer to this excerpt from the RayJob reconciler:

```go
	err = r.updateState(ctx, rayJobInstance, jobInfo, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusFailedToGetJobStatus, err)
	// Dashboard service in head pod takes time to start, it's possible we get connection refused error.
	// Requeue after few seconds to avoid continuous connection errors.
	return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

// Update RayJob.Status (Kubernetes CR) from Ray Job Status from Dashboard service
if jobInfo != nil && jobInfo.JobStatus != rayJobInstance.Status.JobStatus {
```
In the bug reproduction by killing the head pod, the previous and current JobStatus were both RUNNING, so we wouldn't go into this branch.
There is another question related to issue #1489: does KubeRay respect the TTL when the status is JobDeploymentStatusFailedToGetJobStatus?
@kevin85421 No, the shutdown logic is only entered if the JobStatus is SUCCEEDED or FAILED. The check is currently done by comparing a timestamp with JobInfo.end_time, which is populated by Ray and retrieved via HTTP. I will create a follow-up issue to respect the TTL in the case where the JobStatus cannot be retrieved.
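For context, here is a minimal, self-contained sketch of the kind of TTL comparison described above, assuming end_time is a Unix timestamp in milliseconds; the type and function names are illustrative stand-ins, not the actual KubeRay code:

```go
package main

import (
	"fmt"
	"time"
)

// JobInfo is a stand-in for the struct KubeRay builds from the dashboard response.
// EndTime mirrors JobInfo.end_time, which Ray populates once the job finishes.
type JobInfo struct {
	EndTime uint64 // Unix milliseconds; 0 while the job is still running
}

// shutdownAfterTTL sketches the comparison described above: cleanup is only
// considered once end_time + TTL has passed.
func shutdownAfterTTL(jobInfo *JobInfo, ttlSeconds int32, now time.Time) bool {
	if jobInfo == nil || jobInfo.EndTime == 0 {
		// end_time is only set for SUCCEEDED or FAILED jobs, so a job stuck in
		// FailedToGetJobStatus never becomes eligible through this path.
		return false
	}
	endTime := time.UnixMilli(int64(jobInfo.EndTime))
	return now.After(endTime.Add(time.Duration(ttlSeconds) * time.Second))
}

func main() {
	finished := &JobInfo{EndTime: uint64(time.Now().Add(-10 * time.Minute).UnixMilli())}
	fmt.Println(shutdownAfterTTL(finished, 300, time.Now())) // true: the 5-minute TTL has elapsed
	fmt.Println(shutdownAfterTTL(nil, 300, time.Now()))      // false: no job info was retrieved
}
```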
Could you please open an issue to track the TTL issue? Thanks! Because this PR does not have tests, I will clone your fork to test it manually.

I followed the reproduction script. When I killed the head Pod, the submitter K8s Job reached its backoffLimit within 3 seconds. Therefore, when the head Pod became ready again, the submitter K8s Job would no longer submit the Ray job.

Discussed with @architkulkarni offline. The reproduction in the PR description cannot test this PR properly. Hence, Archit will work on tests or reproduction scripts to simulate: (1) the head Pod is always healthy, and (2) a status check request from KubeRay is dropped due to some network issue.
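For reference, the head-Pod-kill reproduction discussed above looks roughly like the following; the label selector follows KubeRay's usual pod labels, and the pod/job names are placeholders to fill in:

```sh
# Find and delete the head Pod of the RayCluster created for the RayJob.
kubectl get pods -l ray.io/node-type=head
kubectl delete pod <head-pod-name>

# The submitter K8s Job retries and quickly reaches its backoffLimit; after that it
# will not resubmit the Ray job even once the head Pod is ready again.
kubectl get job <rayjob-name> -o jsonpath='{.status.failed}{"\n"}'
```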
@kevin85421 Adding mocks for an end-to-end test in CI turned out to be cumbersome. I also tried to do a manual test using [...]. For now I have added a unit test; let me know if this is sufficient.
Unit tests are not sufficient to prove this PR's correctness, but I am OK with not adding e2e tests and testing this PR manually instead, because (1) we have a plan to refactor the RayJob codebase after the v1.0.0 release, and (2) this PR is a release blocker for KubeRay v1.0.0.
Wow. I will try it.
@architkulkarni Could you try the steps below to simulate the dashboard becoming unreachable?

```yaml
# Add the following securityContext to the Ray head configuration in your RayJob YAML.
securityContext:
  capabilities:
    add: ['NET_ADMIN']
```

```sh
# Log in to the head Pod and install curl and iptables.
sudo apt-get update; sudo apt-get install -y curl iptables

# Try to connect to the Ray dashboard.
curl -X GET 127.0.0.1:8265
# <!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.1f147255.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

# Drop all incoming packets to the dashboard port.
sudo iptables -A INPUT -p tcp --dport 8265 -j DROP

# Try to connect to the Ray dashboard again.
curl -X GET 127.0.0.1:8265
# curl: (28) Failed to connect to 127.0.0.1 port 8265: Connection timed out
```
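Presumably the rule is removed the same way afterwards to restore connectivity (the exact command is not shown in the thread); a matching `iptables -D` should work:

```sh
# Remove the DROP rule added above; the rule spec must match the -A command exactly.
sudo iptables -D INPUT -p tcp --dport 8265 -j DROP

# The dashboard should respond again.
curl -X GET 127.0.0.1:8265
```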
@kevin85421 Thanks for the manual test instructions! It works and the fix behaves correctly. After blocking the packets and waiting about 1-2 minutes: [...] After deleting the blocking rule with [...]
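One way to observe the transition from the KubeRay side is to poll the RayJob status; the field path below is assumed to match the RayJob CRD status, and `rayjob-sample` is a placeholder name:

```sh
# Check the deployment status while the dashboard port is blocked and after it is unblocked.
kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}{"\n"}'
# While packets are dropped:      FailedToGetJobStatus
# After the DROP rule is removed: Running
```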
Why are these changes needed?

The job reconciler pings the job status endpoint every 3 seconds. Previously, if this failed at any point during job execution, the job deployment status would be set to JobDeploymentStatusFailedToGetJobStatus and would never be updated back to JobDeploymentStatusRunning, because, due to an oversight, the existing code only updates the JobDeploymentStatus when the JobStatus changes. Thus, when the JobStatus does not change (e.g. "RUNNING" -> "RUNNING") but there is an intermittent failure to get the job status, the JobDeploymentStatus stays stuck at JobDeploymentStatusFailedToGetJobStatus.

This PR fixes the bug by explicitly updating the status back to JobDeploymentStatusRunning when the reconcile loop successfully gets a JobStatus and the previous status is JobDeploymentStatusFailedToGetJobStatus.
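A minimal, self-contained sketch of the fixed condition described above (stand-in types and values, not the exact reconciler code):

```go
package main

import "fmt"

// Stand-ins for the KubeRay status types referenced above.
type JobStatus string
type JobDeploymentStatus string

const (
	JobStatusRunning                        JobStatus           = "RUNNING"
	JobDeploymentStatusRunning              JobDeploymentStatus = "Running"
	JobDeploymentStatusFailedToGetJobStatus JobDeploymentStatus = "FailedToGetJobStatus"
)

// shouldUpdateStatus sketches the fixed guard: update not only when the fetched
// JobStatus differs from the stored one, but also when the deployment status is
// currently FailedToGetJobStatus and a JobStatus was fetched successfully again,
// so the status can transition back to Running.
func shouldUpdateStatus(fetched, stored JobStatus, deployment JobDeploymentStatus) bool {
	return fetched != stored || deployment == JobDeploymentStatusFailedToGetJobStatus
}

func main() {
	// RUNNING -> RUNNING after a transient fetch failure: the old code skipped the
	// update; with the fix, the deployment status is set back to Running.
	fmt.Println(shouldUpdateStatus(JobStatusRunning, JobStatusRunning, JobDeploymentStatusFailedToGetJobStatus)) // true
	// RUNNING -> RUNNING with a healthy deployment status: nothing to update.
	fmt.Println(shouldUpdateStatus(JobStatusRunning, JobStatusRunning, JobDeploymentStatusRunning)) // false
}
```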
Testing: unit test
Related issue number
Closes #1489