[Bug] 409 conflicts cause two job creations #756
Comments
This is most unfortunate.
It'll be hard to change the Ray Jobs API because it's already GA. I like your idea of submitting a job with a name; the RayJob operator should probably generate its own random names, and it should manage the state and retries. Currently, the Ray Jobs API fails the submit request if the name is already in use.
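The name-based idea above can be sketched in miniature. This is a hedged illustration, not KubeRay or Ray code: `submit_job` and the in-memory `jobs` registry are hypothetical stand-ins for the Ray Jobs API, which (as noted above) fails a submit whose name is already in use. The operator-side wrapper turns that failure into a no-op, so retries are safe:

```python
import uuid

# Hypothetical in-memory stand-in for the Ray Jobs API's job registry.
jobs = {}

def submit_job(submission_id: str, entrypoint: str) -> str:
    """Mimic the API behavior described above: fail if the ID is in use."""
    if submission_id in jobs:
        raise ValueError(f"job {submission_id!r} already exists")
    jobs[submission_id] = entrypoint
    return submission_id

def submit_idempotently(submission_id: str, entrypoint: str) -> str:
    """Operator-side retry logic: treat 'already exists' as success,
    so a resubmission after a lost status update is a no-op."""
    try:
        return submit_job(submission_id, entrypoint)
    except ValueError:
        return submission_id  # already submitted; nothing to do

# The operator would derive one stable ID per RayJob (e.g. from its UID),
# so every reconcile iteration retries with the same ID.
sid = str(uuid.uuid5(uuid.NAMESPACE_DNS, "rayjob-sample"))
submit_idempotently(sid, "python script.py")
submit_idempotently(sid, "python script.py")  # retry: no duplicate
print(len(jobs))  # -> 1
```

The key point is that the *retry* is keyed on a stable name chosen before the first attempt, not on observed status.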
One possible solution is to combine RayCluster creation and job execution into an atomic operation. For example, we can use
The thing to be careful about with postStart hooks is that their execution is asynchronous relative to the container entrypoint (there's no ordering guarantee). Related Slack discussion: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959
Thank you for this information!
This issue is largely alleviated by #1000, but that PR does not fix the root cause.
…g DashboardHTTPClient (#1177)

This PR changes the way job submission is handled. Prior to this PR, the job controller reconciliation loop would check the Status of the job and, based on that, decide whether or not to submit the job. This design had a bug that would sometimes result in duplicate job submissions; see #756 for a full description of the root cause (in a nutshell, the Status might not be updated to "Running" immediately).

This PR fixes the issue because it no longer uses the Status field of the job to determine whether to submit the job upon each reconciliation. Instead, it creates a K8s Job to submit the Ray job. A K8s Job will not be duplicated even if its creation is attempted multiple times. The PR also exposes the pod template for the K8s Job pod that submits the job, so that the user can supply it in their RayJob manifest as needed. In the typical case, the user should not need to specify this; the default should be sufficient.

This is the first PR in a larger refactor of RayJob; this PR contains only a localized change in the submission path. See https://docs.google.com/document/d/1G--fKMCqp-M3T0qDPQy5eOQJQKRmubr7SyZQ_ms6Py4/edit for more details about the design.

Related issue number: Closes #756

Signed-off-by: Archit Kulkarni <[email protected]>
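The fix works because Kubernetes object creation is idempotent by name: creating a Job whose name already exists returns an AlreadyExists error rather than producing a second Job. A minimal sketch of that reconcile step, using a hypothetical in-memory stand-in for the Kubernetes API server (not client-go):

```python
class AlreadyExists(Exception):
    """Stand-in for the K8s API's AlreadyExists error."""

# Hypothetical in-memory stand-in for the Kubernetes API server.
k8s_jobs = {}

def create_job(name: str, spec: dict) -> dict:
    """Like the K8s API: a second create with the same name fails
    instead of creating a duplicate object."""
    if name in k8s_jobs:
        raise AlreadyExists(name)
    k8s_jobs[name] = spec
    return spec

def ensure_submitter_job(rayjob_name: str) -> None:
    """Reconcile step: create the submitter Job if it doesn't exist.
    Safe to call on every reconcile iteration, regardless of Status."""
    try:
        create_job(f"{rayjob_name}-submitter",
                   {"entrypoint": "ray job submit ..."})
    except AlreadyExists:
        pass  # a previous reconcile already created it

# Repeated reconciles cannot produce duplicate submissions.
for _ in range(3):
    ensure_submitter_job("rayjob-sample")
print(len(k8s_jobs))  # -> 1
```

The object name, not the controller's possibly-stale view of Status, is what deduplicates the submission.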
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I submitted a RayJob to an existing RayCluster with `clusterSelector` and then deleted it. I repeated the process 6 times, and 2 of them (at 11:13:54 and 11:32:05) created two jobs rather than one. I checked the logs and found that 409 conflicts occurred at both 11:13:54 and 11:32:05. The root cause is that `SubmitJob()` (Link) succeeded but `r.Status().Update(ctx, rayJob)` (Link) failed, so the job was submitted again in the next iteration of reconciliation.

Reproduction script
Follow the section "Manual tests" in #735. You need to try multiple times to reproduce the bug.
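The race in the report can also be reproduced in miniature without a cluster. This is a hedged Python sketch of the pre-#1177 controller logic, with all names (`submit_job`, `update_status`, `reconcile`) as hypothetical stand-ins: the first reconcile submits successfully but loses the status update to a 409 conflict, so the second reconcile still sees a stale status and submits again:

```python
submissions = []            # every call to the fake SubmitJob lands here
status = {"jobStatus": ""}  # RayJob .status as seen by the controller

def submit_job() -> None:
    submissions.append("job")  # the Ray job is now actually running

def update_status(fail: bool) -> None:
    if fail:
        raise RuntimeError("409 Conflict")  # stale resourceVersion
    status["jobStatus"] = "Running"

def reconcile(status_update_fails: bool) -> None:
    """Pre-#1177 logic: decide to submit based only on .status."""
    if status["jobStatus"] != "Running":
        submit_job()
        try:
            update_status(fail=status_update_fails)
        except RuntimeError:
            pass  # reconcile ends; status still says not-Running

reconcile(status_update_fails=True)   # submit succeeds, 409 on update
reconcile(status_update_fails=False)  # status looks stale -> resubmit!
print(len(submissions))  # -> 2  (the duplicate-job bug)
```

The job submission and the status update are two non-atomic writes; any failure between them leaves the controller's view inconsistent with reality, which is exactly what the K8s-Job-based fix sidesteps.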
Anything else
No response
Are you willing to submit a PR?