Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] 409 conflicts cause two jobs creations #756

Closed
2 tasks done
kevin85421 opened this issue Nov 22, 2022 · 6 comments · Fixed by #1177
Closed
2 tasks done

[Bug] 409 conflicts cause two jobs creations #756

kevin85421 opened this issue Nov 22, 2022 · 6 comments · Fixed by #1177
Assignees
Labels
bug Something isn't working rayjob stability Pertains to basic infrastructure stability

Comments

@kevin85421
Copy link
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I submitted a RayJob to an existing RayCluster with clusterSelector and then deleted it. I repeated the process 6 times, and 2 (11:13:54 & 11:32:05) of them created two jobs rather than one job.

Screen Shot 2022-11-22 at 11 32 23 AM

I checked the log and found 409 conflicts occurred at both 11:13:54 & 11:32:05. The root cause is that SubmitJob() (Link) succeeded but r.Status().Update(ctx, rayJob) (Link) failed. So, the job submit again in the next iteration of reconciliation.

2022-11-22T19:13:54.262Z        ERROR   controller.rayjob       Reconciler error        {"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "rayjob-sample-2", "namespace": "default", "error": "combined error: <nil> Operation cannot be fulfilled on rayjobs.ray.io \"rayjob-sample-2\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "combined error: <nil> Operation cannot be fulfilled on rayjobs.ray.io \"rayjob-sample-2\": the object has been modified; please apply your changes to the latest version and try again\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).updateState\n\t/workspace/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/workspace/controllers/ray/rayjob_controller.go:202\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
.
.
.
2022-11-22T19:32:05.487Z        ERROR   controller.rayjob       Reconciler error        {"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "rayjob-sample-2", "namespace": "default", "error": "combined error: <nil> Operation cannot be fulfilled on rayjobs.ray.io \"rayjob-sample-2\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "combined error: <nil> Operation cannot be fulfilled on rayjobs.ray.io \"rayjob-sample-2\": the object has been modified; please apply your changes to the latest version and try again\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).updateState\n\t/workspace/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/workspace/controllers/ray/rayjob_controller.go:202\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

Reproduction script

Follow the section "Manual tests" in #735. You need to try multiple times to reproduce the bug.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Nov 22, 2022
@kevin85421 kevin85421 self-assigned this Nov 22, 2022
@DmitriGekhtman
Copy link
Collaborator

This is most unfortunate.
@architkulkarni do you think there's anything we could do to make job submission idempotent?
Maybe we could submit the job with a name? Submitting a job with the same name could either fail (preferred) or cancel the old job (this sounds harder).

@architkulkarni
Copy link
Contributor

It'll be hard to change the Ray Jobs API because it's GA already. I like your idea of submitting a job with a name; the RayJob operator should probably generate its own random names and it should manage the state and retries. Currently the Ray Jobs API fails the submit request if the name is already in use.

@kevin85421
Copy link
Member Author

The possible solution is to combine RayCluster creation and job execution into an atomic operation. For example, we can use postStart hook (Link) to execute the job.

@DmitriGekhtman
Copy link
Collaborator

The thing you have to be careful about with postStart hooks is that their execution is asynchronous relative to the container entrypoint (there's no ordering guarantee).

Related Slack discussion: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959

@kevin85421
Copy link
Member Author

The thing you have to be careful about with postStart hooks is that their execution is asynchronous relative to the container entrypoint (there's no ordering guarantee).

Related Slack discussion: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959

Thank you for this information!

@kevin85421
Copy link
Member Author

This issue is relieved a lot by #1000, but the PR does not fix the root cause.

architkulkarni added a commit that referenced this issue Jun 28, 2023
…g DashboardHTTPClient (#1177)

This PR changes the way job submission is handled. Prior to this PR, the job controller reconciliation loop would check the Status of the job, and based on that, decide whether to submit the job or not. This design had a bug that would sometimes result in duplicate job submissions; see #756 for a full description of the root cause (in a nutshell, the Status might not be updated to "Running" immediately.)

This PR fixes the issue because it no longer uses the Status field of the job to determine whether to submit the job upon each reconciliation. Instead, it creates a K8s job to submit the Ray job. A k8s job will not duplicated even if it's created multiple times.

The PR also exposes the pod template for the k8s job pod that submits the job so that the user can supply it in their RayJob manifest as needed. In the typical case, the user should not need to specify this, the default should be sufficient.

This is the first PR in a larger refactor of RayJob; this PR just contains a localized change in the submission path. See https://docs.google.com/document/d/1G--fKMCqp-M3T0qDPQy5eOQJQKRmubr7SyZQ_ms6Py4/edit for more details about the design.

Related issue number
Closes #756

---------

Signed-off-by: Archit Kulkarni <[email protected]>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
…g DashboardHTTPClient (ray-project#1177)

This PR changes the way job submission is handled. Prior to this PR, the job controller reconciliation loop would check the Status of the job, and based on that, decide whether to submit the job or not. This design had a bug that would sometimes result in duplicate job submissions; see ray-project#756 for a full description of the root cause (in a nutshell, the Status might not be updated to "Running" immediately.)

This PR fixes the issue because it no longer uses the Status field of the job to determine whether to submit the job upon each reconciliation. Instead, it creates a K8s job to submit the Ray job. A k8s job will not duplicated even if it's created multiple times.

The PR also exposes the pod template for the k8s job pod that submits the job so that the user can supply it in their RayJob manifest as needed. In the typical case, the user should not need to specify this, the default should be sufficient.

This is the first PR in a larger refactor of RayJob; this PR just contains a localized change in the submission path. See https://docs.google.com/document/d/1G--fKMCqp-M3T0qDPQy5eOQJQKRmubr7SyZQ_ms6Py4/edit for more details about the design.

Related issue number
Closes ray-project#756

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working rayjob stability Pertains to basic infrastructure stability
Projects
None yet
3 participants