
RayService: zero downtime update and healthcheck HA recovery #307

Merged Jun 25, 2022 (49 commits)

Conversation

brucez-anyscale
Contributor

@brucez-anyscale brucez-anyscale commented Jun 14, 2022

Why are these changes needed?

This PR supports:

  1. Zero-downtime RayCluster config updates
  2. RayCluster failure recovery with near-zero downtime

Design:
Whenever the config is updated, or the RayCluster has been unhealthy for a certain period, the controller creates a new RayCluster and sends it the Serve deployment request. Once the new cluster is ready to serve, traffic is switched over by updating the ingress and service.

The reconciler follows:

  1. Get RayService instance.
  2. Get or create the RayCluster: if a new RayCluster is needed, record a cluster name in the pending cluster status. If the pending cluster name is non-empty, create the RayCluster instance.
  3. Check the health of the RayCluster and its Serve deployments; deploy Serve deployments when needed.
  4. Create/update ingress or service when needed.
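The control flow above can be sketched as a small standalone program (a simplified, hypothetical outline; the types, names, and return strings are illustrative, not the actual controller code):

```go
package main

import "fmt"

// Simplified, hypothetical state for a RayService reconcile loop.
type rayService struct {
	activeCluster  string // cluster currently receiving traffic
	pendingCluster string // cluster being brought up for an update/recovery
}

// reconcile mirrors the steps described above: decide whether a new
// cluster is needed, then switch traffic once the pending cluster is ready.
func reconcile(svc *rayService, configChanged, activeHealthy, pendingReady bool) string {
	// Step 2: if the config changed or the active cluster is unhealthy,
	// assign a name to the pending cluster status and create it.
	if (configChanged || !activeHealthy) && svc.pendingCluster == "" {
		svc.pendingCluster = "rayservice-sample-raycluster-new" // name assumed for illustration
		return "created pending RayCluster " + svc.pendingCluster
	}
	// Steps 3/4: once the pending cluster serves successfully, promote it
	// and repoint the ingress/service (modeled here as a field swap).
	if svc.pendingCluster != "" && pendingReady {
		svc.activeCluster = svc.pendingCluster
		svc.pendingCluster = ""
		return "switched traffic to " + svc.activeCluster
	}
	return "no action"
}

func main() {
	svc := &rayService{activeCluster: "rayservice-sample-raycluster-old"}
	fmt.Println(reconcile(svc, true, true, false)) // config update: create pending cluster
	fmt.Println(reconcile(svc, false, true, true)) // pending cluster ready: switch traffic
	fmt.Println(svc.activeCluster)
}
```

Note how the old cluster keeps serving until the switch, which is what makes the update zero-downtime.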

Manual test

Tested in EKS with a RayCluster config update:

2022-06-15T21:44:41.850Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.850Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.865Z	INFO	rayservice-controller	Done reconcileRayCluster update status
2022-06-15T21:44:41.865Z	INFO	rayservice-controller	Enter next loop to create new ray cluster.
2022-06-15T21:44:41.866Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.879Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.881Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:41.896Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Service \"rayservice-sample-raycluster-9jdv6-head-svc\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:41.896Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.896Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.903Z	INFO	raycluster-controller	Pod Service created successfully	{"service name": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:41.904Z	INFO	raycluster-controller	reconcilePods 	{"creating head pod for cluster": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:41.904Z	INFO	RayCluster-Controller	Setting pod namespaces	{"namespace": "default"}
2022-06-15T21:44:41.904Z	INFO	RayCluster-Controller	Head pod container with index 0 identified as Ray container.
2022-06-15T21:44:41.904Z	INFO	raycluster-controller	createHeadPod	{"head pod with name": "rayservice-sample-raycluster-9jdv6-head-"}
2022-06-15T21:44:41.915Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Service \"rayservice-sample-raycluster-9jdv6-head-svc\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:41.915Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.916Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:42.080Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:42.080Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:42.081Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.081Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:42.095Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:43.082Z	ERROR	controllers.RayService	fail to update deployment	{"error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile
	/workspace/controllers/ray/rayservice_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:43.098Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:43.098Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:43.099Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:44.053Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:44.073Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.139Z	ERROR	controllers.RayService	fail to update deployment	{"error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile
	/workspace/controllers/ray/rayservice_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:44.148Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "combined error: Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "combined error: Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).updateState\n\t/workspace/controllers/ray/rayservice_controller.go:254\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/workspace/controllers/ray/rayservice_controller.go:162\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
[... repeated "reconciling RayService" / "fail to update deployment: connection refused" entries elided (21:44:44–21:44:46) while the new head pod was starting ...]
2022-06-15T21:44:51.627Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:51.627Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:54.348Z	INFO	rayservice-controller	Check serve health	{"isHealthy": true}
2022-06-15T21:44:54.398Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:54.399Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:54.571Z	INFO	rayservice-controller	Check serve health	{"isHealthy": true}

Status snapshot

status:
  activeServiceStatus:
    dashboardStatus:
      healthLastUpdateTime: "2022-06-17T04:34:33Z"
      isHealthy: true
      lastUpdateTime: "2022-06-17T04:34:33Z"
    rayClusterName: rayservice-sample-raycluster-n87zt
    rayClusterStatus:
      lastUpdateTime: "2022-06-17T04:34:13Z"
    serveDeploymentStatuses:
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: shallow
      status: HEALTHY
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: deep
      status: HEALTHY
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: one
      status: HEALTHY
  pendingServiceStatus:
    dashboardStatus:
      healthLastUpdateTime: "2022-06-17T04:34:33Z"
      isHealthy: true
      lastUpdateTime: "2022-06-17T04:34:33Z"
    rayClusterName: rayservice-sample-raycluster-gv7cc
    rayClusterStatus: {}
    serveDeploymentStatuses:
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: shallow
      status: UPDATING
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: deep
      status: UPDATING
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: one
      status: UPDATING
  serviceStatus: Running

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@DmitriGekhtman
Collaborator

PR description, please :)
Also, shall we add more reviewers?

@brucez-anyscale brucez-anyscale marked this pull request as draft June 15, 2022 05:13
@brucez-anyscale
Contributor Author

PR description, please :) Also, shall we add more reviewers?

Still under draft mode, want to collect some early feedback, if you have time. Thanks!

@brucez-anyscale brucez-anyscale requested a review from simon-mo June 17, 2022 06:08
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment


Left some comments, mostly questions/discussion.
Didn't quite make it through to reading the entire PR.

These changes look pretty involved -- could you explain the overall control flow a bit more in the PR description?

ray-operator/apis/ray/v1alpha1/rayservice_types.go (outdated diff)
// Pending Service Status indicates a RayCluster will be created or is under creating.
PendingServiceStatus RayServiceStatus `json:"pendingServiceStatus,omitempty"`
// ServiceStatus indicates the current RayService status.
ServiceStatus ServiceStatus `json:"serviceStatus,omitempty"`
Collaborator


ServiceStatus sounds very similar to RayServiceStatus, which is a little confusing. Not sure how to resolve.

(Might have better suggestions when I read through the rest of the PR.)

ray-operator/controllers/ray/rayservice_controller.go
@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Jun 22, 2022

-- could you explain the overall control flow a bit more in the PR description?

We'll eventually also want to explain that in detail in the docs.


// GenerateRayClusterName generates a ray cluster name from the ray service name.
func GenerateRayClusterName(serviceName string) string {
	return fmt.Sprintf("%s%s%s", serviceName, RayClusterSuffix, rand.String(5))
}
Collaborator

I'm slightly concerned about naming collisions due to the truncation that happens all over the place (see `func CheckName(s string) string`), but maybe this is unlikely to cause issues.

Contributor Author

Yes, this is a point where we should be careful.
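To make the collision concern concrete, here is a minimal, self-contained Go sketch of the naming scheme. This is an illustration, not the controller's exact code: randString stands in for k8s.io/apimachinery's rand.String, and the 63-character cap mirrors the Kubernetes object-name limit that motivates the truncation.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	rayClusterSuffix = "-raycluster-"
	maxNameLength    = 63 // Kubernetes object names are limited to 63 characters
)

// randString returns a pseudo-random lowercase alphanumeric string of length n,
// standing in for k8s.io/apimachinery's rand.String.
func randString(n int) string {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	for i := range b {
		b[i] = charset[rand.Intn(len(charset))]
	}
	return string(b)
}

// generateRayClusterName appends a random suffix to the service name and
// truncates the result to the Kubernetes name limit. Truncation cuts from the
// end, so a long serviceName can swallow part or all of the random suffix.
func generateRayClusterName(serviceName string) string {
	name := fmt.Sprintf("%s%s%s", serviceName, rayClusterSuffix, randString(5))
	if len(name) > maxNameLength {
		name = name[:maxNameLength]
	}
	return name
}

func main() {
	fmt.Println(generateRayClusterName("rayservice-sample"))
	// A long service name demonstrates the truncation problem:
	long := generateRayClusterName("a-very-long-rayservice-name-that-approaches-the-63-character-limit")
	fmt.Println(long, len(long))
}
```

Because truncation removes characters from the end, a sufficiently long service name erases the random suffix entirely, so two generated names for the same service would collide.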

Comment on lines +194 to +196
Owns(&rayv1alpha1.RayCluster{}).
Owns(&corev1.Service{}).
Owns(&networkingv1.Ingress{}).
Collaborator

Does this stream Service and Ingress updates into the controller?

Collaborator

Also, what about the configmap?

Contributor Author

We do not have a configmap. If the service or ingress updates, the controller will know.

Collaborator

@DmitriGekhtman DmitriGekhtman Jun 24, 2022

I think the idea is that we want to detect, say, if someone deletes the RayService's K8s Service.
There's no config map :)

// BuildIngressForRayService Builds the ingress for head service dashboard for RayService.
// This is used to expose dashboard for external traffic.
// RayService controller updates the ingress whenever a new RayCluster serves the traffic.
func BuildIngressForRayService(service rayiov1alpha1.RayService, cluster rayiov1alpha1.RayCluster) (*networkingv1.Ingress, error) {
Collaborator

Do we need an ingress here? A Service resource alone is enough to reach the port both internally and externally. Shouldn't we expect users to build their own ingress?

Collaborator

I think the idea was to match the behavior of the RayCluster controller, which does have the ability to make an ingress.

Contributor Author

Right now it is the same as RayCluster, as Dmitri said.
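For concreteness, an Ingress of roughly the shape the controller builds for the dashboard might look like the following. The names, namespace, path, and head-service suffix here are illustrative assumptions, not the controller's exact output:

```yaml
# Illustrative only: names and ports are assumptions. The RayService
# controller repoints this backend at the new head service when a
# replacement RayCluster becomes ready to serve.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rayservice-sample-head-ingress
  namespace: default
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rayservice-sample-raycluster-xxxxx-head-svc
                port:
                  number: 8265   # Ray dashboard port
```

Updating only the backend service name during a cluster switch is what lets external dashboard traffic move to the new cluster without clients changing the URL they hit.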

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Jun 24, 2022

I think it looks good overall.

Could you add even more function descriptions -- for example, if there are functions called checkIfXXXneeded, the comment on the function should explain in English what the condition is for XXX to be needed.
It would actually be great if every function had a description, especially given that the logic is fairly complex.

Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

LGTM, thanks!

@brucez-anyscale brucez-anyscale merged commit 0ba6f88 into master Jun 25, 2022
Comment on lines +525 to +536
serveStatuses.ApplicationStatus.LastUpdateTime = &timeNow
serveStatuses.ApplicationStatus.HealthLastUpdateTime = &timeNow
if serveStatuses.ApplicationStatus.Status != "HEALTHY" {
	// Check the previous app status.
	if rayServiceServeStatus.ApplicationStatus.Status != "HEALTHY" {
		serveStatuses.ApplicationStatus.HealthLastUpdateTime = rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime

		if rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime != nil && time.Since(rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime.Time).Seconds() > ServeDeploymentUnhealthySecondThreshold {
			isHealthy = false
		}
	}
}
Contributor Author

@shrekris-anyscale Thanks for the discussion. Here is the buggy code.

@DmitriGekhtman DmitriGekhtman deleted the brucez/improveHA branch December 3, 2022 00:01
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ject#307)

* draft for ha

* import fmt

* debug ingress

* Draft service

* update

* fix

* Update service logic

* update

* update

* Logs

* update

* debug

* Update

* Update

* Update

* update

* Fix cluster start flaky issue

* update

* Update service and ingress

* update rbac

* Draft v1

* Update

* address comments

* Address comments and refactor codes

* update

* Fix lint issue

* update

* Fix unit tests

* goImport

* Update unit tests

* Implement unit tests

* Change preparing to pending

* goimports

* update

* Improve the pr to show both statuses

* Improve the pr to show both statuses

* update to align with latest serve status

* update

* Fix ut and imports

* update

* update

* address comments

* update

* update delete ray cluster logic

* update delete ray cluster logic

* update

* address comments

* update