
RayService: zero downtime update and healthcheck HA recovery #307

Merged Jun 25, 2022 (49 commits)

Conversation

brucez-anyscale
Contributor

@brucez-anyscale brucez-anyscale commented Jun 14, 2022

Why are these changes needed?

This PR supports:

  1. Zero-downtime RayCluster config updates
  2. RayCluster failure recovery with near-zero downtime

Design:
Whenever the config is updated, or the RayCluster has been unhealthy for a certain period, the controller creates a new RayCluster and sends it the Serve deployment request. Once the new cluster is ready to serve, traffic is switched over by updating the ingress and service.

The reconciler follows:

  1. Get RayService instance.
  2. Get or create the RayCluster: if a new RayCluster is needed, record a cluster name in the pending cluster status. If the pending cluster name is non-empty, create the RayCluster instance.
  3. Check the health of the RayCluster and its Serve deployments; deploy Serve deployments when needed.
  4. Create/update ingress or service when needed.
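The control flow above can be sketched as a small standalone program (a simplified, hypothetical outline; the types, names, and return strings are illustrative, not the actual controller code):

```go
package main

import "fmt"

// Simplified, hypothetical state for a RayService reconcile loop.
type rayService struct {
	activeCluster  string // cluster currently receiving traffic
	pendingCluster string // cluster being brought up for an update/recovery
}

// reconcile mirrors the steps described above: decide whether a new
// cluster is needed, then switch traffic once the pending cluster is ready.
func reconcile(svc *rayService, configChanged, activeHealthy, pendingReady bool) string {
	// Step 2: if the config changed or the active cluster is unhealthy,
	// assign a name to the pending cluster status and create it.
	if (configChanged || !activeHealthy) && svc.pendingCluster == "" {
		svc.pendingCluster = "rayservice-sample-raycluster-new" // name assumed for illustration
		return "created pending RayCluster " + svc.pendingCluster
	}
	// Steps 3/4: once the pending cluster serves successfully, promote it
	// and repoint the ingress/service (modeled here as a field swap).
	if svc.pendingCluster != "" && pendingReady {
		svc.activeCluster = svc.pendingCluster
		svc.pendingCluster = ""
		return "switched traffic to " + svc.activeCluster
	}
	return "no action"
}

func main() {
	svc := &rayService{activeCluster: "rayservice-sample-raycluster-old"}
	fmt.Println(reconcile(svc, true, true, false)) // config update: create pending cluster
	fmt.Println(reconcile(svc, false, true, true)) // pending cluster ready: switch traffic
	fmt.Println(svc.activeCluster)
}
```

Note how the old cluster keeps serving until the switch, which is what makes the update zero-downtime.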

Manual test

Tested in EKS with a RayCluster config update:

2022-06-15T21:44:41.850Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.850Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.865Z	INFO	rayservice-controller	Done reconcileRayCluster update status
2022-06-15T21:44:41.865Z	INFO	rayservice-controller	Enter next loop to create new ray cluster.
2022-06-15T21:44:41.866Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.879Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.881Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:41.896Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Service \"rayservice-sample-raycluster-9jdv6-head-svc\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:41.896Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.896Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:41.903Z	INFO	raycluster-controller	Pod Service created successfully	{"service name": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:41.904Z	INFO	raycluster-controller	reconcilePods 	{"creating head pod for cluster": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:41.904Z	INFO	RayCluster-Controller	Setting pod namespaces	{"namespace": "default"}
2022-06-15T21:44:41.904Z	INFO	RayCluster-Controller	Head pod container with index 0 identified as Ray container.
2022-06-15T21:44:41.904Z	INFO	raycluster-controller	createHeadPod	{"head pod with name": "rayservice-sample-raycluster-9jdv6-head-"}
2022-06-15T21:44:41.915Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Service \"rayservice-sample-raycluster-9jdv6-head-svc\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:41.915Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:41.916Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:42.080Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:42.080Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:42.081Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.081Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:42.095Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:42.095Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:43.082Z	ERROR	controllers.RayService	fail to update deployment	{"error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile
	/workspace/controllers/ray/rayservice_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:43.098Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:43.098Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:43.099Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:44.053Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.053Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "rayservice-sample-raycluster-9jdv6"}
2022-06-15T21:44:44.073Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "rayservice-sample-raycluster-9jdv6-head-svc"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.073Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "rayservice-sample-raycluster-9jdv6-head-dmmcw"}
2022-06-15T21:44:44.139Z	ERROR	controllers.RayService	fail to update deployment	{"error": "Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile
	/workspace/controllers/ray/rayservice_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-06-15T21:44:44.148Z	ERROR	controller.rayservice	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayService", "name": "rayservice-sample", "namespace": "default", "error": "combined error: Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "combined error: Put \"http://rayservice-sample-raycluster-9jdv6-head-svc.default.svc.cluster.local:8265/api/serve/deployments/\": dial tcp 10.100.36.128:8265: connect: connection refused Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).updateState\n\t/workspace/controllers/ray/rayservice_controller.go:254\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/workspace/controllers/ray/rayservice_controller.go:162\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
[... repeated "reconciling RayService" / "fail to update deployment: connection refused" entries elided (21:44:44–21:44:46) while the new head pod was starting ...]
2022-06-15T21:44:51.627Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:51.627Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:54.348Z	INFO	rayservice-controller	Check serve health	{"isHealthy": true}
2022-06-15T21:44:54.398Z	INFO	rayservice-controller	reconciling RayService	{"service NamespacedName": "default/rayservice-sample"}
2022-06-15T21:44:54.399Z	INFO	rayservice-controller	Done reconcileRayCluster
2022-06-15T21:44:54.571Z	INFO	rayservice-controller	Check serve health	{"isHealthy": true}

Status snapshot

status:
  activeServiceStatus:
    dashboardStatus:
      healthLastUpdateTime: "2022-06-17T04:34:33Z"
      isHealthy: true
      lastUpdateTime: "2022-06-17T04:34:33Z"
    rayClusterName: rayservice-sample-raycluster-n87zt
    rayClusterStatus:
      lastUpdateTime: "2022-06-17T04:34:13Z"
    serveDeploymentStatuses:
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: shallow
      status: HEALTHY
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: deep
      status: HEALTHY
    - healthLastUpdateTime: "2022-06-17T04:34:33Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: one
      status: HEALTHY
  pendingServiceStatus:
    dashboardStatus:
      healthLastUpdateTime: "2022-06-17T04:34:33Z"
      isHealthy: true
      lastUpdateTime: "2022-06-17T04:34:33Z"
    rayClusterName: rayservice-sample-raycluster-gv7cc
    rayClusterStatus: {}
    serveDeploymentStatuses:
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: shallow
      status: UPDATING
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: deep
      status: UPDATING
    - healthLastUpdateTime: "2022-06-17T04:34:31Z"
      lastUpdateTime: "2022-06-17T04:34:33Z"
      name: one
      status: UPDATING
  serviceStatus: Running

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@DmitriGekhtman
Collaborator

PR description, please :)
Also, shall we add more reviewers?

@brucez-anyscale brucez-anyscale marked this pull request as draft June 15, 2022 05:13
@brucez-anyscale
Contributor Author

PR description, please :) Also, shall we add more reviewers?

Still under draft mode, want to collect some early feedback, if you have time. Thanks!

@brucez-anyscale brucez-anyscale requested a review from simon-mo June 17, 2022 06:08
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment


Left some comments, mostly questions/discussion.
Didn't quite make it through to reading the entire PR.

These changes look pretty involved -- could you explain the overall control flow a bit more in the PR description?

ray-operator/apis/ray/v1alpha1/rayservice_types.go (outdated diff)
// Pending Service Status indicates a RayCluster will be created or is under creating.
PendingServiceStatus RayServiceStatus `json:"pendingServiceStatus,omitempty"`
// ServiceStatus indicates the current RayService status.
ServiceStatus ServiceStatus `json:"serviceStatus,omitempty"`
Collaborator


ServiceStatus sounds very similar to RayServiceStatus, which is a little confusing. Not sure how to resolve.

(Might have better suggestions when I read through the rest of the PR.)

ray-operator/controllers/ray/rayservice_controller.go
@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Jun 22, 2022

-- could you explain the overall control flow a bit more in the PR description?

We'll eventually also want to explain that in detail in the docs.


// GenerateRayClusterName generates a ray cluster name from the ray service name.
func GenerateRayClusterName(serviceName string) string {
	return fmt.Sprintf("%s%s%s", serviceName, RayClusterSuffix, rand.String(5))
}
Collaborator

I'm slightly concerned about naming collisions due to the truncation that happens all over the place (see `func CheckName(s string) string`), but maybe this is unlikely to cause issues.

Contributor Author

Yes, this is a point where we should be careful.
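To make the collision concern concrete, here is a minimal, self-contained Go sketch of the naming scheme. This is an illustration, not the controller's exact code: randString stands in for k8s.io/apimachinery's rand.String, and the 63-character cap mirrors the Kubernetes object-name limit that motivates the truncation.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	rayClusterSuffix = "-raycluster-"
	maxNameLength    = 63 // Kubernetes object names are limited to 63 characters
)

// randString returns a pseudo-random lowercase alphanumeric string of length n,
// standing in for k8s.io/apimachinery's rand.String.
func randString(n int) string {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	for i := range b {
		b[i] = charset[rand.Intn(len(charset))]
	}
	return string(b)
}

// generateRayClusterName appends a random suffix to the service name and
// truncates the result to the Kubernetes name limit. Truncation cuts from the
// end, so a long serviceName can swallow part or all of the random suffix.
func generateRayClusterName(serviceName string) string {
	name := fmt.Sprintf("%s%s%s", serviceName, rayClusterSuffix, randString(5))
	if len(name) > maxNameLength {
		name = name[:maxNameLength]
	}
	return name
}

func main() {
	fmt.Println(generateRayClusterName("rayservice-sample"))
	// A long service name demonstrates the truncation problem:
	long := generateRayClusterName("a-very-long-rayservice-name-that-approaches-the-63-character-limit")
	fmt.Println(long, len(long))
}
```

Because truncation removes characters from the end, a sufficiently long service name erases the random suffix entirely, so two generated names for the same service would collide.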

Comment on lines +194 to +196
Owns(&rayv1alpha1.RayCluster{}).
Owns(&corev1.Service{}).
Owns(&networkingv1.Ingress{}).
Collaborator

Does this stream Service and Ingress updates into the controller?

Collaborator

Also, what about the configmap?

Contributor Author

We do not have a configmap. If the service or ingress updates, the controller will know.

Collaborator

@DmitriGekhtman DmitriGekhtman Jun 24, 2022

I think the idea is that we want to detect, say, if someone deletes the RayService's K8s Service.
There's no config map :)

// BuildIngressForRayService Builds the ingress for head service dashboard for RayService.
// This is used to expose dashboard for external traffic.
// RayService controller updates the ingress whenever a new RayCluster serves the traffic.
func BuildIngressForRayService(service rayiov1alpha1.RayService, cluster rayiov1alpha1.RayCluster) (*networkingv1.Ingress, error) {
Collaborator

Do we need an ingress here? A Service resource alone is enough to reach the port both internally and externally. Shouldn't we expect users to build their own ingress?

Collaborator

I think the idea was to match the behavior of the RayCluster controller, which does have the ability to make an ingress.

Contributor Author

Right now it is the same as RayCluster, as Dmitri said.
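For concreteness, an Ingress of roughly the shape the controller builds for the dashboard might look like the following. The names, namespace, path, and head-service suffix here are illustrative assumptions, not the controller's exact output:

```yaml
# Illustrative only: names and ports are assumptions. The RayService
# controller repoints this backend at the new head service when a
# replacement RayCluster becomes ready to serve.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rayservice-sample-head-ingress
  namespace: default
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rayservice-sample-raycluster-xxxxx-head-svc
                port:
                  number: 8265   # Ray dashboard port
```

Updating only the backend service name during a cluster switch is what lets external dashboard traffic move to the new cluster without clients changing the URL they hit.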

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Jun 24, 2022

I think it looks good overall.

Could you add even more function descriptions -- for example, if there are functions called checkIfXXXneeded, the comment on the function should explain in English what the condition is for XXX to be needed.
It would actually be great if every function had a description, especially given that the logic is fairly complex.

Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

LGTM, thanks!

@brucez-anyscale brucez-anyscale merged commit 0ba6f88 into master Jun 25, 2022
Comment on lines +525 to +536
serveStatuses.ApplicationStatus.LastUpdateTime = &timeNow
serveStatuses.ApplicationStatus.HealthLastUpdateTime = &timeNow
if serveStatuses.ApplicationStatus.Status != "HEALTHY" {
	// Check the previous app status.
	if rayServiceServeStatus.ApplicationStatus.Status != "HEALTHY" {
		serveStatuses.ApplicationStatus.HealthLastUpdateTime = rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime

		if rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime != nil && time.Since(rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime.Time).Seconds() > ServeDeploymentUnhealthySecondThreshold {
			isHealthy = false
		}
	}
}
Contributor Author

@shrekris-anyscale Thanks for the discussion. Here is the buggy code.

@DmitriGekhtman DmitriGekhtman deleted the brucez/improveHA branch December 3, 2022 00:01
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ject#307)

* draft for ha

* import fmt

* debug ingress

* Draft service

* update

* fix

* Update service logic

* update

* update

* Logs

* update

* debug

* Update

* Update

* Update

* update

* Fix cluster start flaky issue

* update

* Update service and ingress

* update rbac

* Draft v1

* Update

* address comments

* Address comments and refactor codes

* update

* Fix lint issue

* update

* Fix unit tests

* goImport

* Update unit tests

* Implement unit tests

* Change preparing to pending

* goimports

* update

* Improve the pr to show both statuses

* Improve the pr to show both statuses

* update to align with latest serve status

* update

* Fix ut and imports

* update

* update

* address comments

* update

* update delete ray cluster logic

* update delete ray cluster logic

* update

* address comments

* update