RayService: zero downtime update and healthcheck HA recovery #307
Conversation
PR description, please :)
This is still in draft mode; I want to collect some early feedback if you have time. Thanks!
Left some comments, mostly questions/discussion.
I didn't quite make it through the entire PR.
These changes look pretty involved -- could you explain the overall control flow a bit more in the PR description?
// PendingServiceStatus indicates a RayCluster will be created or is being created.
PendingServiceStatus RayServiceStatus `json:"pendingServiceStatus,omitempty"`
// ServiceStatus indicates the current RayService status.
ServiceStatus ServiceStatus `json:"serviceStatus,omitempty"`
ServiceStatus sounds very similar to RayServiceStatus, which is a little confusing. Not sure how to resolve. (Might have better suggestions when I read through the rest of the PR.)
We'll eventually also want to explain that in detail in the docs.
// GenerateRayClusterName generates a RayCluster name from the RayService name.
func GenerateRayClusterName(serviceName string) string {
	return fmt.Sprintf("%s%s%s", serviceName, RayClusterSuffix, rand.String(5))
}
I'm slightly concerned about naming collisions due to the truncation that happens all over the place (see func CheckName(s string) string), but maybe this is unlikely to cause issues.
Yes, this is a point where we should be careful.
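For illustration, here is a hypothetical variant that guards against the collision mode mentioned above by truncating the base name before appending the random suffix; the 63-character limit, constant names, and helper name are assumptions for this sketch, not the actual CheckName behavior.

import (
	"k8s.io/apimachinery/pkg/util/rand"
)

// Assumed limit for illustration (many generated Kubernetes names and
// label values are capped at 63 characters).
const maxGeneratedNameLen = 63

// If truncation happens after the random suffix is appended, two long
// service names can collapse to the same cluster name. Truncating the
// base name first keeps the suffix intact.
func generateRayClusterNameSafe(serviceName string) string {
	suffix := RayClusterSuffix + rand.String(5)
	if len(serviceName)+len(suffix) > maxGeneratedNameLen {
		serviceName = serviceName[:maxGeneratedNameLen-len(suffix)]
	}
	return serviceName + suffix
}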
Owns(&rayv1alpha1.RayCluster{}).
Owns(&corev1.Service{}).
Owns(&networkingv1.Ingress{}).
Does this stream Service and Ingress updates into the controller?
Also, what about the ConfigMap?
We do not have a ConfigMap. If the Service or Ingress is updated, the controller will know.
I think the idea is that we want to detect, say, if someone deletes the RayService's K8s Service.
There's no config map :)
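For reference, a minimal sketch of the controller-runtime wiring these Owns() calls belong to; the reconciler name, For() type, and import paths are assumptions based on context, not copied from the PR.

import (
	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	// rayv1alpha1: the KubeRay API package (exact import path omitted here).
)

// SetupWithManager registers the controller. Each Owns() call adds a watch
// on resources created and owned by this controller, so a change to (or
// deletion of) the RayCluster, Service, or Ingress re-triggers Reconcile
// on the owning RayService.
func (r *RayServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayService{}).
		Owns(&rayv1alpha1.RayCluster{}).
		Owns(&corev1.Service{}).
		Owns(&networkingv1.Ingress{}).
		Complete(r)
}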
// BuildIngressForRayService builds the ingress to the head service dashboard for a RayService.
// This is used to expose the dashboard to external traffic.
// The RayService controller updates the ingress whenever a new RayCluster serves the traffic.
func BuildIngressForRayService(service rayiov1alpha1.RayService, cluster rayiov1alpha1.RayCluster) (*networkingv1.Ingress, error) {
Do we need an Ingress here? A Service resource alone is enough to reach the port both internally and externally. We should expect users to build their own Ingress, right?
I think the idea was to match the behavior of the RayCluster controller, which does have the ability to make an ingress.
Right now it is the same as the RayCluster controller, as Dmitri said.
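For illustration, a minimal sketch of the kind of Ingress such a builder can return: one rule routing "/" to the head Service's dashboard port. The helper name and field values are assumptions, not the PR's actual implementation.

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildDashboardIngressSketch returns a minimal Ingress that routes all
// paths to the given head Service on the dashboard port.
func buildDashboardIngressSketch(name, namespace, headSvcName string, dashboardPort int32) *networkingv1.Ingress {
	pathType := networkingv1.PathTypePrefix
	return &networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: networkingv1.IngressSpec{
			Rules: []networkingv1.IngressRule{{
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{{
							Path:     "/",
							PathType: &pathType,
							Backend: networkingv1.IngressBackend{
								Service: &networkingv1.IngressServiceBackend{
									Name: headSvcName,
									Port: networkingv1.ServiceBackendPort{Number: dashboardPort},
								},
							},
						}},
					},
				},
			}},
		},
	}
}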
I think it looks good overall. Could you add even more function descriptions -- for example, if there are functions called
LGTM, thanks!
serveStatuses.ApplicationStatus.LastUpdateTime = &timeNow
serveStatuses.ApplicationStatus.HealthLastUpdateTime = &timeNow
if serveStatuses.ApplicationStatus.Status != "HEALTHY" {
	// Check the previous app status.
	if rayServiceServeStatus.ApplicationStatus.Status != "HEALTHY" {
		serveStatuses.ApplicationStatus.HealthLastUpdateTime = rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime

		if rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime != nil && time.Since(rayServiceServeStatus.ApplicationStatus.HealthLastUpdateTime.Time).Seconds() > ServeDeploymentUnhealthySecondThreshold {
			isHealthy = false
		}
	}
}
@shrekris-anyscale Thanks for the discussion. Here is the buggy code.
Why are these changes needed?
This PR supports zero downtime updates for RayService and health check based HA recovery.
Design:
Whenever the config is updated or the RayCluster has been unhealthy for a certain amount of time, create a new RayCluster and send the Serve deployment request. Once the new cluster is ready to serve, switch the traffic by updating the Ingress and Service.
The reconciler follows:
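For illustration, here is a hypothetical outline of how such a two-cluster reconcile pass could be structured; the helper names (getActiveAndPendingClusters, promotePendingToActive, and the rest) are placeholders for this sketch, not the actual functions in this PR.

// Hypothetical outline of one reconcile pass for zero downtime updates.
// All helpers below are placeholders used only to illustrate the flow.
func (r *RayServiceReconciler) reconcileSketch(ctx context.Context, svc *rayv1alpha1.RayService) error {
	active, pending := r.getActiveAndPendingClusters(ctx, svc)

	// When the Serve config changes, or the active cluster has been
	// unhealthy for longer than the threshold, start a pending cluster.
	if pending == nil && (configChanged(svc, active) || unhealthyTooLong(active)) {
		pending = r.createPendingCluster(ctx, svc)
	}

	// Deploy the Serve config to the pending cluster; once it reports
	// healthy and ready to serve, switch traffic over.
	if pending != nil && r.deployServeConfig(ctx, svc, pending) && r.isReadyToServe(ctx, pending) {
		// Repoint the Service/Ingress at the new cluster, promote it to
		// active, and delete the old cluster.
		if err := r.updateServiceAndIngress(ctx, svc, pending); err != nil {
			return err
		}
		r.promotePendingToActive(svc, pending)
		r.deleteCluster(ctx, active)
	}

	return r.updateStatus(ctx, svc)
}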
Manual test
Tested in EKS with a RayCluster config update.
Status snapshot
Related issue number
Checks