The issue with the flyteadmin liveness probe timing out and triggering pod restarts could be related to the current configuration of the liveness probe. The default configuration uses both an HTTP health check and a gRPC health probe. This combined check might be causing the timeouts.
You might consider simplifying the liveness probe to use only the HTTP health check, or increasing timeoutSeconds (1 second by default in Kubernetes) together with initialDelaySeconds and periodSeconds so the probe has more time to succeed. Here is an example of a simplified liveness probe:
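A minimal sketch of such a probe, assuming flyteadmin serves its HTTP health endpoint at /healthcheck on port 8088 (both are assumptions here, so check them against your own deployment). The timing values are illustrative rather than recommendations:

```yaml
# Sketch only: HTTP-only liveness probe for the flyteadmin container.
# Path and port are assumptions; confirm them for your deployment.
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8088
  initialDelaySeconds: 30   # illustrative value
  periodSeconds: 10         # illustrative value
  timeoutSeconds: 5         # Kubernetes default is 1 second
  failureThreshold: 3       # Kubernetes default
```

Dropping the exec-based check means the kubelet performs the HTTP request itself, so no shell or grpc_health_probe process is spawned inside the container on each probe. The trade-off is that the gRPC endpoint is no longer covered by the liveness check.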
Describe the bug
In our PROD environment, the flyteadmin pod often gets restarted due to a Liveness probe timeout:
We host Flyte on EKS (v1.25.16-eks-3af4770) and use the default Readiness and Liveness config for flyteadmin:
We suspect that the probe hangs while running grpc_health_probe, as we have also observed that Kubernetes struggles with pod termination (in some cases, it takes 10-15 minutes to restart the pod).
I have checked the pod resource usage, and it seems fine:
We are running Flyte v11.0 with this flyteadmin config:
Any ideas / suggestions are welcome!
Expected behavior
Liveness probe succeeded
Additional context to reproduce
No response
Screenshots
No response