
[BUG] flyteadmin Liveness probe is often timing out and triggers a pod restart #5539

Closed
rambrus opened this issue Jul 4, 2024 · 2 comments

Labels: bug (Something isn't working), untriaged (This issue has not yet been looked at by the Maintainers)

Comments


rambrus commented Jul 4, 2024

Describe the bug

In our PROD environment, flyteadmin pod often gets restarted due to Liveness probe timeout:
[screenshot: liveness probe timeout events and pod restarts]

We host Flyte on EKS (v1.25.16-eks-3af4770) and use the default Readiness and Liveness config for flyteadmin:

        readinessProbe:
          exec:
            command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
          initialDelaySeconds: 15
        livenessProbe:
          exec:
            command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
          initialDelaySeconds: 20
          periodSeconds: 5
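
Fields we do not set fall back to the Kubernetes probe defaults, so (as far as I understand) the effective liveness behaviour is roughly:

        livenessProbe:
          exec:
            command: [ ... ]          # same curl + grpc_health_probe command as above
          initialDelaySeconds: 20
          periodSeconds: 5
          timeoutSeconds: 1           # Kubernetes default: the exec command must finish within 1 second
          failureThreshold: 3         # Kubernetes default: 3 consecutive failures restart the container
          successThreshold: 1         # Kubernetes default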

We suspect that the probe hangs while running grpc_health_probe, as we have also observed that Kubernetes struggles to terminate the pod (in some cases it takes 10-15 minutes to restart it).
I have checked the pod resource usage and it seems fine:
[screenshot: flyteadmin pod resource usage]
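
To check whether the gRPC probe itself is the part that hangs, the two probe commands can also be run by hand inside the pod (the deployment name and namespace below are placeholders, adjust to your release):

# HTTP health check, capped at 2 seconds
kubectl exec -n flyte deploy/flyteadmin -- \
  sh -c 'curl -s -o /dev/null -m 2 -w "%{http_code}\n" http://127.0.0.1:8088/healthcheck'

# gRPC health check with explicit connect/RPC timeouts
kubectl exec -n flyte deploy/flyteadmin -- \
  grpc_health_probe -addr=:8089 -connect-timeout=2s -rpc-timeout=2s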

We are running Flyte v1.11.0 with this flyteadmin config:

flyteadmin:
  replicaCount: null
  image:
    tag: v1.11.0-0.0.14
  maxParallelism: 25
  resources:
    limits:
      cpu: 250m
      ephemeral-storage: 5Gi
      memory: 500Mi
    requests:
      cpu: 10m
      ephemeral-storage: 5Gi
      memory: 50Mi
  hpa:
    enabled: true
    minReplicas: 4
  vault:
    ...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ml-platform-flyte-flyteadmin
          topologyKey: kubernetes.io/hostname

Any ideas / suggestions are welcome!

Expected behavior

The liveness probe succeeds consistently and the pod is not restarted.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
rambrus added the bug and untriaged labels on Jul 4, 2024

runllm bot commented Jul 4, 2024

The issue with the flyteadmin liveness probe timing out and triggering pod restarts could be related to the current configuration of the liveness probe. The default configuration uses both an HTTP health check and a gRPC health probe. This combined check might be causing the timeouts.

According to the flyte_helm_generated.yaml, the liveness probe is configured as follows:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
  initialDelaySeconds: 20
  periodSeconds: 5

You might consider simplifying the liveness probe to use only the HTTP health check or increasing the initialDelaySeconds and periodSeconds to give the probe more time to succeed. Here is an example of a simplified liveness probe:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;"]
  initialDelaySeconds: 30
  periodSeconds: 10
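
Alternatively, if you want to keep the combined HTTP + gRPC check, you could raise the probe's timeout explicitly, since exec probes default to timeoutSeconds: 1; for example:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5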

Would you like more details on how to implement this change or further troubleshooting steps?



kumare3 commented Jul 4, 2024

@rambrus your problem is the way you have allocated resources to flyteadmin:

resources:
    limits:
      cpu: 250m
      ephemeral-storage: 5Gi
      memory: 500Mi
    requests:
      cpu: 10m
      ephemeral-storage: 5Gi
      memory: 50Mi

You have a 0.25 CPU limit, a 0.01 CPU request, and only 50 MB of memory requested. This is unacceptable for a production deployment for a company like Expedia :).

Please give it at least 2 CPUs and 2-4 GB of memory.
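
For example, something along these lines (numbers are illustrative, tune them to your load):

flyteadmin:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
      ephemeral-storage: 5Gi
    limits:
      cpu: "2"
      memory: 4Gi
      ephemeral-storage: 5Gi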

cc @davidmirror-ops maybe we should add a comment to the Helm chart resources section asking users to adjust these values.
