
[BUG] flyteadmin Liveness probe is often timing out and triggers a pod restart #5539

Closed
rambrus opened this issue Jul 4, 2024 · 2 comments

Labels: bug (Something isn't working), untriaged (This issue has not yet been looked at by the Maintainers)

Comments


rambrus commented Jul 4, 2024

Describe the bug

In our PROD environment, flyteadmin pod often gets restarted due to Liveness probe timeout:
[screenshot: liveness probe timeout events and pod restarts]

We host Flyte on EKS (v1.25.16-eks-3af4770) and use the default Readiness and Liveness config for flyteadmin:

        readinessProbe:
          exec:
            command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
          initialDelaySeconds: 15
        livenessProbe:
          exec:
            command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
          initialDelaySeconds: 20
          periodSeconds: 5
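
Fields we do not set fall back to the Kubernetes probe defaults, so (as far as I understand) the effective liveness behaviour is roughly:

        livenessProbe:
          exec:
            command: [ ... ]          # same curl + grpc_health_probe command as above
          initialDelaySeconds: 20
          periodSeconds: 5
          timeoutSeconds: 1           # Kubernetes default: the exec command must finish within 1 second
          failureThreshold: 3         # Kubernetes default: 3 consecutive failures restart the container
          successThreshold: 1         # Kubernetes default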

We suspect that the probe hangs while running grpc_health_probe, as we have also observed that Kubernetes struggles to terminate the pod (in some cases it takes 10-15 minutes to restart it).
I have checked the pod resource usage and it seems fine:
[screenshot: flyteadmin pod resource usage]
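
To check whether the gRPC probe itself is the part that hangs, the two probe commands can also be run by hand inside the pod (the deployment name and namespace below are placeholders, adjust to your release):

# HTTP health check, capped at 2 seconds
kubectl exec -n flyte deploy/flyteadmin -- \
  sh -c 'curl -s -o /dev/null -m 2 -w "%{http_code}\n" http://127.0.0.1:8088/healthcheck'

# gRPC health check with explicit connect/RPC timeouts
kubectl exec -n flyte deploy/flyteadmin -- \
  grpc_health_probe -addr=:8089 -connect-timeout=2s -rpc-timeout=2s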

We are running Flyte v1.11.0 with this flyteadmin config:

flyteadmin:
  replicaCount: null
  image:
    tag: v1.11.0-0.0.14
  maxParallelism: 25
  resources:
    limits:
      cpu: 250m
      ephemeral-storage: 5Gi
      memory: 500Mi
    requests:
      cpu: 10m
      ephemeral-storage: 5Gi
      memory: 50Mi
  hpa:
    enabled: true
    minReplicas: 4
  vault:
    ...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ml-platform-flyte-flyteadmin
          topologyKey: kubernetes.io/hostname

Any ideas / suggestions are welcome!

Expected behavior

The liveness probe succeeds consistently and the pod is not restarted.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
rambrus added the bug and untriaged labels on Jul 4, 2024

runllm bot commented Jul 4, 2024

The issue with the flyteadmin liveness probe timing out and triggering pod restarts could be related to the current configuration of the liveness probe. The default configuration uses both an HTTP health check and a gRPC health probe. This combined check might be causing the timeouts.

According to the flyte_helm_generated.yaml, the liveness probe is configured as follows:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
  initialDelaySeconds: 20
  periodSeconds: 5

You might consider simplifying the liveness probe to use only the HTTP health check or increasing the initialDelaySeconds and periodSeconds to give the probe more time to succeed. Here is an example of a simplified liveness probe:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;"]
  initialDelaySeconds: 30
  periodSeconds: 10
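
Alternatively, if you want to keep the combined HTTP + gRPC check, you could raise the probe's timeout explicitly, since exec probes default to timeoutSeconds: 1; for example:

livenessProbe:
  exec:
    command: [ "sh", "-c", "reply=$(curl -s -o /dev/null -w %{http_code} http://127.0.0.1:8088/healthcheck); if [ \"$reply\" -lt 200 -o \"$reply\" -ge 400 ]; then exit 1; fi;","grpc_health_probe", "-addr=:8089"]
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5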

Would you like more details on how to implement this change or further troubleshooting steps?



kumare3 commented Jul 4, 2024

@rambrus your problem is the way you have allocated resources to flyteadmin:

resources:
    limits:
      cpu: 250m
      ephemeral-storage: 5Gi
      memory: 500Mi
    requests:
      cpu: 10m
      ephemeral-storage: 5Gi
      memory: 50Mi

You have a 0.25 CPU limit, a 0.01 CPU request, and only 50 MB of memory requested. This is unacceptable for a production deployment for a company like Expedia :).

Please give it at least 2 CPUs and 2-4 GB of memory.
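
For example, something along these lines (numbers are illustrative, tune them to your load):

flyteadmin:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
      ephemeral-storage: 5Gi
    limits:
      cpu: "2"
      memory: 4Gi
      ephemeral-storage: 5Gi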

cc @davidmirror-ops maybe we should add a comment to the Helm chart resources section asking users to adjust these values.
