Failed to create pod sandbox, networkPlugin cni failed to set up pod <pod-name> network: add cmd: failed to assign an IP address to container when many IPs are present in subnets of EKS VPC #1755

Closed
ghost opened this issue Nov 16, 2021 · 14 comments

ghost commented Nov 16, 2021

@jayanthvn @srini-ram - Apologies if I tagged you unnecessarily, but I have been trying to solve this issue for the last two weeks. Please provide some pointers.

What happened:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e6f10db086b5b2c9475f6a1ce73facd840e680f2438da514e3ef17e8611c1249" network for pod "mlflow-7d6b7f6d5c-tn4b8": networkPlugin cni failed to set up pod "mlflow-7d6b7f6d5c-tn4b8_mlflow" network: add cmd: failed to assign an IP address to container. I checked the number of IP addresses across all subnets and there are plenty of free IP addresses.

Attach logs

I have downloaded the file generated by sudo bash /opt/cni/bin/aws-cni-support.sh. However, it contains the RDS username and password, so I am not attaching it here. Please let me know if I can send it to an email address, or which parts of the bundle I can post here.

What you expected to happen:
FailedCreatePodSandBox should not occur, and the liveness and readiness probes should pass.

How to reproduce it (as minimally and precisely as possible):
Execute all the steps in https://gist.github.com/suryakiran1006/eb632e8cd8f26c62c9ff99771a6e9c5f step by step. Once all the security groups are in place, replace the mlflow/values.yaml file with the values.yaml file attached in the gist and add sg-policy.yaml to the mlflow/templates folder.

Anything else we need to know?:
Cluster ARN - arn:aws:iam::${ACCOUNT_ID}:role/eksctl-mlflow-demo-cluster-ServiceRole-U511X80JPGW0
Is this with Custom networking - No
How many pods you have on this instance and the type of instance? - 5 pods on m5.xlarge (ondemand-nodegroup) and 4 pods on t3.xlarge (spot-nodegroup)
kubectl describe output of the mlflow pod - mlflow-7d6b7f6d5c-tn4b8.txt
kubectl describe output of aws-node - aws-node-xlf8q.txt

Environment:

  • Kubernetes version (use kubectl version): Client - 1.21; Server - 1.21
  • CNI version: 1.9
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux ip-172-31-69-92.ec2.internal 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ghost added the bug label Nov 16, 2021
@srini-ram (Contributor)

@suryakiran1006 Can you send the log files to [email protected] and @achevuru ([email protected])?

achevuru (Contributor) commented Nov 17, 2021

@suryakiran1006 I see that you're using SGPP and the pod did get an IP address; it is failing its liveness/readiness probes. You're on a 1.21 cluster, where the exec probe timeout is enforced, and you're using the default 1s timeout for both probes, so you need to adjust them accordingly (see the sketch after the events below). Please refer to #1425 regarding the liveness/readiness probe failures you're observing for aws-node.

  Normal   ResourceAllocated       10m                     vpc-resource-controller  Allocated [{"eniId":"eni-03e05bbb5ddc0a47e","ifAddress":"0e:08:a8:b2:5e:63","privateIp":"192.168.55.135","vlanId":1,"subnetCidr":"192.168.32.0/19"}] to the pod
  Normal   SandboxChanged          10m (x3 over 10m)       kubelet                  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 10m                     kubelet                  Pulling image "larribas/mlflow:1.7.2"
  Normal   Pulled                  10m                     kubelet                  Successfully pulled image "larribas/mlflow:1.7.2" in 7.898958956s
  Warning  Unhealthy               10m                     kubelet                  Readiness probe failed: Get "http://192.168.55.135:80/": dial tcp 192.168.55.135:80: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               9m39s (x3 over 9m59s)   kubelet                  Liveness probe failed: Get "http://192.168.55.135:80/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
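
If it helps, a rough sketch of bumping those exec-probe timeouts on the aws-node DaemonSet via a strategic-merge patch; the 10s value is only an example, and this assumes the probes already exist on the aws-node container (they do on recent CNI versions), so verify against your DaemonSet before applying:

kubectl -n kube-system patch daemonset aws-node -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"aws-node","livenessProbe":{"timeoutSeconds":10},"readinessProbe":{"timeoutSeconds":10}}]}}}}'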

ghost (Author) commented Nov 17, 2021

@srini-ram Sent, with the subject line starting with "Github Issue - 1755".

ghost (Author) commented Nov 17, 2021

@achevuru As per the instructions in https://aws.amazon.com/premiumsupport/knowledge-center/eks-cni-plugin-troubleshooting/, aws-node seems to be working fine, since the number of restarts is 0, even though kubectl describe pods aws-node shows readiness probe failures.
I increased timeoutSeconds for larribas/mlflow:1.7.2 to 999999, but the liveness and readiness probes are still failing.
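
For anyone following along, roughly the commands behind that check (the label selector is the standard one on the aws-node DaemonSet; pod name is the one from this cluster):

kubectl -n kube-system get pods -l k8s-app=aws-node -o wide   # restart counts
kubectl -n kube-system describe pod aws-node-xlf8q            # probe events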

achevuru (Contributor) commented Nov 17, 2021

@suryakiran1006 Have you verified the liveness/readiness probe paths you're using? Are they working? I also see the following in the pod spec you shared:

State:          Waiting
Reason:       CrashLoopBackOff

Also, did you check whether the application pod/container is actually up and running when the probes are failing? You can manually try to reach the probe paths you are using in your spec and see if the application is behaving as intended.
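
For example, something along these lines (pod name, namespace, and pod IP taken from this thread, and assuming curl/wget is available in the images; adjust as needed):

# From inside the container itself:
kubectl -n mlflow exec -it mlflow-7d6b7f6d5c-tn4b8 -- curl -v --max-time 5 http://localhost:80/
# From another pod, so the request actually traverses the branch ENI / pod security group:
kubectl -n mlflow run probe-test --rm -it --restart=Never --image=busybox --command -- \
  wget -qO- -T 5 http://192.168.55.135:80/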

ghost (Author) commented Nov 17, 2021

@achevuru When installing the Helm chart without SGPP, everything works without any problems, even the probes. After implementing SGPP, the mlflow pods fail and I cannot get into them either. After removing the probes, I don't get any errors and I can even access the RDS from within the pod. However, when I remove the probes from the deployment, I have noticed that the Events section of the mlflow pods shows "<none>" after some time.

As for the aws-node pod, kubectl describe just shows that it is unhealthy for some time (without any restarts) and then the Events section shows "<none>" after a while. I didn't try getting into the aws-node pods.

@achevuru (Contributor)

@suryakiran1006 I do see DISABLE_TCP_EARLY_DEMUX is set to true under the init container, and that is all that should be required for TCP probes to work. Did you try manually reaching the probe endpoints with SGPP enabled? Is the request reaching the container? When you say you can't get into the pods, do you mean kubectl exec fails? Did you verify that the container is coming up as expected? Maybe there is some dependency during container init that is failing due to the SGs attached to the pod, so it never bootstraps all the way. So it is good to check that the container is coming up to begin with.
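
To double-check that setting on your side, something like this should print the init container's environment (assuming the standard aws-node DaemonSet layout with a single init container):

kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.initContainers[0].env}'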

ghost (Author) commented Nov 17, 2021

@achevuru Will do, and I'll post the findings here.

ghost (Author) commented Nov 18, 2021

@achevuru Update: I can curl the probe endpoints (port 80) from within the pod successfully, but I cannot curl port 5000, which is the port of a service on the cluster. This issue seems eerily similar to #1695. Could this be the reason for the probe failures?

The Helm manifest looks as follows:

# Source: mlflow/templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "mlflow-test-connection"
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
  annotations:
    "helm.sh/hook": test-success
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['mlflow:5000']
  restartPolicy: Never
MANIFEST:
---
# Source: mlflow/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
---
# Source: mlflow/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mlflow
        app.kubernetes.io/instance: mlflow
    spec:
      serviceAccountName: mlflow-sa
      securityContext:
        {}
      containers:
        - name: mlflow
          securityContext:
            {}
          image: "public.ecr.aws/n3k4k7j4/k8_mlflow_w_debug_utils:latest"
          imagePullPolicy: Always
          command:
            - mlflow
            - server
          args:
            - --host=0.0.0.0
            - --port=80
            - --default-artifact-root=s3://xxxxxxxxxxxxxxxxxxxx
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: http
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          readinessProbe:
            httpGet:
              path: /
              port: http           
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          resources:
            {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: eks.amazonaws.com/nodegroup
                operator: In
                values:
                - ondemand-nodegroup
---
# Source: mlflow/templates/sg-policy.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  securityGroups:
    groupIds:
      - sg-xxxxxxxxxxxxxxxx
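
(As a quick sanity check that the Service above is actually wired to the pod, assuming everything is deployed in the mlflow namespace:)

kubectl -n mlflow get svc mlflow
kubectl -n mlflow get endpoints mlflow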

achevuru self-assigned this Nov 19, 2021
@achevuru (Contributor)

@suryakiran1006 You mentioned in a previous comment that you were not able to get into pods with SGPP enabled, but it appears that you are now able to exec into pods and curl the probe endpoints via localhost. What changed?

As for curl not working over the service port, have you cross-checked the SG tied to the pod? The issue you linked shouldn't be related.
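
For example, the inbound rules on the pod's SG can be dumped with something like this (the group ID is the placeholder from the SecurityGroupPolicy above):

aws ec2 describe-security-groups --group-ids sg-xxxxxxxxxxxxxxxx \
  --query 'SecurityGroups[0].IpPermissions'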

ghost (Author) commented Nov 19, 2021 via email

ghost (Author) commented Nov 19, 2021

@achevuru It's working now. I had not allowed inbound traffic from the cluster security group to the pod security group on the probe ports. Once that was done, the whole setup started working.

Please let me know if you have any questions on this, or else feel free to close this issue. Thank you so much for your help, much appreciated.

For future readers: if you are facing timeout issues with liveness and readiness probes, check whether you have allowed ingress from the cluster security group to the pod security group on the probe ports (as described in the Important callout of https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html#security-groups-pods-deployment). A rough sketch of the missing rule is below.
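
(Group IDs are placeholders; port 80 is the probe/target port from the deployment above.)

aws ec2 authorize-security-group-ingress \
  --group-id sg-POD_SECURITY_GROUP \
  --protocol tcp --port 80 \
  --source-group sg-CLUSTER_SECURITY_GROUP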

@achevuru (Contributor)

@suryakiran1006 Good to hear the issue is resolved. I will close this ticket; any other issues that pop up are better tracked via new, separate tickets.

@irl-segfault

@suryakiran1006 If this is related to security groups, why were some pods able to come up in your cluster but not the aws-node pods? I just ran into this: aws-node is crashing, but nothing related to security groups has changed recently.
