Failed to create pod sandbox, networkPlugin cni failed to set up pod <pod-name> network: add cmd: failed to assign an IP address to container when many IPs are present in subnets of EKS VPC #1755

Closed
ghost opened this issue Nov 16, 2021 · 14 comments

ghost commented Nov 16, 2021

@jayanthvn @srini-ram - Apologies if I tagged you unnecessarily, but I have been trying to solve this issue for the last two weeks. Please provide some pointers.

What happened:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e6f10db086b5b2c9475f6a1ce73facd840e680f2438da514e3ef17e8611c1249" network for pod "mlflow-7d6b7f6d5c-tn4b8": networkPlugin cni failed to set up pod "mlflow-7d6b7f6d5c-tn4b8_mlflow" network: add cmd: failed to assign an IP address to container. I checked the number of IP addresses across all subnets and there are plenty of free IP addresses.

Attach logs

I have downloaded the file generated by sudo bash /opt/cni/bin/aws-cni-support.sh. However, it contains the RDS username and password, so I am not attaching it here. Please let me know if I can send it to an email address, or which parts of the bundle I can post here.

What you expected to happen:
FailedCreatePodSandBox should not occur, and the liveness and readiness probes should pass.

How to reproduce it (as minimally and precisely as possible):
Execute all the steps in https://gist.github.com/suryakiran1006/eb632e8cd8f26c62c9ff99771a6e9c5f step by step. Once all the security groups are in place, replace the mlflow/values.yaml file with the values.yaml file attached in the gist and add sg-policy.yaml to the mlflow/templates folder.

Anything else we need to know?:
Cluster ARN - arn:aws:iam::${ACCOUNT_ID}:role/eksctl-mlflow-demo-cluster-ServiceRole-U511X80JPGW0
Is this with Custom networking - No
How many pods you have on this instance and the type of instance? - 5 pods on m5.xlarge (ondemand-nodegroup) and 4 pods on t3.xlarge (spot-nodegroup)
kubectl describe output of the mlflow pod - mlflow-7d6b7f6d5c-tn4b8.txt
kubectl describe output of aws-node - aws-node-xlf8q.txt

Environment:

  • Kubernetes version (use kubectl version): Client - 1.21; Server - 1.21
  • CNI version: 1.9
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux ip-172-31-69-92.ec2.internal 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ghost added the bug label Nov 16, 2021
@srini-ram (Contributor)

@suryakiran1006 Can you send the log files to [email protected] and @achevuru ([email protected])?

achevuru (Contributor) commented Nov 17, 2021

@suryakiran1006 I see that you're using SGPP and the pod did get an IP address; it is failing its liveness/readiness probes. You're on a 1.21 cluster, where the exec probe timeout is enforced, and you're using the default 1s timeout for both probes, so you need to adjust them accordingly (see the sketch after the events below). Please refer to #1425 regarding the liveness/readiness probe failures you're observing for aws-node.

  Normal   ResourceAllocated       10m                     vpc-resource-controller  Allocated [{"eniId":"eni-03e05bbb5ddc0a47e","ifAddress":"0e:08:a8:b2:5e:63","privateIp":"192.168.55.135","vlanId":1,"subnetCidr":"192.168.32.0/19"}] to the pod
  Normal   SandboxChanged          10m (x3 over 10m)       kubelet                  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 10m                     kubelet                  Pulling image "larribas/mlflow:1.7.2"
  Normal   Pulled                  10m                     kubelet                  Successfully pulled image "larribas/mlflow:1.7.2" in 7.898958956s
  Warning  Unhealthy               10m                     kubelet                  Readiness probe failed: Get "http://192.168.55.135:80/": dial tcp 192.168.55.135:80: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               9m39s (x3 over 9m59s)   kubelet                  Liveness probe failed: Get "http://192.168.55.135:80/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
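
If it helps, a rough sketch of bumping those exec-probe timeouts on the aws-node DaemonSet via a strategic-merge patch; the 10s value is only an example, and this assumes the probes already exist on the aws-node container (they do on recent CNI versions), so verify against your DaemonSet before applying:

kubectl -n kube-system patch daemonset aws-node -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"aws-node","livenessProbe":{"timeoutSeconds":10},"readinessProbe":{"timeoutSeconds":10}}]}}}}'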

ghost (Author) commented Nov 17, 2021

@srini-ram Sent, with the subject line starting with "Github Issue - 1755".

ghost (Author) commented Nov 17, 2021

@achevuru As per the instructions in https://aws.amazon.com/premiumsupport/knowledge-center/eks-cni-plugin-troubleshooting/, aws-node seems to be working fine, since the number of restarts is 0, even though kubectl describe pods aws-node shows readiness probe failures.
I increased timeoutSeconds for larribas/mlflow:1.7.2 to 999999, but the liveness and readiness probes are still failing.
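
For anyone following along, roughly the commands behind that check (the label selector is the standard one on the aws-node DaemonSet; pod name is the one from this cluster):

kubectl -n kube-system get pods -l k8s-app=aws-node -o wide   # restart counts
kubectl -n kube-system describe pod aws-node-xlf8q            # probe events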

achevuru (Contributor) commented Nov 17, 2021

@suryakiran1006 Have you verified the liveness/readiness probe paths you're using? Are they working? I also see the following in the pod spec you shared:

State:          Waiting
Reason:       CrashLoopBackOff

Also, did you check whether the application pod/container is actually up and running when the probes are failing? You can manually try to reach the probe paths you are using in your spec and see if the application is behaving as intended.
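
For example, something along these lines (pod name, namespace, and pod IP taken from this thread, and assuming curl/wget is available in the images; adjust as needed):

# From inside the container itself:
kubectl -n mlflow exec -it mlflow-7d6b7f6d5c-tn4b8 -- curl -v --max-time 5 http://localhost:80/
# From another pod, so the request actually traverses the branch ENI / pod security group:
kubectl -n mlflow run probe-test --rm -it --restart=Never --image=busybox --command -- \
  wget -qO- -T 5 http://192.168.55.135:80/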

ghost (Author) commented Nov 17, 2021

@achevuru When installing the Helm chart without SGPP, everything works without any problems, even the probes. After implementing SGPP, the mlflow pods fail and I cannot get into them either. After removing the probes, I don't get any errors and I can even access the RDS from within the pod. However, when I remove the probes from the deployment, I have noticed that the Events section of the mlflow pods shows "<none>" after some time.

As for the aws-node pod, kubectl describe just shows that it is unhealthy for some time (without any restarts) and then the Events section shows "<none>" after a while. I didn't try getting into the aws-node pods.

@achevuru (Contributor)

@suryakiran1006 I do see DISABLE_TCP_EARLY_DEMUX is set to true under the init container, and that is all that should be required for TCP probes to work. Did you try manually reaching the probe endpoints with SGPP enabled? Is the request reaching the container? When you say you can't get into the pods, do you mean kubectl exec fails? Did you verify that the container is coming up as expected? Maybe there is some dependency during container init that is failing due to the SGs attached to the pod, so it never bootstraps all the way. So it is good to check that the container is coming up to begin with.
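
To double-check that setting on your side, something like this should print the init container's environment (assuming the standard aws-node DaemonSet layout with a single init container):

kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.initContainers[0].env}'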

ghost (Author) commented Nov 17, 2021

@achevuru Will do, and I'll post the findings here.

ghost (Author) commented Nov 18, 2021

@achevuru Update: I can curl the probe endpoints (port 80) from within the pod successfully, but I cannot curl port 5000, which is the port of a service on the cluster. This issue seems eerily similar to #1695. Could this be the reason for the probe failures?

The Helm manifest looks as follows:

# Source: mlflow/templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "mlflow-test-connection"
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
  annotations:
    "helm.sh/hook": test-success
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['mlflow:5000']
  restartPolicy: Never
MANIFEST:
---
# Source: mlflow/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
---
# Source: mlflow/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mlflow
        app.kubernetes.io/instance: mlflow
    spec:
      serviceAccountName: mlflow-sa
      securityContext:
        {}
      containers:
        - name: mlflow
          securityContext:
            {}
          image: "public.ecr.aws/n3k4k7j4/k8_mlflow_w_debug_utils:latest"
          imagePullPolicy: Always
          command:
            - mlflow
            - server
          args:
            - --host=0.0.0.0
            - --port=80
            - --default-artifact-root=s3://xxxxxxxxxxxxxxxxxxxx
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: http
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          readinessProbe:
            httpGet:
              path: /
              port: http           
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          resources:
            {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: eks.amazonaws.com/nodegroup
                operator: In
                values:
                - ondemand-nodegroup
---
# Source: mlflow/templates/sg-policy.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  securityGroups:
    groupIds:
      - sg-xxxxxxxxxxxxxxxx
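
(As a quick sanity check that the Service above is actually wired to the pod, assuming everything is deployed in the mlflow namespace:)

kubectl -n mlflow get svc mlflow
kubectl -n mlflow get endpoints mlflow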

achevuru self-assigned this Nov 19, 2021
@achevuru (Contributor)

@suryakiran1006 You mentioned in a previous comment that you were not able to get into pods with SGPP enabled, but it appears that you are now able to exec into pods and curl the probe endpoints via localhost. What changed?

As for curl not working over the service port, have you cross-checked the SG tied to the pod? The issue you linked shouldn't be related.
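
For example, the inbound rules on the pod's SG can be dumped with something like this (the group ID is the placeholder from the SecurityGroupPolicy above):

aws ec2 describe-security-groups --group-ids sg-xxxxxxxxxxxxxxxx \
  --query 'SecurityGroups[0].IpPermissions'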

ghost (Author) commented Nov 19, 2021 via email

ghost (Author) commented Nov 19, 2021

@achevuru It's working now. I had not allowed inbound traffic from the cluster security group to the pod security group on the probe ports. Once that was done, the whole setup started working.

Please let me know if you have any questions on this, or else feel free to close this issue. Thank you so much for your help, much appreciated.

For future readers: if you are facing timeout issues with liveness and readiness probes, check whether you have allowed ingress from the cluster security group to the pod security group on the probe ports (as described in the Important callout of https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html#security-groups-pods-deployment). A rough sketch of the missing rule is below.
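
(Group IDs are placeholders; port 80 is the probe/target port from the deployment above.)

aws ec2 authorize-security-group-ingress \
  --group-id sg-POD_SECURITY_GROUP \
  --protocol tcp --port 80 \
  --source-group sg-CLUSTER_SECURITY_GROUP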

@achevuru (Contributor)

@suryakiran1006 Good to hear the issue is resolved. I will close this ticket; any other issues that pop up are better tracked via new, separate tickets.

@irl-segfault

@suryakiran1006 If this is related to security groups, why were some pods able to come up in your cluster but not the aws-node pods? I just ran into this: aws-node is crashing, but nothing related to security groups has changed recently.
