Failed to create pod sandbox, networkPlugin cni failed to set up pod <pod-name> network: add cmd: failed to assign an IP address to container when many IPs are present in subnets of EKS VPC #1755
Comments
@suryakiran1006 - Can you send the log files to [email protected] and @achevuru ([email protected])?
@suryakiran1006 I see that you're using SGPP (security groups for pods) and that the pod did get an IP address; it is failing its liveness/readiness probes. You're on a 1.21 cluster, where the exec timeout is enforced, and you're using the default
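(For reference: a quick way to see which probe settings a Deployment actually has, assuming the mlflow namespace and Deployment name used elsewhere in this thread - if timeoutSeconds is absent, the Kubernetes default of 1 second applies.)

```bash
# Print the rendered liveness/readiness probe settings of the mlflow Deployment
kubectl -n mlflow get deployment mlflow \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
```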
@srini-ram Sent, with the subject line starting with "Github Issue - 1755".
@achevuru As per the instructions in https://aws.amazon.com/premiumsupport/knowledge-center/eks-cni-plugin-troubleshooting/, aws-node seems to be working fine, since its number of restarts is 0, even though kubectl describe pods aws-node shows that the readiness probe failed.
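(A few commands that help judge whether aws-node itself is healthy; the pod name below is the one from this issue and will differ in other clusters.)

```bash
# Restart counts and node placement of the aws-node DaemonSet pods
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# Recent events and probe results for the affected aws-node pod
kubectl -n kube-system describe pod aws-node-xlf8q

# ipamd/CNI logs from inside the aws-node container
kubectl -n kube-system logs aws-node-xlf8q -c aws-node
```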
@suryakiran1006 Have you verified the liveness/readiness probe paths you're using? Are they working? I also see the below in the pod spec you shared..
Also, did you check whether the application pod/container is actually up and running when the probes are failing? You can manually try to reach the probe paths you are using in your spec and see if they behave as intended.
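(One way to run that manual check, assuming curl is available in the image; the pod name is the one from this issue, and per the manifest shared later in the thread the container serves HTTP on port 80 at path /.)

```bash
# Exec into the application pod and hit the same path/port the kubelet probes
kubectl -n mlflow exec -it mlflow-7d6b7f6d5c-tn4b8 -- curl -sv http://localhost:80/
```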
@achevuru When installing the Helm chart without SGPP, everything works without any problems - even the probes. After enabling SGPP, the mlflow pods fail and I cannot exec into them either. After removing the probes, I get no errors and I can even access the RDS from within the pod. I have also noticed that, with the probes removed from the deployment, the Events section of the mlflow pods shows "<none>" after some time. As for the aws-node pod, kubectl describe just shows that it is unhealthy for some time (without any restarts), and then its Events section shows "<none>" after some time. I didn't try exec'ing into the aws-node pods.
@suryakiran1006 I do see
@achevuru Will do, and I will post the findings here.
@achevuru Update - I can curl the probe endpoints (port 80) from within the pod successfully, but I am failing to curl port 5000, which is the port of a service on the cluster. This issue seems eerily similar to #1695. Could this be the reason for the probe failure? The Helm manifest looks as follows:

```yaml
# Source: mlflow/templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "mlflow-test-connection"
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
  annotations:
    "helm.sh/hook": test-success
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['mlflow:5000']
  restartPolicy: Never
```

MANIFEST:

```yaml
---
# Source: mlflow/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
---
# Source: mlflow/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mlflow
        app.kubernetes.io/instance: mlflow
    spec:
      serviceAccountName: mlflow-sa
      securityContext:
        {}
      containers:
        - name: mlflow
          securityContext:
            {}
          image: "public.ecr.aws/n3k4k7j4/k8_mlflow_w_debug_utils:latest"
          imagePullPolicy: Always
          command:
            - mlflow
            - server
          args:
            - --host=0.0.0.0
            - --port=80
            - --default-artifact-root=s3://xxxxxxxxxxxxxxxxxxxx
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: http
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          readinessProbe:
            httpGet:
              path: /
              port: http
            initialDelaySeconds: 600
            periodSeconds: 30
            timeoutSeconds: 600
          resources:
            {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/nodegroup
                    operator: In
                    values:
                      - ondemand-nodegroup
---
# Source: mlflow/templates/sg-policy.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: mlflow
  labels:
    helm.sh/chart: mlflow-1.0.1
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/version: "1.7.2"
    app.kubernetes.io/managed-by: Helm
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mlflow
      app.kubernetes.io/instance: mlflow
  securityGroups:
    groupIds:
      - sg-xxxxxxxxxxxxxxxx
```
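(The same service-connectivity check can be run ad hoc, without the Helm test hook; namespace, image, and Service name follow the manifest above.)

```bash
# One-off pod that tries to reach the mlflow Service on port 5000, then cleans itself up
kubectl -n mlflow run mlflow-conn-test --rm -it --restart=Never --image=busybox -- \
  wget -qO- -T 5 http://mlflow:5000/
```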
@suryakiran1006 You mentioned in a previous comment that you were not able to get into pods with SGPP enabled, but it appears that you are now able to exec into pods and can curl the probe endpoints via localhost. What changed? As for curl not working over the service port, have you cross-checked the SG tied to the pod? The issue you linked shouldn't be related.
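(With security groups for pods, the branch ENI and the security groups actually attached to a pod can be cross-checked like this; the pod name is from this issue and the ENI ID is a placeholder.)

```bash
# The pod-eni annotation shows the branch ENI and IP assigned to the pod
kubectl -n mlflow get pod mlflow-7d6b7f6d5c-tn4b8 -o yaml | grep "vpc.amazonaws.com/pod-eni"

# Inspect which security groups are attached to that branch ENI
aws ec2 describe-network-interfaces --network-interface-ids eni-0123456789abcdef0 \
  --query 'NetworkInterfaces[].Groups' --output table
```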
@achevuru I increased initialDelaySeconds to 600 seconds, which gave me enough time to get into the pod and curl localhost. Once the 600 seconds are up, kubectl exec terminates with error 137 (i.e., the container gets SIGKILLed, consistent with the kubelet restarting it once the liveness probe starts failing).
As for not being able to curl the service port, you are correct - I wasn't able to curl it even from a dummy pod that wasn't associated with the security group.
@achevuru It's working now. I had not allowed inbound traffic from the cluster security group to the pod security group on the probe ports. Once that was done, the whole setup started working. Please let me know if you have any questions; otherwise, please feel free to close this issue. Thank you so much for your help - much appreciated. For future readers: if you are facing timeout issues with liveness and readiness probes, please check whether you have allowed ingress from the cluster security group to the pod security group (as described in the security groups for pods documentation).
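(The fix described above, sketched with the AWS CLI; the security group IDs are placeholders and the port is the probe port from this thread.)

```bash
# Allow the cluster security group (source of kubelet probe traffic) to reach
# the pod security group on the port targeted by the probes (80 here)
aws ec2 authorize-security-group-ingress \
  --group-id sg-POD_SECURITY_GROUP \
  --protocol tcp \
  --port 80 \
  --source-group sg-CLUSTER_SECURITY_GROUP
```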
@suryakiran1006 Good to hear the issue is resolved. I will close this ticket; it's better to track any other issues that might pop up via new, separate tickets.
@suryakiran1006 If this is related to security groups, why were some pods able to come up in your cluster but not the aws-node pods? I just ran into this where aws-node is crashing, but nothing regarding security groups has changed recently.
@jayanthvn @srini-ram - Apologies if I tagged you unnecessarily, but I have been trying to solve this issue for the last two weeks. Please provide some pointers.
What happened:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e6f10db086b5b2c9475f6a1ce73facd840e680f2438da514e3ef17e8611c1249" network for pod "mlflow-7d6b7f6d5c-tn4b8": networkPlugin cni failed to set up pod "mlflow-7d6b7f6d5c-tn4b8_mlflow" network: add cmd: failed to assign an IP address to container. I checked the number of IP addresses across all subnets and there are plenty of free IP addresses.
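(Two places worth checking when this error appears; the VPC ID is a placeholder, and the second command is run on the affected node, where ipamd exposes an introspection endpoint.)

```bash
# Free IP addresses per subnet in the cluster VPC
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query 'Subnets[].{Subnet:SubnetId,AZ:AvailabilityZone,FreeIPs:AvailableIpAddressCount}' \
  --output table

# On the node: ipamd's view of its ENIs and warm IP pool
curl -s http://localhost:61679/v1/enis
```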
Attach logs
I have downloaded the file generated by sudo bash /opt/cni/bin/aws-cni-support.sh. However, it contains the username and password of the RDS instance, so I am not attaching it here. Please let me know whether I can send it to an email address, or which part of the archive I can share here.
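(A way to review what the support bundle contains before sharing it; the archive location below is an assumption - the script prints the actual path when it finishes.)

```bash
# List the contents of the collected bundle without extracting it (path is an assumed example)
sudo tar -tzf /var/log/eks_i-*.tar.gz

# Extract to a scratch directory and search for anything that still needs scrubbing
mkdir -p /tmp/cni-bundle && sudo tar -xzf /var/log/eks_i-*.tar.gz -C /tmp/cni-bundle
grep -ril "password" /tmp/cni-bundle
```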
What you expected to happen:
FailedCreatePodSandBox should not occur, and the liveness and readiness probes should pass successfully.
How to reproduce it (as minimally and precisely as possible):
Execute all the steps in https://gist.github.com/suryakiran1006/eb632e8cd8f26c62c9ff99771a6e9c5f step by step. Once all the security groups are in place, replace the mlflow/values.yaml file with the values.yaml attached in the gist and add sg-policy.yaml to the mlflow/templates folder.
Anything else we need to know?:
Cluster ARN - arn:aws:iam::${ACCOUNT_ID}:role/eksctl-mlflow-demo-cluster-ServiceRole-U511X80JPGW0
Is this with Custom networking - No
How many pods you have on this instance and the type of instance? - 5 pods on m5.xlarge (ondemand-nodegroup) and 4 pods on t3.xlarge (spot-nodegroup)
kubectl describe output of the mlflow pod - mlflow-7d6b7f6d5c-tn4b8.txt
kubectl describe output of aws-node - aws-node-xlf8q.txt
Environment:
- Kubernetes version (kubectl version): Client - 1.21; Server - 1.21
- OS (cat /etc/os-release): Amazon Linux 2
- Kernel (uname -a): Linux ip-172-31-69-92.ec2.internal 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux