CNI not removing network built on a node after IP is lost externally and IPAMD reconciles this state #2834

Closed
AbeOwlu opened this issue Mar 8, 2024 · 7 comments
Labels
question, stale

AbeOwlu commented Mar 8, 2024

IPAM reconciliation:
Scenario:

  • A pod is created and assigned an IP, 10.0.2.99.
  • After sandbox initialization completes, the IP is reclaimed by automation in the network, external to the cluster.
  • The IPAMD logs show an IP pool reconcile that catches the lost IP and reconciles its cache by calling the EC2 endpoint.
  • The network route for this pod (IP 10.0.2.99) remains unchanged on the local node. Peer nodes can no longer reach the pod on 10.0.2.99, but it is still reachable from its own host, so Kubernetes liveness probes keep succeeding and an unhealthy pod stays in the cluster.

{"level":"debug","ts":"2024-03-08T18:10:50.378Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"liveness-http\" K8S_POD_NAMESPACE:\"gateway-ns\" K8S_POD_INFRA_CONTAINER_ID:\"7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380\" ContainerID:\"7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-d4e752dc-bdf7-f594-2a1a-38dfa2445dfb\""}

{"level":"info","ts":"2024-03-08T18:10:50.378Z","caller":"datastore/data_store.go:750","msg":"AssignPodIPv4Address: Assign IP 10.0.2.99 to sandbox aws-cni/7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380/eth0"}

External automation event. Event time:
March 08, 2024, 18:11:25 (UTC+00:00) UnassignPrivateIpAddresses  "privateIpAddress": "10.0.2.99"

{"level":"warn","ts":"2024-03-08T18:12:00.256Z","caller":"ipamd/ipamd.go:1404","msg":"Instance metadata does not match data store! ipPool: [10.0.2.99 10.0.2.27 10.0.2.158], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"10.0.2.149\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"10.0.2.27\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"10.0.2.158\"\n}]"}

{"level":"info","ts":"2024-03-08T18:12:00.334Z","caller":"datastore/data_store.go:578","msg":"UnAssignPodIPAddress: Unassign IP 10.0.2.99 from sandbox aws-cni/7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380/eth0"}
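To make the warning above concrete, here is a minimal sketch of the comparison it reports, written as a plain set difference. The function name `lost_ips` is hypothetical and is not taken from the IPAMD source; the input values are copied from the "Instance metadata does not match data store!" log line.

```python
def lost_ips(ip_pool, metadata):
    """Return IPs the datastore still holds that instance metadata no longer lists."""
    attached = {entry["PrivateIpAddress"] for entry in metadata}
    return sorted(set(ip_pool) - attached)


# Values copied from the warning logged at 18:12:00.
ip_pool = ["10.0.2.99", "10.0.2.27", "10.0.2.158"]
metadata = [
    {"Primary": True, "PrivateIpAddress": "10.0.2.149"},
    {"Primary": False, "PrivateIpAddress": "10.0.2.27"},
    {"Primary": False, "PrivateIpAddress": "10.0.2.158"},
]

print(lost_ips(ip_pool, metadata))  # ['10.0.2.99']
```

The only address in the pool that is missing from the metadata is 10.0.2.99, which matches the unassign logged immediately afterwards.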

What you expected to happen:

  • After the event "UnAssignPodIPAddress: Unassign IP 10.0.2.99 from sandbox aws-cni/7f9240...", the CNI is triggered to tear down the network route for this IP, so that the liveness probe eventually fails and Kubernetes attempts to heal the pod.
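One way to check for the leftover state described here is to compare the host's single-IP routes against the IPs the datastore still considers assigned. The sketch below is only an illustration: `stale_pod_routes` is a hypothetical helper, the route lines are made-up sample text in the style of `ip route show` output (the interface names are invented), and it assumes pod routes appear as host routes in the main table.

```python
import ipaddress


def stale_pod_routes(route_lines, assigned_ips):
    """Return single-IP route destinations that are not in assigned_ips."""
    assigned = {ipaddress.ip_address(ip) for ip in assigned_ips}
    stale = []
    for line in route_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            net = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            continue  # skips lines such as "default via ..."
        # A host route covers exactly one address (prefix length 32 for IPv4).
        if net.prefixlen == net.max_prefixlen and net.network_address not in assigned:
            stale.append(str(net.network_address))
    return stale


# Illustrative sample only, not captured from the affected node.
sample_routes = [
    "default via 10.0.2.1 dev eth0",
    "10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.149",
    "10.0.2.27 dev eni0aaa scope link",
    "10.0.2.99 dev eni0bbb scope link",
]

print(stale_pod_routes(sample_routes, ["10.0.2.27", "10.0.2.158"]))  # ['10.0.2.99']
```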

How to reproduce it (as minimally and precisely as possible):

  • Create a pod with liveness and readiness probes, for example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness3
  name: liveness-http3
spec:
  containers:
  - name: ngo-proxy
    image: gcr.io/google_containers/echoserver:1.4
    # args:
    # - /server
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        # httpHeaders:
        # - name: Custom-Header
        #   value: Awesome
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
      # initialDelaySeconds: 50
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 2
  restartPolicy: Always
  • Remove the pod's IP from the node the pod is scheduled on, at any time.

Anything else we need to know?:

  • during the sweep phase of the nodeIPPoolReconcile process, should the CNI be invoked to updateHostNetwork for the removed IPs?
  • see issue
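The behavior asked about in the first bullet could look roughly like the sketch below: a sweep that drops datastore entries for IPs no longer attached and, for entries still bound to a pod, calls a teardown hook. This illustrates the proposal only and is not how IPAMD is implemented; `sweep`, the `teardown` callback, and the dictionary-based datastore are all hypothetical.

```python
def sweep(datastore, attached_ips, teardown):
    """Drop entries whose IP is no longer attached; tear down those still in use.

    datastore maps IP -> pod identifier, or None when the IP is free.
    """
    removed = []
    for ip in list(datastore):
        if ip in attached_ips:
            continue
        pod = datastore.pop(ip)
        if pod is not None:
            # Hypothetical hook: this is where host routes/rules would be removed.
            teardown(ip, pod)
        removed.append(ip)
    return removed


datastore = {
    "10.0.2.99": "gateway-ns/liveness-http",
    "10.0.2.27": None,
    "10.0.2.158": None,
}
torn_down = []
removed = sweep(
    datastore,
    attached_ips={"10.0.2.149", "10.0.2.27", "10.0.2.158"},
    teardown=lambda ip, pod: torn_down.append((ip, pod)),
)
print(removed)    # ['10.0.2.99']
print(torn_down)  # [('10.0.2.99', 'gateway-ns/liveness-http')]
```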

Environment:

  • Kubernetes version (use kubectl version):
  • CNI Version: image: 602401143452.dkr.ecr.us-west-1.amazonaws.com/amazon-k8s-cni-init:v1.15.3-eksbuild.1
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
Linux ....compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@AbeOwlu AbeOwlu added the bug label Mar 8, 2024
Contributor

jdn5126 commented Mar 8, 2024

@AbeOwlu what is this "external event" that reclaims an IP on an ENI? Only the IPAM daemon should be assigning and unassigning IPs to an ENI. Before calling the EC2 API to unassign IPs, it removes those IPs from the datastore. That precondition is required to avoid this exact scenario
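The ordering described above (datastore first, EC2 call second) can be sketched as follows. This is a simplified illustration, not the IPAMD code: `release_ip` and the `unassign_from_eni` callback are hypothetical names, and refusing to release an IP that is still assigned to a pod is an assumption made for the example.

```python
def release_ip(datastore, ip, unassign_from_eni):
    """Remove a free IP from the datastore, then unassign it via the cloud API."""
    if ip not in datastore:
        return False
    if datastore[ip] is not None:
        return False  # assumption: an IP still assigned to a pod is never released
    del datastore[ip]      # step 1: the datastore forgets the IP
    unassign_from_eni(ip)  # step 2: only then is the ENI changed
    return True


datastore = {"10.0.2.27": None, "10.0.2.99": "gateway-ns/liveness-http"}
api_calls = []
print(release_ip(datastore, "10.0.2.27", api_calls.append))  # True
print(release_ip(datastore, "10.0.2.99", api_calls.append))  # False
print(api_calls)                                             # ['10.0.2.27']
```

An external process that unassigns 10.0.2.99 directly bypasses step 1, which is how the datastore and the ENI end up disagreeing.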

@jdn5126 jdn5126 added question and removed bug labels Mar 8, 2024
Author

AbeOwlu commented Mar 13, 2024

There's an automation pipeline that is, incorrectly I might add, seeing drift in the VPC network and unassigning an IP from an EC2 instance at the moment.

  • Looking into this further, it actually appears that the CRI attempted to recreate the container sandbox, but the CNI was not responsive (connection refused on all 3 attempts), so the orchestrator may be handling this case.

Will update with more details and logs...

Contributor

GnatorX commented Apr 19, 2024

I think I hit this issue too. Let me circle back with some more info

Contributor

GnatorX commented May 31, 2024

We had this issue: aws/amazon-vpc-resource-controller-k8s#412 deleted branch ENIs from pods. The CNI didn't do anything about the missing network interface or lost IP address.

@orsenthil
Member

@AbeOwlu - The CNI will not remove any interface that it doesn't manage. For any external changes introduced to the interfaces that the CNI manages, if they are not in use, it will garbage collect them. If that didn't happen, and you can reproduce this as a bug, let us know. Otherwise, we can close this ticket.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale label Aug 26, 2024

github-actions bot commented Sep 9, 2024

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned Sep 9, 2024