
Handle pod delete with non-existent IPs #44

Open
wants to merge 3 commits into base: master

Conversation

ashish-billore

This patch makes whereabouts pods deployed as a DaemonSet use
hostNetwork by default. This is helpful because CNI and other core
infra components use hostNetwork, which avoids issues such as
faults in the CNI config and also makes debugging much easier.

@ashish-billore
Author

ashish-billore commented Jul 9, 2020

@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion.
Thanks!

@@ -79,8 +79,12 @@ func cmdDel(args *skel.CmdArgs) error {

    _, err = storage.IPManagement(types.Deallocate, *ipamConf, args.ContainerID)
    if err != nil {
        logging.Errorf("Error deallocating IP: %s", err)
        return fmt.Errorf("Error deallocating IP: %s", err)
        if (strings.Contains(fmt.Sprintf("%s", err), "Did not find reserved IP for container")) {
Collaborator
I would opt for returning a type we can compare against instead of string comparison: https://github.com/dougbtv/whereabouts/blob/d879bd867ede305511260028ae1c4c2f049271f3/pkg/allocate/allocate.go#L61-L63

Something like:

  // IPNotFoundError represents an error finding a specific IP during the deallocation process for a container.
  type IPNotFoundError struct {
      containerID string
  }

  func (e *IPNotFoundError) Error() string {
      return fmt.Sprintf("Did not find reserved IP for container %v", e.containerID)
  }

and to compare:

    _, err = storage.IPManagement(types.Deallocate, *ipamConf, args.ContainerID)
    if err != nil {
        if notFound := &(allocate.IPNotFoundError{}); errors.As(err, &notFound) {
            logging.Debugf("Cannot deallocate IP, it might not have been assigned by us. %s", err)
        } else {
            logging.Errorf("Error deallocating IP: %s", err)
            return fmt.Errorf("Error deallocating IP: %s", err)
        }
    }
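
For readers following along, here is a self-contained sketch of how that errors.As comparison behaves; the deallocate helper and its error wrapping are hypothetical stand-ins for storage.IPManagement, not the actual whereabouts code:

    package main

    import (
        "errors"
        "fmt"
    )

    // IPNotFoundError mirrors the type suggested above: a DEL request arrived for a
    // container whose IP was never reserved by this IPAM.
    type IPNotFoundError struct {
        containerID string
    }

    func (e *IPNotFoundError) Error() string {
        return fmt.Sprintf("Did not find reserved IP for container %v", e.containerID)
    }

    // deallocate is a hypothetical stand-in for storage.IPManagement(types.Deallocate, ...);
    // it wraps the typed error so callers can detect it with errors.As.
    func deallocate(containerID string) error {
        return fmt.Errorf("deallocation failed: %w", &IPNotFoundError{containerID: containerID})
    }

    func main() {
        err := deallocate("4cf6d1483120")
        notFound := &IPNotFoundError{}
        if errors.As(err, &notFound) {
            // Debug-level event: the IP may simply never have been assigned by us.
            fmt.Println("debug: cannot deallocate IP, it might not have been assigned by us:", err)
            return
        }
        fmt.Println("error deallocating IP:", err)
    }

One nice property of errors.As is that it still matches after the typed error has been wrapped with %w, so intermediate layers can add context without breaking the check.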

Author

@crandles Thanks!

Sounds good to me.

Also, I realized this PR has 2 unrelated patches clubbed together:

  1. 5cb45d9: To make whereabouts use hostNetwork by default
  2. 7083aa4: Handle non-existent IP del request gracefully

I'll split them into two PRs for a better workflow and incorporate the above comment when I resubmit.

Thanks!

Added graceful handling for cases when the IPAM gets DEL
requests for IPs it did not allocate or is not managing.

This situation can happen when whereabouts is added later as the IPAM
for an additional CNI in an already-deployed cluster that is processing
pod deletions. When this happens, kubelet keeps calling whereabouts
with DEL requests for IPs it did not allocate, and whereabouts keeps
returning errors. This becomes an endless loop with kubelet, which
holds references to such pods and periodically keeps calling all
CNIs and IPAMs to clean up.

With this fix, whereabouts logs such requests and returns
gracefully, removing the possibility of such loops.
@ashish-billore
Author

OK, so the DaemonSet deployment change for using hostNetwork is now moved out of this pull request:
https://github.com/dougbtv/whereabouts/pull/45
I'll fix this pull request to incorporate @crandles' comments.

@ashish-billore ashish-billore changed the title Make ds use hostNetwork by default Handle pod delete with non-existent IPs Jul 13, 2020
@ashish-billore
Author

@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion.
Thanks!

@crandles left a comment (Collaborator)

👍 LGTM, but curious when this scenario occurs

@ashish-billore
Author

ashish-billore commented Jul 22, 2020

+1 LGTM, but curious when this scenario occurs

Thanks @crandles for the review!
In my case, I used an IP range with whereabouts and the macvlan CNI, created a few pods, and then realized the range I used conflicted with IPs in another cluster. So I changed the IP range for whereabouts, and then the problem happened, with the following kubelet logs repeating endlessly.

As seen below, the Calico CNI and its IPAM were able to handle this situation gracefully with a warning. I felt whereabouts could take a similar approach without erroring out on non-existent IPs. This would prevent endless attempts by kubelet to clean up the pod sandbox.


Logs:

Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [WARNING][12388] k8s.go 450: WorkloadEndpoint does not exist in the datastore, moving forward with the clean up ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" WorkloadEndpoint="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] k8s.go 485: Cleaning up netns ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] k8s.go 492: Releasing IP address(es) ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] utils.go 168: Calico CNI releasing IP address ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.762 [INFO][12399] ipam_plugin.go 302: Releasing address using handleID ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.762 [INFO][12399] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276'
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [WARNING][12399] ipam_plugin.go 312: Asked to release address but it doesn't exist. Ignoring ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [INFO][12399] ipam_plugin.go 323: Releasing address using workloadID ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0"
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [INFO][12399] ipam.go 1172: Releasing all IPs with handle 'default.mynginx-5777785f55-jrpfs'
Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.781 [INFO][12388] k8s.go 498: Teardown processing complete. ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276"
Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.785678 11876 cni.go:385] Error deleting default_mynginx-5777785f55-jrpfs/4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 from network multus/multus-cni-network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276
Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.790173 11876 remote_runtime.go:128] StopPodSandbox "4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "mynginx-5777785f55-jrpfs_default" network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276
Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.790199 11876 kuberuntime_gc.go:170] Failed to stop sandbox "4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "mynginx-5777785f55-jrpfs_default" network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276
Jul 9 02:31:39 node2 kubelet: W0709 02:31:39.791959 11876 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e"
Jul 9 02:31:39 node2 kubelet: Whereabouts logging: cannot open /var/log/whereabouts/whereabouts.log2020-07-09T02:31:39Z [debug] DEL - IPAM configuration successfully read: {Name:acn-lb-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:172.168.0.0/24 RangeStart:172.168.0.2 RangeEnd:172.168.0.254 GatewayStr:172.168.0.1 EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LogFile:/var/log/whereabouts/whereabouts.log LogLevel:debug Gateway:172.168.0.1 Kubernetes:{KubeConfigPath:/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:}
Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] ContainerID: 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e
Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] IPManagement -- mode: 1 / host: / containerID: 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e
Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [error] Error deallocating IP: Did not find reserved IP for container 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e
Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [error] Error deallocating IP: Did not find reserved IP for container 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e
Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] DEL - IPAM configuration successfully read: {Name:macvlan-1 Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:12.10.0.0/16 RangeStart:12.10.0.2 RangeEnd:12.10.255.254 GatewayStr:12.10.0.1 EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LogFile:/tmp/whereabouts.log LogLevel:debug Gateway:12.10.0.1 Kubernetes:{KubeConfigPath:/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:}

@ashish-billore
Author

Actually, this is also the expected behavior for IPAM as mentioned in the CNI spec here:

Plugins should generally complete a DEL action without error even if some resources are missing. For example, an IPAM plugin should generally release an IP allocation and return success even if the container network namespace no longer exists, unless that network namespace is critical for IPAM management.

@ashish-billore
Author

@crandles @dougbtv please suggest if anything is needed towards closing this pull request.
Thanks!

@mkt-mohit

@ashish-billore @dougbtv, we are facing the same issue after a running cluster went down due to a power outage
(simulated by forcefully powering down the server).

Error

E1006 05:52:39.220542 5964 remote_runtime.go:128] StopPodSandbox "869c45444de0b540e1babd380ae6e642d6c446b01fc603adb16ee1115a475bc9" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "869c45444de0b540e1babd380ae6e642d6c446b01fc603adb16ee1115a475bc9": delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 869c45444de0b540e1babd380ae6e642d6c446b01fc603adb16ee1115a475bc9

Saw that the same error log was there, and @crandles suggested some code changes.

Let us know if this is checked in, or what the plan is to get it checked in.

Appreciated

@ashish-billore
Author

Saw that the same error log was there, and @crandles suggested some code changes.

Let us know if this is checked in, or what the plan is to get it checked in.

@mkt-mohit The changes suggested in @crandles' comments have been incorporated, and I see the PR has been approved since then.

We have been using whereabouts patched with the changes in this PR in our setup to handle such cases.
So, if you want to check whether it helps with your scenario, please feel free to use this patch; it should be simple to patch whereabouts and build it locally.

@mkt-mohit

Saw that the same error log was there, and @crandles suggested some code changes.
Let us know if this is checked in, or what the plan is to get it checked in.

@mkt-mohit The changes suggested in @crandles' comments have been incorporated, and I see the PR has been approved since then.

We have been using whereabouts patched with the changes in this PR in our setup to handle such cases.
So, if you want to check whether it helps with your scenario, please feel free to use this patch; it should be simple to patch whereabouts and build it locally.

I have taken the latest checked-in docker image, but am still getting the same error, where it is stuck in a delete loop.

Is there anything we can work on from here?
Suggestions appreciated.

@ashish-billore
Author

ashish-billore commented Oct 22, 2020

I have taken the latest checked-in docker image, but am still getting the same error, where it is stuck in a delete loop.

The latest docker image does not include this patch, as this PR has been reviewed and approved but not yet merged.

Is there anything we can work on from here?

You need to build your own docker image of whereabouts with this patch included to try the fix.
Follow these instructions to build a local docker image of whereabouts from source with this patch applied.

@dougbtv
Member

dougbtv commented Nov 17, 2020

Sorry this lagged behind! LGTM. Merging.

@dougbtv
Member

dougbtv commented Nov 17, 2020

Hrmmm, I also realize this conflicts with a recent change, so I need to rebase it. This solution is better than what's currently there; I'll have to return to it.

@ashish-billore
Author

Seems to be a problem with the CI/CD system:

$ test -z $(gofmt -l pkg/. cmd/.)

/home/travis/.travis/functions: line 109: test: too many arguments

The command "test -z $(gofmt -l pkg/. cmd/.)" failed and exited with 2 during .

Your build has been stopped.

@ashish-billore
Author

recheck

@dougbtv
Member

dougbtv commented Jun 23, 2022

Sorry this got lost. If it's still a concern, can you add a test so we can see if the issue still exists? Then we can rebase, thanks!
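
For whoever picks this up later, a rough sketch of what such a test could look like; handleDealloc and the copied IPNotFoundError type below are illustrative stand-ins so the sketch is self-contained, not the actual whereabouts DEL path:

    package main

    import (
        "errors"
        "fmt"
        "testing"
    )

    // IPNotFoundError is reproduced here so the sketch stands on its own; in the repo
    // it would live alongside the allocation code as suggested in the review comments.
    type IPNotFoundError struct {
        containerID string
    }

    func (e *IPNotFoundError) Error() string {
        return fmt.Sprintf("Did not find reserved IP for container %v", e.containerID)
    }

    // handleDealloc mimics the DEL-path decision proposed in this PR: swallow
    // "IP not found" errors, propagate everything else. Illustrative only, not the actual cmdDel.
    func handleDealloc(err error) error {
        if err == nil {
            return nil
        }
        notFound := &IPNotFoundError{}
        if errors.As(err, &notFound) {
            return nil // the real code would log this at debug level and return success
        }
        return fmt.Errorf("Error deallocating IP: %s", err)
    }

    func TestDelWithUnknownContainerSucceeds(t *testing.T) {
        err := handleDealloc(&IPNotFoundError{containerID: "does-not-exist"})
        if err != nil {
            t.Fatalf("expected DEL to succeed for an unreserved container, got: %v", err)
        }
    }

    func TestDelPropagatesOtherErrors(t *testing.T) {
        if err := handleDealloc(errors.New("datastore unreachable")); err == nil {
            t.Fatal("expected a genuine deallocation failure to be returned")
        }
    }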
