how to check and reclaim IPs? #149

Closed
hymgg opened this issue Sep 19, 2019 · 8 comments
Labels
support How? And why?

Comments

hymgg commented Sep 19, 2019

Hello,

The ClusterNetwork has a pool of 90+ IPs. The last pod that started is using IP 10.200.20.27, yet new pods now fail to get IPs with the message "all addresses are reserved".

How to check and reclaim IPs?

apiVersion: danm.k8s.io/v1
kind: ClusterNetwork
metadata:
  name: sriov-a
spec:
  NetworkID: sriov-a
  NetworkType: sriov
  Options:
    device_pool: "intel.com/sriov_net_A"
    container_prefix: x4nic1vf
    vlan: 64
    rt_tables: 250
    cidr: 10.200.20.0/24
    allocation_pool:
      start: 10.200.20.10
      end: 10.200.20.100

Warning FailedCreatePodSandBox 2m38s (x273 over 7m28s) kubelet, mtx-hw2-bld03 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7f73774820a81ff444a5508eb8535bb9780ea11effafe6f8e3b30f26b27ead52" network for pod "proc-s1e1-2": NetworkPlugin cni failed to set up pod "proc-s1e1-2_mtx-dev" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:IP address reservation failed for network:sriov-a with error:failed to allocate IP address for network:sriov-a with error:IPv4 address cannot be dynamically allocated, all addresses are reserved!

Thanks. -Jessica

Levovar commented Sep 20, 2019

Describe the network; Spec.Options.Alloc stores the current allocations.
Are you using the latest master, or the released 4.0 version?

Have you removed "UPDATE" from the webhook's configuration, as in https://github.com/nokia/danm/pull/145/files#diff-317645100e8d8e72d588b15867c0c7d5R48?
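
For reference, removing "UPDATE" amounts to trimming the operations list in the webhook's rules so the validator only triggers on CREATE. A hedged sketch of what that rules entry might look like (the configuration kind, webhook name, and resource list here are illustrative, drawn from this issue and the DANM manifests, and may differ in your deployment):

webhooks:
  - name: danm-netvalidation.nokia.k8s.io
    rules:
      - apiGroups: ["danm.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE"]   # "UPDATE" removed, as in the linked PR
        resources: ["danmnets", "clusternetworks", "tenantnetworks"]   # illustrative list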

hymgg commented Sep 20, 2019

Thank you.

How to read Alloc: gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE= ?

Images were built 6 weeks ago, so after 4.0, but not the latest.

I was just adding 5 pods: 2 went through, 3 failed to get IPs.
I will remove UPDATE from danm-netvalidation.nokia.k8s.io.

$ kubectl describe cn sriov-a
Name:         sriov-a
Namespace:
Labels:
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"danm.k8s.io/v1","kind":"ClusterNetwork","metadata":{"annotations":{},"name":"sriov-a"},"spec":{"NetworkID":"sriov-a","Netwo...
API Version:  danm.k8s.io/v1
Kind:         ClusterNetwork
Metadata:
  Creation Timestamp:  2019-08-13T00:22:35Z
  Generation:          123
  Resource Version:    34362184
  Self Link:           /apis/danm.k8s.io/v1/clusternetworks/sriov-a
  UID:                 26bb1ebd-b943-4894-8aaa-a35f6cefe379
Spec:
  Network ID:    sriov-a
  Network Type:  sriov
  Options:
    Alloc:  gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=
    allocation_pool:
      End:    10.200.20.100
      Start:  10.200.20.10
    Cidr:              10.200.20.0/24
    container_prefix:  x4nic1vf
    device_pool:       intel.com/sriov_net_A
    rt_tables:         250
    Vlan:              64
Events:
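
(For reference: the Alloc value appears to be a base64-encoded bit array covering the CIDR, one bit per address, with 1 meaning "reserved". That encoding is an assumption, not confirmed in this thread, but the values shown here line up with it. A minimal way to inspect it on a Linux box:

$ echo 'gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=' | base64 -d | xxd -b -c 4

Read MSB first, bit M of byte N maps to 10.200.20.(8*N+M): the leading 0x80 marks the .0 network address, and the unbroken run of 1s covers the whole allocation pool from .10 to .100, matching the "all addresses are reserved" error.)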

hymgg commented Sep 20, 2019

I removed all the pods that were using the sriov-a cluster network. It still shows
Alloc: gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=

How do I reset?

Levovar commented Sep 21, 2019

You need to first delete all the Pods, and then recreate the network.
We had this issue when the webhook was configured to handle UPDATEs. If you remove that, recreate the network, and the problem still persists, then please signal, because then we need to investigate further.
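
A hedged sketch of that sequence, assuming the pods attached to the network are removed first and the ClusterNetwork manifest from above is saved as sriov-a.yaml (an assumed file name):

$ kubectl delete pod <pods-using-sriov-a> -n mtx-dev   # remove every pod attached to the network first
$ kubectl delete clusternetwork sriov-a                # drop the network object and its Alloc bookkeeping
$ kubectl apply -f sriov-a.yaml                        # recreate it with a fresh Alloc

Recreating the ClusterNetwork resets Spec.Options.Alloc, since the allocation bitmap is stored on the network object itself.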

Levovar commented Sep 21, 2019

Also, update to at least this commit: #123

hymgg commented Sep 23, 2019

Thanks.

Deleted the pods, the sriov-a cn, and the webhook deployment.
Created the webhook without UPDATE in danm-netvalidation.nokia.k8s.io.
Created the sriov-a cn.
Tried applying 5 pods again; 1 out of 4 failed, somehow with "all addresses are reserved" already.

Gonna try the update tomorrow.

$ kubectl describe cn sriov-a
Name:         sriov-a
Namespace:
Labels:
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"danm.k8s.io/v1","kind":"ClusterNetwork","metadata":{"annotations":{},"name":"sriov-a"},"spec":{"NetworkID":"sriov-a","Netwo...
API Version:  danm.k8s.io/v1
Kind:         ClusterNetwork
Metadata:
  Creation Timestamp:  2019-09-23T06:06:47Z
  Generation:          92
  Resource Version:    35301974
  Self Link:           /apis/danm.k8s.io/v1/clusternetworks/sriov-a
  UID:                 08c7f32f-bfa2-46fb-a247-dbd10a0543c7
Spec:
  Network ID:    sriov-a
  Network Type:  sriov
  Options:
    Alloc:  gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=
    allocation_pool:
      End:    10.200.20.100
      Start:  10.200.20.10
    Cidr:              10.200.20.0/24
    container_prefix:  x4nic1vf
    device_pool:       intel.com/sriov_net_A
    rt_tables:         250
    Vlan:              64
Events:

Warning FailedCreatePodSandBox 3s (x4 over 7s) kubelet, mtx-huawei2-bld01 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "40cb82877c009a7aa879c3181b8d6ce8b34474be72ced58ee7ddc8a8e9c9d37e" network for pod "proc-s1e1-2": NetworkPlugin cni failed to set up pod "proc-s1e1-2_mtx-dev" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:IP address reservation failed for network:sriov-a with error:failed to allocate IP address for network:sriov-a with error:IPv4 address cannot be dynamically allocated, all addresses are reserved!

Levovar commented Sep 23, 2019

Yeah, the thing is that it can easily happen that you had some real issue first, because of which your SR-IOV VF creations were legitimately failing; but because of the bug I corrected in the linked review, the IP addresses allocated in quick succession were never freed.
So you observe "exhaustion", while the root cause is that your config was not good to begin with. Plus the bug :)

So please update, but if the problem persists, please send me the whole DANM log.

Levovar added the support label on Sep 26, 2019
hymgg commented Sep 27, 2019

You're right. I started over with the 09/26 master, still the same. Then I rebooted the master and worker nodes, hoping to clean up whatever might be dirty in the cluster, which helped. The 5 pods came up right away in ns A.

(After the reboot, when deleting the cn I got a message that pod X is still using the cn in ns Y. I tried to delete pod X in Y and got an error that the pod does not exist; deleting ns Y did it. Strange why pod X was remembered somewhere.)
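
One possible explanation (an assumption, not confirmed in this thread): DANM tracks each pod's interface and IP in a DanmEp custom resource, and a stale DanmEp left behind for pod X would keep it "remembered" even after the pod is gone. Assuming the danmeps.danm.k8s.io CRD is present, leftovers can be listed and removed by hand:

$ kubectl get danmeps --all-namespaces            # one DanmEp per DANM-managed pod interface
$ kubectl delete danmep <stale-endpoint> -n <ns>  # hypothetical cleanup of a leftover entry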

Luckily this is just a PoC environment ;o)

Thanks. -Jessica

hymgg closed this as completed on Sep 27, 2019