how to check and reclaim IPs? #149

Closed
hymgg opened this issue Sep 19, 2019 · 8 comments
Labels
support How? And why?

Comments

hymgg commented Sep 19, 2019

Hello,

The ClusterNetwork has a pool of 90+ IPs. The last pod that started is using IP 10.200.20.27, yet new pods now fail to get IPs with the message "all addresses are reserved".

How to check and reclaim IPs?

apiVersion: danm.k8s.io/v1
kind: ClusterNetwork
metadata:
  name: sriov-a
spec:
  NetworkID: sriov-a
  NetworkType: sriov
  Options:
    device_pool: "intel.com/sriov_net_A"
    container_prefix: x4nic1vf
    vlan: 64
    rt_tables: 250
    cidr: 10.200.20.0/24
    allocation_pool:
      start: 10.200.20.10
      end: 10.200.20.100

Warning FailedCreatePodSandBox 2m38s (x273 over 7m28s) kubelet, mtx-hw2-bld03 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7f73774820a81ff444a5508eb8535bb9780ea11effafe6f8e3b30f26b27ead52" network for pod "proc-s1e1-2": NetworkPlugin cni failed to set up pod "proc-s1e1-2_mtx-dev" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:IP address reservation failed for network:sriov-a with error:failed to allocate IP address for network:sriov-a with error:IPv4 address cannot be dynamically allocated, all addresses are reserved!

Thanks. -Jessica

Levovar commented Sep 20, 2019

Describe the network; Spec.Options.Alloc stores the current allocations.
Are you using the latest master, or the released 4.0 version?

Have you removed "UPDATE" from the webhook's configuration, as in https://github.com/nokia/danm/pull/145/files#diff-317645100e8d8e72d588b15867c0c7d5R48?
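
For reference, removing "UPDATE" amounts to trimming the operations list in the webhook's rules so the validator only triggers on CREATE. A hedged sketch of what that rules entry might look like (the configuration kind, webhook name, and resource list here are illustrative, drawn from this issue and the DANM manifests, and may differ in your deployment):

webhooks:
  - name: danm-netvalidation.nokia.k8s.io
    rules:
      - apiGroups: ["danm.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE"]   # "UPDATE" removed, as in the linked PR
        resources: ["danmnets", "clusternetworks", "tenantnetworks"]   # illustrative list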

hymgg commented Sep 20, 2019

Thank you.

How to read Alloc: gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE= ?

Images were built 6 weeks ago, so after 4.0, but not the latest.

I was just adding 5 pods: 2 went through, 3 failed to get IPs.
I will remove UPDATE from danm-netvalidation.nokia.k8s.io.

$ kubectl describe cn sriov-a
Name:         sriov-a
Namespace:
Labels:
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"danm.k8s.io/v1","kind":"ClusterNetwork","metadata":{"annotations":{},"name":"sriov-a"},"spec":{"NetworkID":"sriov-a","Netwo...
API Version:  danm.k8s.io/v1
Kind:         ClusterNetwork
Metadata:
  Creation Timestamp:  2019-08-13T00:22:35Z
  Generation:          123
  Resource Version:    34362184
  Self Link:           /apis/danm.k8s.io/v1/clusternetworks/sriov-a
  UID:                 26bb1ebd-b943-4894-8aaa-a35f6cefe379
Spec:
  Network ID:    sriov-a
  Network Type:  sriov
  Options:
    Alloc:  gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=
    allocation_pool:
      End:    10.200.20.100
      Start:  10.200.20.10
    Cidr:              10.200.20.0/24
    container_prefix:  x4nic1vf
    device_pool:       intel.com/sriov_net_A
    rt_tables:         250
    Vlan:              64
Events:
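
(For reference: the Alloc value appears to be a base64-encoded bit array covering the CIDR, one bit per address, with 1 meaning "reserved". That encoding is an assumption, not confirmed in this thread, but the values shown here line up with it. A minimal way to inspect it on a Linux box:

$ echo 'gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=' | base64 -d | xxd -b -c 4

Read MSB first, bit M of byte N maps to 10.200.20.(8*N+M): the leading 0x80 marks the .0 network address, and the unbroken run of 1s covers the whole allocation pool from .10 to .100, matching the "all addresses are reserved" error.)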

hymgg commented Sep 20, 2019

I removed all the pods that were using the sriov-a cluster network. It still shows
Alloc: gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=

How do I reset?

Levovar commented Sep 21, 2019

You need to first delete all the Pods, and then recreate the network.
We had this issue when the webhook was configured to handle UPDATEs. If you remove that, recreate the network, and the problem still persists, then please signal, because then we need to investigate further.
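
A hedged sketch of that sequence, assuming the pods attached to the network are removed first and the ClusterNetwork manifest from above is saved as sriov-a.yaml (an assumed file name):

$ kubectl delete pod <pods-using-sriov-a> -n mtx-dev   # remove every pod attached to the network first
$ kubectl delete clusternetwork sriov-a                # drop the network object and its Alloc bookkeeping
$ kubectl apply -f sriov-a.yaml                        # recreate it with a fresh Alloc

Recreating the ClusterNetwork resets Spec.Options.Alloc, since the allocation bitmap is stored on the network object itself.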

Levovar commented Sep 21, 2019

Also, update to at least this commit: #123

hymgg commented Sep 23, 2019

Thanks.

Deleted the pods, the sriov-a cn, and the webhook deployment.
Created the webhook without UPDATE in danm-netvalidation.nokia.k8s.io.
Created the sriov-a cn.
Tried applying 5 pods again; 1 out of 4 failed, somehow with "all addresses are reserved" already.

Gonna try the update tomorrow.

$ kubectl describe cn sriov-a
Name:         sriov-a
Namespace:
Labels:
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"danm.k8s.io/v1","kind":"ClusterNetwork","metadata":{"annotations":{},"name":"sriov-a"},"spec":{"NetworkID":"sriov-a","Netwo...
API Version:  danm.k8s.io/v1
Kind:         ClusterNetwork
Metadata:
  Creation Timestamp:  2019-09-23T06:06:47Z
  Generation:          92
  Resource Version:    35301974
  Self Link:           /apis/danm.k8s.io/v1/clusternetworks/sriov-a
  UID:                 08c7f32f-bfa2-46fb-a247-dbd10a0543c7
Spec:
  Network ID:    sriov-a
  Network Type:  sriov
  Options:
    Alloc:  gD//////////////+AAAAAAAAAAAAAAAAAAAAAAAAAE=
    allocation_pool:
      End:    10.200.20.100
      Start:  10.200.20.10
    Cidr:              10.200.20.0/24
    container_prefix:  x4nic1vf
    device_pool:       intel.com/sriov_net_A
    rt_tables:         250
    Vlan:              64
Events:

Warning FailedCreatePodSandBox 3s (x4 over 7s) kubelet, mtx-huawei2-bld01 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "40cb82877c009a7aa879c3181b8d6ce8b34474be72ced58ee7ddc8a8e9c9d37e" network for pod "proc-s1e1-2": NetworkPlugin cni failed to set up pod "proc-s1e1-2_mtx-dev" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:IP address reservation failed for network:sriov-a with error:failed to allocate IP address for network:sriov-a with error:IPv4 address cannot be dynamically allocated, all addresses are reserved!

Levovar commented Sep 23, 2019

Yeah, the thing is that it can easily happen that you had some real issue first, because of which your SR-IOV VF creations were legitimately failing; but because of the bug I corrected in the linked review, the IP addresses allocated in quick succession were never freed.
So you observe "exhaustion", while the root cause is that your config was not good to begin with. Plus the bug :)

So please update, but if the problem persists, please send me the whole DANM log.

Levovar added the support label on Sep 26, 2019
hymgg commented Sep 27, 2019

You're right. I started over with the 09/26 master, still the same. Then I rebooted the master and worker nodes, hoping to clean up whatever might be dirty in the cluster, which helped. The 5 pods came up right away in ns A.

(After the reboot, when deleting the cn I got a message that pod X is still using the cn in ns Y. I tried to delete pod X in Y and got an error that the pod does not exist; deleting ns Y did it. Strange why pod X was remembered somewhere.)
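
One possible explanation (an assumption, not confirmed in this thread): DANM tracks each pod's interface and IP in a DanmEp custom resource, and a stale DanmEp left behind for pod X would keep it "remembered" even after the pod is gone. Assuming the danmeps.danm.k8s.io CRD is present, leftovers can be listed and removed by hand:

$ kubectl get danmeps --all-namespaces            # one DanmEp per DANM-managed pod interface
$ kubectl delete danmep <stale-endpoint> -n <ns>  # hypothetical cleanup of a leftover entry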

Luckily this is just a PoC environment ;o)

Thanks. -Jessica

hymgg closed this as completed on Sep 27, 2019