Duplicate IP addresses at scale: Possible read/write locking problem? #110
/cc @JanScheurich
The Kubernetes CRD backend does not use locks, but it does use what was, at the time, a recommended alternative (see `whereabouts/pkg/storage/kubernetes.go`, lines 182 to 199 at `5e8cafd`).
The intent here was to use JSON patch for the update operation, with added tests to:
Perhaps something is going wrong with the patch, or with how the tests were assumed to work. How many IPs are in use when this occurs? Could we be hitting our upper bound? I know that we have one, but I have not looked into it before. I am a few versions behind, and only use a few IPs, so I may not be hitting this edge case.
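For readers who don't follow the link above, the pattern in question is roughly the following -- a minimal sketch only, not the actual `kubernetes.go` code; the `/spec/allocations` path and the allocation values are illustrative assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonPatchOperation mirrors a single RFC 6902 operation.
type jsonPatchOperation struct {
	Operation string      `json:"op"`
	Path      string      `json:"path"`
	Value     interface{} `json:"value,omitempty"`
}

func main() {
	// Hypothetical pool state: the allocations as read earlier in the
	// read-modify-patch cycle, plus the new allocation we want to add.
	originalAllocations := map[string]string{"1": "pod-a"}
	updatedAllocations := map[string]string{"1": "pod-a", "2": "pod-b"}

	// The "test" op asks the API server to verify that the allocations we
	// read are still the allocations on the stored object; if another
	// writer got there first, the whole patch is rejected and the caller
	// must re-read and retry.
	patch := []jsonPatchOperation{
		{Operation: "test", Path: "/spec/allocations", Value: originalAllocations},
		{Operation: "replace", Path: "/spec/allocations", Value: updatedAllocations},
	}

	body, _ := json.Marshal(patch)
	fmt.Println(string(body))
	// The marshalled bytes would then be sent as a JSON patch request
	// against the IPPool custom resource.
}
```

The key point is that a failed `test` op causes the API server to reject the whole patch, which is what turns the patch into an optimistic-concurrency primitive.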
We see this issue massively (~5 duplicates per IP address) already with a 100-pod deployment in a cluster of 80 nodes when using a huge /8 IPv4 range, so we can rule out a lack of available IPs. In the kubelet logs we observed a huge number of whereabouts retries due to failed patch operations. Apparently the concurrency protection of the patch works at times but does not fully prevent duplicate assignments. I can't rule out that we hit the upper limit of retries, but if I look at the code, that should lead to an IP address assignment failure, which we don't see. In non-trivial cluster and application deployments, I think the locking approach of the etcd datastore is much more suitable. The lock-less K8s approach, even if it were safe, leads to a lot of blind load on the K8s API due to systematically failing patch attempts.
@crandles I hope this still makes the write succeed just by overwriting an existing IP address mapped to another pod in
@pperiyasamy: I have not found any specific documentation about the patch operation of the K8s API that describes the meaning of tests within a patch and how the K8s API reacts to failing tests. My naive interpretation of what should happen inside the K8s API server is:
With the specific test on the original resource version, this should only let one of a set of conflicting concurrent patches succeed (assuming, of course, that the resource version is bumped with every update). The remaining patches should fail and trigger a complete retry (read, modify, patch). Conceptually this is similar to a spin lock using atomic compare-and-swap instructions, except that here all of the involved operations are heavy K8s API calls. So this doesn't seem like a good approach in scenarios where many transactions happen in parallel, as is systematically the case when K8s ReplicaSets, StatefulSets, or DaemonSets with many pod replicas are deployed or deleted.
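To make the compare-and-swap analogy concrete, here is a sketch of the read/modify/patch loop against a hypothetical pool client; none of these types exist in whereabouts, they only stand in for the API round trips being described:

```go
package example

import (
	"errors"
	"fmt"
)

// pool is a stand-in for the IPPool custom resource.
type pool struct {
	ResourceVersion string
	Allocations     map[string]string // IP -> container ID
}

// poolClient is a hypothetical client. Patch is assumed to fail with
// errConflict when the tested resource version no longer matches the stored
// object, mimicking how the API server rejects a patch whose "test" op fails.
type poolClient interface {
	Get() (pool, error)
	Patch(testedResourceVersion string, allocations map[string]string) error
}

var errConflict = errors.New("patch test failed: resource changed concurrently")

// allocate is the spin-lock-like loop described above: read, modify, patch,
// and start over if another writer won the race. Every iteration costs full
// API round trips, which is the scalability concern.
func allocate(c poolClient, containerID, ip string, maxRetries int) error {
	for attempt := 0; attempt < maxRetries; attempt++ {
		p, err := c.Get()
		if err != nil {
			return err
		}
		if p.Allocations == nil {
			p.Allocations = map[string]string{}
		}
		p.Allocations[ip] = containerID
		err = c.Patch(p.ResourceVersion, p.Allocations)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
		// Conflict: someone else updated the pool between Get and Patch.
	}
	return fmt.Errorf("gave up after %d conflicting patch attempts", maxRetries)
}
```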
re: upper limit, I meant the limit on the number of IPs we can allocate. This is going to depend on the maximum size of CRD resources. I'm not sure what the maximum number of addresses we can store in one pool is. (Doing some rough math, I suspect we can fit around 22k IPs in a single pool.)
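For what it's worth, one back-of-the-envelope way to land near 22k: etcd's default object size limit is roughly 1.5 MiB, and if each allocation entry in the pool resource serializes to on the order of 70 bytes of JSON, then 1,572,864 B / 70 B ≈ 22,000 entries. Both figures are assumptions for illustration, not values taken from the whereabouts CRD schema.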
Correct, but locking doesn't prevent that either. I just looked at the lock implementation in etcd, and I think we could do something similar by adding an "owner" (or similar) annotation to lock the resource, watching and waiting if it is already present, and releasing it upon completion.
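A rough sketch of what such an owner-annotation lock could look like, written against a hypothetical client interface; the annotation key, the polling interval, and the client methods are all assumptions, not an existing whereabouts API:

```go
package example

import (
	"context"
	"errors"
	"time"
)

const ownerAnnotation = "whereabouts.cni.cncf.io/owner" // hypothetical key

// annotatedPool is a stand-in for the IPPool resource's metadata.
type annotatedPool struct {
	Annotations map[string]string
}

// lockClient is a hypothetical client around the pool custom resource.
type lockClient interface {
	Get(ctx context.Context) (annotatedPool, error)
	// SetAnnotation writes the annotation and fails if another writer
	// updated the object concurrently (optimistic concurrency again).
	SetAnnotation(ctx context.Context, key, value string) error
	RemoveAnnotation(ctx context.Context, key string) error
}

// acquire tries to claim the pool for this caller by writing an owner
// annotation; if another owner is present it waits and polls until the
// annotation is released or the context expires.
func acquire(ctx context.Context, c lockClient, me string) error {
	for {
		p, err := c.Get(ctx)
		if err != nil {
			return err
		}
		if owner, held := p.Annotations[ownerAnnotation]; !held || owner == me {
			// Free (or already ours): try to claim it. A concurrent
			// claim makes SetAnnotation fail and we loop again.
			if err := c.SetAnnotation(ctx, ownerAnnotation, me); err == nil {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for pool lock")
		case <-time.After(500 * time.Millisecond): // poll; a watch would be better
		}
	}
}

// release gives the lock back once allocation is done.
func release(ctx context.Context, c lockClient) error {
	return c.RemoveAnnotation(ctx, ownerAnnotation)
}
```

Using a watch instead of polling, and adding a lease or expiry so a crashed owner cannot hold the lock forever, would be natural refinements.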
Thanks for the collaborative exchange, and Chris, I really like the idea of adding a locking mechanism with an "owner" annotation -- I think that's probably pretty feasible. Tomo also mentioned the possibility of using some kind of leader-election mechanism; he had linked:
@dougbtv I too think so; the locking mechanism with an "owner" annotation looks like a simple approach, as it can just follow Lock and Unlock semantics, which would keep the Store APIs intact. I guess there still won't be any race conditions with this approach.
Just an FYI that this problem has been under active development. Essentially, what it comes down to is that Peri and Tomo have discussed designs for moving forward at length, and we have proposed a number of fixes which have been locally tested for general usage but are awaiting some tests at scale -- this is non-trivial to replicate locally. Tomo has proposed a leader election pull request at #113. Peri has two pull requests available at #114 & #115, which use a "lock" CRD to denote that a particular instance has acquired the lock via CRD. Tomo's main consideration is that using the k8s libraries to solve this issue ensures long-term maintainability. Peri has proposed a change to the way we use
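For context, the client-go leader election machinery referenced here generally looks like the sketch below. This is not the #113 patch itself; the lease name, namespace, and timing values are illustrative assumptions:

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection blocks until this instance wins the lease, runs
// allocate as long as it holds it, and renews the lease in the background.
func runWithLeaderElection(ctx context.Context, clientset kubernetes.Interface, id string, allocate func(ctx context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "whereabouts", // illustrative lease name
			Namespace: "kube-system", // illustrative namespace
		},
		Client: clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: id, // e.g. node name + container ID
		},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the current leader touches the IP pool.
				allocate(ctx)
			},
			OnStoppedLeading: func() {
				// Lost the lease; stop mutating the pool.
			},
		},
	})
}
```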
During Peri's and my testing we found that the actual root cause of the duplicate IPs was incorrect handling of the K8s client context timeout. Instead of an error, the current IP was returned to the CNI on context timeout, even though the IP address had not been successfully marked as allocated in the IP pool (a minimal sketch of this failure mode appears after this comment). With corrected context timeout handling, both the original Patch API and the simplified Update API worked correctly without causing duplicate IPs. We then continued testing to compare the behavior and performance of the different proposed solutions:
As a test case we created and deleted a K8s deployment with 500 pod replicas on a bare-metal cluster of ~70 worker nodes. Each pod had a single secondary OVS CNI interface with whereabouts IPAM. Without any whereabouts patch this always resulted in a significant number of duplicate IPs. A major consequence of the corrected context timeout handling is that the CNI ADD or DEL commands now fail, triggering kubelet to retry the pod sandbox creation or deletion. We observed that kubelet apparently retries sandbox creation indefinitely, but sandbox deletion only once. For a deployment of 500 pods this can give rise to thousands of FailedCreatePodSandBox pod warnings with reason "context deadline exceeded". Eventually the deployment succeeds, but it takes quite long and the event noise doesn't look good. The times it took until all 500 pods were in the Running state were:
In contrast to solutions 1-3, the leader election solution took 7-8 times as long but produced only 34 "context deadline exceeded" warning events and kubelet retries. Worse than the warning noise during pod creation is that solutions 1-3 all leave approximately 450 stale IPs in the pool when the deployment of 500 pods is deleted. The root cause is again the context timeouts, together with the fact that kubelet retries deletion of the pod sandbox only once and then deletes the pod regardless of whether the CNI DEL was successful. The leader election patch did not leave stale IPs or "context deadline exceeded" warnings, but took on the order of 12 minutes to delete all 500 pods.

We believe the large number of context timeouts with solutions 1-3 is a consequence of the many quasi-parallel K8s API operations triggered by the independent but highly synchronized CNI ADD/DEL operations. As each CNI call retries its failing K8s API operation many times, the API rate limiter kicks in and slows down the requests even more. We have tried several back-off schemes in whereabouts' K8s storage backend, but didn't manage to improve this behavior significantly.

All in all, the tests have confirmed our suspicion that the K8s API server, with its optimistic locking and client retry paradigm, is a particularly bad choice of storage backend for the whereabouts IP pool use case. The etcd backend with its blocking lock API is much more efficient and suitable. The leader election approach appears to work more correctly and with less noise, but its performance still seems unacceptably slow. We haven't investigated that further yet. To rely on the K8s API server as storage backend, we'd probably need to redesign the whereabouts IPAM along the lines of Calico IPAM: assign IP blocks to individual nodes through the K8s API server and manage the IP blocks locally on each node without further use of the API server.
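As an illustration of the context-timeout failure mode described at the top of this comment, here is a simplified sketch (not the actual whereabouts code) contrasting the buggy and corrected shapes of the allocation path:

```go
package example

import (
	"context"
	"fmt"
	"net"
)

// reserve is a stand-in for marking the IP as allocated in the pool via the
// K8s API; it is assumed to respect ctx and return ctx.Err() on timeout.
type reserve func(ctx context.Context, ip net.IP) error

// Buggy shape: the context error is swallowed, so the CNI is handed an IP
// that was never recorded in the pool. A later allocator can hand out the
// same address, which is exactly the duplicate-IP symptom.
func allocateBuggy(ctx context.Context, ip net.IP, r reserve) (net.IP, error) {
	if err := r(ctx, ip); err != nil {
		// Oops: logging and continuing anyway.
		fmt.Println("reserve failed:", err)
	}
	return ip, nil
}

// Corrected shape: any error, including context.DeadlineExceeded, fails the
// CNI ADD so kubelet retries instead of running a pod with an unrecorded IP.
func allocateFixed(ctx context.Context, ip net.IP, r reserve) (net.IP, error) {
	if err := r(ctx, ip); err != nil {
		return nil, fmt.Errorf("failed to reserve %s in the pool: %w", ip, err)
	}
	return ip, nil
}
```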
We have tried #127 on a 100-node setup with deployment/undeployment of 500 pods and a 2500 ms backoff configuration. It solved the problem of both duplicate IP assignment and stale IPs. The time taken was as follows:
@dougbtv Has this issue been fixed in the latest whereabouts releases, 0.5 or 0.5.1?
Just a note that 0.5.0 and 0.5.1 should have the race condition fixes; I recommend 0.5.1 as it also has some bug fixes for IP reconciliation. https://github.com/k8snetworkplumbingwg/whereabouts/releases/tag/v0.5.1
I have the same IP conflict issue in my k8s environment using whereabouts v0.6.3. To try to fix it, I read the whereabouts code and found that whereabouts has several design problems:
So I wrote my own etcd backend, claude ipam, which so far works well in my k8s environment.
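For comparison with the optimistic-concurrency approaches above, the blocking-lock style that the etcd backend relies on can be expressed with etcd's `concurrency` package along these lines. This is a sketch, not the whereabouts etcd backend or the commenter's project; the lock key and timeouts are assumptions:

```go
package example

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// withPoolLock takes a distributed mutex on the pool key, runs fn, and then
// releases the lock. Other allocators block in Lock() instead of spinning
// against an optimistic-concurrency API.
func withPoolLock(ctx context.Context, endpoints []string, fn func(ctx context.Context) error) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	session, err := concurrency.NewSession(cli)
	if err != nil {
		return err
	}
	defer session.Close()

	mutex := concurrency.NewMutex(session, "/whereabouts/pool-lock") // illustrative key
	if err := mutex.Lock(ctx); err != nil {
		return err
	}
	defer mutex.Unlock(ctx)

	return fn(ctx)
}
```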
@yowenter, please try version v0.8.0; we fixed some issues regarding IP conflicts.
@pperiyasamy has reported that they're seeing duplicate IP addresses on a large k8s cluster, using the k8s backend.
He's pointed out that a lock may not be acquired like it is in the etcd implementation (the same problem does not appear when using the etcd backend).
https://github.com/k8snetworkplumbingwg/whereabouts/blob/master/pkg/storage/storage.go#L82-L108
Does this need to have a further locking mechanism?
How do we test for the duplicate IP problem with this scaled-up concurrency?
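One way to probe the second question without a full 80-node cluster is a concurrency test against the storage backend. The sketch below assumes an `allocate` entry point with roughly this shape; it is not an existing whereabouts test:

```go
package example

import (
	"fmt"
	"sync"
	"testing"
)

// allocateFn is whatever entry point hands out the next free IP; in a real
// test this would be wired to the storage backend under test.
type allocateFn func(containerID string) (string, error)

// testNoDuplicateIPsUnderConcurrency is a helper that hammers the allocator
// from many goroutines, mimicking many nodes running CNI ADD at the same
// time, then checks that no IP was handed out twice.
func testNoDuplicateIPsUnderConcurrency(t *testing.T, allocate allocateFn, workers int) {
	var mu sync.Mutex
	seen := make(map[string]string) // IP -> container ID

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			containerID := fmt.Sprintf("container-%d", id)
			ip, err := allocate(containerID)
			if err != nil {
				t.Errorf("allocation failed for %s: %v", containerID, err)
				return
			}
			mu.Lock()
			defer mu.Unlock()
			if other, dup := seen[ip]; dup {
				t.Errorf("duplicate IP %s handed to %s and %s", ip, containerID, other)
			}
			seen[ip] = containerID
		}(i)
	}
	wg.Wait()
}
```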