VPC Resource Controller deleting branch ENIs erroneously #412
Comments
Hey @GnatorX, can you share the cluster ARN to [email protected]? Also, if you have the pod name or ENI ID of the branch ENI that was deleted, please share those so I can look into the logs and investigate this further.
This isn't on EKS, so we don't really have an ARN. However, looking at the code, I no longer know for sure why the pod wasn't found, but I do believe the API server being slow made this issue show up. I am not familiar enough with how the DeltaFIFO and optimizedListwatch work. I still believe there was an inconsistency in the cache when this occurred and that self-healing should be done, but I don't know the full mechanism of why.
Oh right, are you able to share the controller logs?
I created a support ticket with the information.
The nodes are added to be managed by the node manager, which is called from the node reconciler (see comment). I tried to reproduce the issue with ~9k pods and can confirm that node reconcile does not happen until the pod datastore has completed its sync.
I see there were 4k nodes on this cluster. What is the scale of pods using security groups for pods on the cluster? Are all nodes enabled for this feature?
I believe you are correct; looking at the logs, the pod datastore is synced prior to the node controller starting. I do believe something is wrong, but I can't tell what. From what I can tell, VPC RC did not find pods on nodes that had existed for a while. It is worth noting that during the incident we were scaling up from 1.5k nodes to 3k nodes; our k8s API server is under-provisioned, so it saw high CPU and memory usage, and some VPC RCs were restarted. Let me see if I can get the full log during that time.
However, even without fully finding the root cause, I am wondering whether it makes sense for VPC RC to perform destructive actions based purely on a cached state of the world. Specifically, it's likely fine to identify candidates for a given action via the cache, but VPC RC should verify directly with the API server prior to taking the action. In addition, when VPC RC finds a pod whose branch ENI is now missing, shouldn't it re-attempt attaching a branch ENI? When a pod is alive without a branch ENI in this way, it still thinks it is running fine; I believe kubelet considers it healthy because the local health check still works, but all incoming calls to the pod are broken because the pod's IP is now either used elsewhere or unused.
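(For illustration only: a minimal sketch of the "verify with the API server before acting" idea, using client-go directly. The function and the injected ENI-deletion callback are hypothetical and not the controller's actual code.)

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteIfPodGone double-checks with the API server (not the local cache)
// that a pod is really gone before invoking the caller-supplied branch ENI
// deletion. Both this function and deleteENI are illustrative.
func deleteIfPodGone(ctx context.Context, client kubernetes.Interface,
	namespace, name, eniID string, deleteENI func(context.Context, string) error) error {
	_, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err == nil {
		return nil // the pod still exists per the API server; keep its branch ENI
	}
	if !apierrors.IsNotFound(err) {
		return err // transient API error; better to retry later than delete wrongly
	}
	return deleteENI(ctx, eniID)
}
```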
We have previously considered querying the API server directly to list pods and bypass the cache. However, at large scale, list calls are quite expensive on the API server, etcd, and the client, and they would likely cause API throttling under API Priority and Fairness.
Was there a controller restart during the scale operations? The local cache will be rebuilt on controller start by comparing the list of running pods on the node with the branch ENIs from the EC2 API.
Is this the same ENI? In the logs shared, I do not see this log for any of the ENIs. Can you share the controller logs from your deployment at the time of the issue?
Not a list call, but during branch ENI deletion, rather than relying on the cache, VPC RC can call the
Yes, I also updated the ticket with more logs. I was running version 1.4.6 of the controller.
I notice this:
Is this the issue? Are we skipping pods because we keep referring to the same address, whose value changes every iteration? This means that we could be saving only the last pod's information for every "page" of the list pods call.
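(For reference, a standalone sketch of the loop-variable aliasing pattern being described here; the types and names are made up, and the behavior shown applies to Go versions before 1.22.)

```go
package main

import "fmt"

type pod struct{ name string }

func main() {
	page := []pod{{"a"}, {"b"}, {"c"}}

	var saved []*pod
	for _, p := range page {
		// Before Go 1.22, p is a single variable reused for every iteration,
		// so each appended pointer refers to the same address, whose value is
		// whatever the last iteration wrote.
		saved = append(saved, &p)
	}

	for _, sp := range saved {
		fmt.Println(sp.name) // Go < 1.22: prints "c" three times; Go 1.22+: a, b, c
	}
}
```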
That's a good point @GnatorX, can you upgrade to v1.4.7, where the issue is fixed? I understand this is difficult to reproduce, but using the latest version is recommended. I'll need to dig into the code a bit to understand the consequences of this during high API server load; I will post an update this week.
Updated the AWS ticket with what we believe is the issue.
This shouldn't be an issue even with golang < 1.22. For older versions of golang, the
I don't think
Sorry, I updated the issue with the root cause we found. Node controller starting before cache sync was incorrect.
You can test this by forcing the page limit to 1 and attempting to start VPC RC against a large cluster. If you didn't change the etcd compaction interval (the default is 5 minutes), you might hit it. You could also test by changing the compaction interval to 1 minute or something smaller.
The issue is because of the optimized list watcher in the custom controller. First, let's establish that the lowest-level list call is going to use the listFunc. I am going to skip most of the controller-runtime code because it's hard to go through all that detail, but eventually we call into the reflector's Run on startup, which runs ListAndWatch, which falls back to list. client-go then wraps the listFunc in SimplePageFunc and starts listing with ListWithAlloc, which calls into the pager's List. Now let's say we hit a 410 (ResourceExpired): client-go falls back to a full list against the API server at a resource version. This is where the bug occurs, because PageFn (which is the listFunc) will override the limit, BUT because this is a fallback, the pager immediately returns the paginated result: https://github.com/kubernetes/client-go/blob/master/tools/pager/pager.go#L124. This results in partial results in the cache, which matches what we saw.
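(To make the sequence above concrete, here is a simplified sketch, not the controller's actual list watcher, of how a page function that unconditionally re-applies its own Limit interacts with client-go's pager: on the 410 fallback the pager sets Limit to 0 to request everything at once, the override turns that back into a single page, and the pager returns that page as if it were the complete list.)

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/pager"
)

// listAllPods mimics a list watcher whose page function forces its own page size.
func listAllPods(ctx context.Context, client kubernetes.Interface) (runtime.Object, error) {
	p := pager.New(func(ctx context.Context, opts metav1.ListOptions) (runtime.Object, error) {
		// The problematic override: Limit is forced on every call, including the
		// 410 fallback where the pager sets Limit = 0 to ask for everything in
		// one request. The fallback therefore returns just one page, and the
		// pager hands that partial result back as the "full" list.
		opts.Limit = 100
		return client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
	})
	obj, _, err := p.List(ctx, metav1.ListOptions{})
	return obj, err
}
```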
Thanks, I'm looking into this and will reproduce with the steps you shared above. IIUC, the issue is that we should not be returning the partial list if the error is not nil, specifically when we get a 410?
The issue is actually
What happens during a 410 is that client-go has a fallback that performs a full list call against the API server, but because of the optimized list watcher, it overrides the limit, causing the fallback to only list up to the limit you set. A 410 occurs when etcd compaction removes the resource version that you started the list call from: https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses.
Let me know if it would be better to jump on a call to go over this.
Sorry for the late response, I haven't had a chance to take a look at this yet. Did you encounter the issue again?
Ya, we had it reoccur. One way to test is to change the optimizedlistwatcher to sleep for 5+ minutes between pages (assuming your etcd compaction interval is the default). This should trigger the 410 from the API server. Then you can inspect your cache by listing all pods in the cache to see whether it is only partially populated.
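(A sketch of that reproduction idea, assuming you can swap in your own page function and that the etcd compaction interval is the default 5 minutes; the names here are illustrative.)

```go
package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
)

// slowPageFunc returns a paged list function that sleeps between pages, so
// that with the default 5-minute etcd compaction interval the continue
// token's resource version is compacted away and the API server answers the
// next page request with 410 Gone.
func slowPageFunc(client kubernetes.Interface) func(context.Context, metav1.ListOptions) (runtime.Object, error) {
	return func(ctx context.Context, opts metav1.ListOptions) (runtime.Object, error) {
		if opts.Continue != "" {
			time.Sleep(6 * time.Minute) // exceed the compaction interval
		}
		opts.Limit = 100
		return client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
	}
}
```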
Hi @GnatorX, sorry for the delay in getting back on this.
Did you also have to update the page size to reproduce this? What is the pod/node scale on your cluster? Is it always reproducible? Can you also share the 410 error you are observing?
If this is a test cluster, it might be hard to reproduce. One thing you might have to do is make writes happen on etcd. I would add the sleep and, on the side, create new pods on the cluster so the resource version gets increased, since etcd compaction compacts all versions prior to the latest. I don't think changing the page size here makes a difference.
@GnatorX, thank you for your valuable insights in reproducing this bug. I've successfully replicated the issue and submitted a PR to address it. Given your deep understanding of the problem, I'd appreciate your review of the PR.
For sure, let me take a look.
Most probably we're hitting the same issue for pods using the Security Groups for Pods feature.
@mwos-sl thanks for reporting a similar case. We are working on the PR to close out this issue.
We can close this issue against #452 |
Describe the Bug:
Observed Behavior:
And then at 9:45 PM
During this time our control plane had higher than normal load, causing slow responses.
Diving deep into the code, what happened is that InitTrunk was run, where it calls GetBranchNetworkInterface and then compares the result to all the pods it found on the node:
https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/trunk/trunk.go#L271
If it finds any associated branch ENIs without pods, it deletes them.
The pod lookup is backed by a cached client: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/main.go#L292
The issue here is that during cache hydration (paginated list calls), the API server returned a 410. This occurred because the list calls took longer than the etcd compaction interval.
This stops the list call, but controller-runtime moves forward thinking the cache is fully hydrated. This meant VPC RC had a partial cache of pods, which caused it to delete branch ENIs for the pods it didn't find.
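(Restating the failure mode as a simplified sketch; the types and field names below are hypothetical and not the actual trunk.go code.)

```go
package sketch

// branchENI is a simplified model of a branch ENI and its pod association.
type branchENI struct {
	ID     string
	PodUID string // hypothetical link between the ENI and its pod
}

// orphanedENIs returns the ENIs that would be selected for deletion. If
// cachedPodUIDs was built from a partial list (e.g. a 410 cut pagination
// short), running pods are missing from it and their ENIs are wrongly
// treated as orphaned.
func orphanedENIs(enis []branchENI, cachedPodUIDs map[string]bool) []string {
	var toDelete []string
	for _, eni := range enis {
		if !cachedPodUIDs[eni.PodUID] {
			toDelete = append(toDelete, eni.ID)
		}
	}
	return toDelete
}
```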
Expected Behavior:
This poses 2 major issues.
amazon-vpc-resource-controller-k8s/pkg/provider/branch/trunk/trunk.go
Line 257 in 8fe171a
Lastly, if pods are found in this state, why isn't VPC RC attempting to re-associate a new branch ENI instead of leaving the pod running in a broken state?
How to reproduce it (as minimally and precisely as possible):
This is likely difficult to reproduce and requires a lot of pressure on the API server to show. We were load testing scaling from 0 to around 4k nodes with VPC CNI and VPC RC running, on a 5-node control plane of m5.8xlarge instances, where we saw 100% CPU and memory usage on k8s 1.23.
Additional Context:
Environment:
Kubernetes version (use kubectl version):