Improving VPC RC's behavior for large accounts #411

Open
GnatorX opened this issue Apr 18, 2024 · 2 comments

GnatorX commented Apr 18, 2024

What would you like to be enhanced:
Improve VPC RC's behavior when handling large accounts. Specifically on DescribeNetworkInterfaces calls.

Why is the change needed and what use case will it solve:

Currently, VPC RC makes DescribeNetworkInterfaces calls in two places, and in accounts with a large number of network interfaces in a VPC and/or subnet these calls can time out.

Clean ENI

Currently Clean ENI runs every 30 minutes, but each run attempts to get all ENIs within a VPC and then works through all the returned ENIs. Since only the vpc-id filter appears to be indexed(?) on AWS's side, this can still return a huge amount of data and run slowly. I believe in our account it takes ~1.6 minutes to return.

I suggest we support running Clean ENI continuously: VPC RC would call DescribeNetworkInterfaces in pages and fetch the next page every N minutes or seconds, rather than fetching everything in a single batch. This would give more consistent behavior without spamming AWS's APIs. One issue here is that, for larger accounts, leaked ENIs could live longer than they do today (30 minutes).

GetBranchNetworkInterface

This call is only made when rebuilding the cache, i.e. when nodes with trunk ENIs are already running in the cluster and VPC RC is restarted. It narrows the call down to just the subnet ID; however, with a sufficiently big account this is still a slow call (we have around 40k ENIs in 1 AZ). Given that this path is only hit during VPC RC startup and the tag filter isn't indexed, I suggest not calling it per node in the cluster during a VPC RC restart. Instead, VPC RC would make the call once at startup and fetch all network interfaces that are branch ENIs (checking only for the presence of the trunk ENI tag, not its value), one page every N seconds, so that the cache is built asynchronously in one pass.

This should reduce the number of calls being made, since we are no longer calling per trunk ENI (per node), and the calls are spread out every N seconds (with N chosen based on what makes sense empirically).
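As a rough illustration of building the cache in one pass instead of per node, the sketch below folds pages of ENIs into a trunk-to-branch map, keyed by the value of the trunk ENI tag. Everything here is an assumption for illustration: `trunkTagKey` is a made-up tag key (the key VPC RC actually uses may differ), and `eni`, `branchCache` and `addPage` are hypothetical names. The pages would come from a single paginated listing filtered on the presence of the tag.

```go
package main

import "fmt"

// trunkTagKey is a hypothetical tag key used only for this sketch.
const trunkTagKey = "example/trunk-eni-id"

// eni is a minimal stand-in for a network interface returned by EC2.
type eni struct {
	ID   string
	Tags map[string]string
}

// branchCache maps a trunk ENI ID to the IDs of its branch ENIs.
type branchCache map[string][]string

// addPage folds one page of ENIs into the cache. ENIs without the trunk tag
// are skipped; the tag value is used only to group branches under a trunk.
func (c branchCache) addPage(page []eni) {
	for _, e := range page {
		trunkID, ok := e.Tags[trunkTagKey]
		if !ok {
			continue
		}
		c[trunkID] = append(c[trunkID], e.ID)
	}
}

func main() {
	// Two pages, as they might arrive from a paginated listing.
	pages := [][]eni{
		{
			{ID: "eni-b1", Tags: map[string]string{trunkTagKey: "eni-trunk-a"}},
			{ID: "eni-x", Tags: map[string]string{"unrelated": "1"}},
		},
		{
			{ID: "eni-b2", Tags: map[string]string{trunkTagKey: "eni-trunk-a"}},
			{ID: "eni-b3", Tags: map[string]string{trunkTagKey: "eni-trunk-b"}},
		},
	}

	cache := branchCache{}
	for _, p := range pages {
		cache.addPage(p)
	}
	fmt.Println(cache)
}
```

One consequence of this design is that lookups for a given node may miss until the pass has reached that node's branch ENIs, so callers would need to tolerate a partially built cache during startup.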

Big account flag

Lastly, some of these changes may not make sense for all accounts. I suggest we introduce a flag that indicates whether an account is a large account and should behave differently. I am not 100% convinced about this yet; we need more data on how these calls perform when paginated and on whether we should differentiate between large and normal-sized accounts.

Similar things are done in VPC CNI: https://github.com/aws/amazon-vpc-cni-k8s/blob/cd7eb5902f5c7a0ebc008bb478843dd14440b8bd/pkg/awsutils/awsutils.go#L1811

@GnatorX GnatorX added the enhancement New feature or request label Apr 18, 2024

sushrk commented Apr 22, 2024

> One issue here is leaked ENIs could live for longer than previous (30 minutes) for larger accounts

Leaked ENIs are around for ~1h today, because the first time we encounter a leaked ENI it is added to the cache and not immediately deleted, here.
It makes sense to spread this operation over a few seconds or minutes, as leaked ENIs can be deleted asynchronously.

We are looking into adding pagination and improving the DescribeNetworkInterfaces EC2 API call volume.

@sushrk sushrk self-assigned this May 7, 2024

GnatorX commented May 23, 2024

#188 seems related
