Improving VPC RC's behavior for large accounts #411

GnatorX · 2024-04-18T18:28:25Z

What would you like to be enhanced:
Improve VPC RC's behavior when handling large accounts, specifically around its DescribeNetworkInterfaces calls.

Why is the change needed and what use case will it solve:

Currently, VPC RC makes two kinds of DescribeNetworkInterfaces calls that can time out in accounts with a large number of network interfaces in a VPC and/or subnet.

Clean ENI

Currently Clean ENI runs every 30 minutes, but each run attempts to fetch all ENIs within the VPC and then works through every returned ENI. Since, as far as I can tell, only the vpc-id filter is indexed on AWS' side, this can still return a huge amount of data and run slowly. I believe in our account it takes ~1.6 minutes to return.

I suggest we support running Clean ENI continuously: VPC RC would call DescribeNetworkInterfaces in pages and work through one page every N minutes or seconds rather than doing everything in a single batch. This would give more consistent behavior without spamming AWS' APIs. One issue here is that leaked ENIs could live longer than before (30 minutes) for larger accounts.
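To make the proposal concrete, here is a minimal sketch of a paced, page-by-page sweep. It is written in Python purely for illustration and does not reflect the controller's actual code. `PageFetcher`, `sweep_once`, and `make_fake_fetcher` are hypothetical names, and the fetcher is an in-memory stand-in for one paginated DescribeNetworkInterfaces call (previous NextToken in, one page plus the next token out).

```python
import time
from typing import Callable, List, Optional, Tuple

# A page fetcher mimics one paginated DescribeNetworkInterfaces call:
# it takes the previous NextToken (None for the first page) and returns
# (ENI IDs on this page, NextToken or None when the listing is exhausted).
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def sweep_once(fetch_page: PageFetcher,
               handle_page: Callable[[List[str]], None],
               page_interval_s: float,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Walk the whole listing one page at a time, pausing between pages.

    Returns the number of pages fetched in this sweep.
    """
    token: Optional[str] = None
    pages = 0
    while True:
        enis, token = fetch_page(token)
        handle_page(enis)  # e.g. check each ENI for leak candidates
        pages += 1
        if token is None:
            return pages
        sleep(page_interval_s)  # spread the calls out instead of bursting


def make_fake_fetcher(all_enis: List[str], page_size: int) -> PageFetcher:
    """In-memory stand-in for the EC2 API, used only for this illustration."""
    def fetch(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(all_enis) else None
        return all_enis[start:end], next_token
    return fetch
```

Running this "continuously" would just mean calling `sweep_once` in a loop. The time for one full sweep then grows with the number of pages, which is exactly the trade-off mentioned above for leaked ENIs in larger accounts.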

GetBranchNetworkInterface

This call is only made when rebuilding the cache, i.e. when nodes with trunk ENIs are already running in the cluster and VPC RC was restarted. It narrows the call down to just the subnet ID; however, with a sufficiently big account this is still a slow call (we have around 40k ENIs in 1 AZ). Given that this path is only hit during VPC RC startup and the tag filter isn't indexed, I suggest we shouldn't make this call per node in the cluster during a VPC RC restart. Instead, on startup VPC RC would fetch all network interfaces that are branch ENIs (checking only for the presence of the trunk ENI tag, not its value) in a paginated manner, one page every N seconds, so that the cache is built asynchronously in one pass.

This should reduce the number of calls being made, since we are no longer calling per trunk ENI (per node), and the calls are spread out every N seconds (with N chosen based on what works empirically).
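A rough sketch of the single-pass cache rebuild, again in Python for illustration only: pages of ENIs are consumed as they arrive, and every ENI that carries the trunk tag is grouped under the trunk ID stored in the tag value. The tag key constant, the dict shape of an ENI, and the function name are assumptions made for this example. On the EC2 side, the presence-only check would correspond to filtering on `tag-key` rather than on a specific tag value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed tag key, for illustration only; use whatever key the controller
# actually writes on branch ENIs.
TRUNK_TAG_KEY = "vpcresources.k8s.aws/trunk-eni-id"


def group_branches_by_trunk(pages: Iterable[List[dict]]) -> Dict[str, List[str]]:
    """Build {trunk ENI ID: [branch ENI IDs]} from paginated results.

    `pages` can be a generator that sleeps between pages, so the cache
    fills in gradually instead of being rebuilt once per node.
    """
    cache: Dict[str, List[str]] = defaultdict(list)
    for page in pages:
        for eni in page:
            trunk_id = eni.get("tags", {}).get(TRUNK_TAG_KEY)
            if trunk_id is None:
                continue  # not a branch ENI; skipped defensively
            cache[trunk_id].append(eni["id"])
    return dict(cache)
```

With this shape, the number of API calls depends on the number of pages in the scan, not on the number of nodes with a trunk ENI.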

Big account flag

Lastly, some of these changes may not make sense for all accounts. I suggest we introduce a flag that indicates whether an account is a large account and should behave differently. I am not 100% convinced about this yet; we need more data on how these calls perform when paginated and on whether we should differentiate between large and normal-sized accounts at all.

VPC CNI does something similar: https://github.com/aws/amazon-vpc-cni-k8s/blob/cd7eb5902f5c7a0ebc008bb478843dd14440b8bd/pkg/awsutils/awsutils.go#L1811

sushrk · 2024-04-22T16:46:52Z

> One issue here is that leaked ENIs could live longer than before (30 minutes) for larger accounts.

Leaked ENIs already stay around for ~1h today: the first time we encounter a leaked ENI, it is added to the cache and not immediately deleted, here.
It makes sense to spread this operation over a few seconds or minutes, since leaked ENIs can be deleted asynchronously.
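The "seen once, deleted on the next encounter" behavior described above can be sketched as follows. This is an illustrative Python model, not the controller's implementation; `LeakedENITracker` and its `delete` callback are hypothetical names.

```python
from typing import Callable, Iterable, Set


class LeakedENITracker:
    """Delete an ENI only if it looked leaked in two consecutive runs."""

    def __init__(self, delete: Callable[[str], None]) -> None:
        self._delete = delete
        self._candidates: Set[str] = set()  # seen leaked in the previous run

    def run(self, leaked_now: Iterable[str]) -> None:
        current = set(leaked_now)
        # Seen leaked last run and still leaked now: safe to delete.
        for eni_id in sorted(current & self._candidates):
            self._delete(eni_id)
        # First-time sightings become candidates for the next run; ENIs
        # that were deleted or are no longer leaked drop out.
        self._candidates = current - self._candidates
```

With a 30-minute interval between runs, an ENI can wait up to two runs before deletion, which is consistent with the ~1h figure. A paginated sweep would stretch that window by however long one full pass over the pages takes.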

We are looking into adding pagination and reducing the DescribeNetworkInterfaces EC2 API call volume.

GnatorX · 2024-05-23T18:51:52Z

#188 seems related

GnatorX added the enhancement New feature or request label Apr 18, 2024

sushrk self-assigned this May 7, 2024

jayanthvn assigned haouc and unassigned sushrk Jun 24, 2024

GnatorX mentioned this issue Aug 5, 2024

VPC RC taking upwards of 40 minutes to start up in big account #451 (Open)
