[Preview] Prefix delegation feature development #1434

Merged
jayanthvn merged 41 commits into aws:prefix-delegation-preview on Apr 30, 2021

Conversation

jayanthvn
Contributor

What type of PR is this?
Feature

Which issue does this PR fix:
n/a

What does this PR do / Why do we need it:
The current implementation allocates secondary IPs per ENI, which are used as pod IPs. With the new VPC feature, each secondary IP is replaced with a /28 prefix from the ENI's subnet, and pod IPs are allocated from these prefixes.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

Yes and in progress

Automation added to e2e:

Yes

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Yes

Does this change require updates to the CNI daemonset config files to work?:

Yes

Does this PR introduce any user-facing change?:

Yes

Prefix delegation feature

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jayanthvn jayanthvn self-assigned this Apr 21, 2021
@jayanthvn jayanthvn added this to the v1.9.0 milestone Apr 21, 2021
Contributor

@M00nF1sh M00nF1sh left a comment

discussed with Jayanth offline.
This is for preview only and will only be merged into the preview branch instead of the master branch.

Approving this one so that we produce a preview version for testing.
However, this PR does need more work before merging into master.

@jayanthvn
Contributor Author

jayanthvn commented Apr 29, 2021

discussed with Jayanth offline.
This is for preview only and will only be merged into the preview branch instead of the master branch.

Approving this one so that we produce a preview version for testing.
However, this PR does need more work before merging into master.

Thanks, Yang. Since it is mostly the code-rearranging part, as discussed with you, I will take it up post preview :) :)


@fawadkhaliq fawadkhaliq left a comment

/lgtm for preview

(Jayanth will address code structure feedback as a fast follow up)

@jayanthvn jayanthvn merged commit 604d00c into aws:prefix-delegation-preview Apr 30, 2021
`WARM_PREFIX_TARGET`

Default: None

Specifies the number of free IPv4 (/28) prefixes that the `ipamd` daemon should attempt to keep available for pod assignment on the node.
This environment variable overrides `WARM_ENI_TARGET`, `WARM_IP_TARGET` and `MINIMUM_IP_TARGET`, and takes effect when `ENABLE_PREFIX_DELEGATION` is set to `true`.
Contributor

@asheldon asheldon left a comment

Thank you for writing this preview documentation!

I don't know what it means to have a WARM_PREFIX_TARGET available. Does this mean the same thing as WARM_IP_TARGET == 16 would mean, or does it mean to have an entire contiguous empty prefix? Do any of the existing variables WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET work with prefix delegation when WARM_PREFIX_TARGET is unset?

Suppose WARM_PREFIX_TARGET == 1. At startup, two prefixes will be allocated: [A, B]. When the 17th pod is started, a third prefix will be allocated, [A, B, C]. Then when a 33rd pod is started, a fourth prefix will be allocated, [A, B, C, D]. If all pods but one are terminated in [A] and in [B], leaving three running pods, will the CNI return prefix D or keep it around as an available contiguous range?

I think this should be logically equivalent to WARM_IP_TARGET == 16 and I think this should be implemented by making WARM_IP_TARGET and MINIMUM_IP_TARGET work with prefix delegation. A /28 prefix is equivalent to allocating / deallocating 16 IPs at a time, so the existing abstraction seems like it could fit. This has the added benefit of allowing a cluster owner to specify e.g. WARM_IP_TARGET = 8 to avoid allocating the additional IP range on nodes that have workload combinations that are unlikely to require the additional range, yet providing it on nodes that end up with smaller workloads.

I don't see any obvious value to having a contiguous free /28 available if there are plenty of available IPs in other existing prefixes, which is why I think it should logically behave like WARM_IP_TARGET. Additionally, having this behave like WARM_IP_TARGET could align well with switching modes later on, where prefixes * 16 + allocated secondary IPs is the current number of allocated IPs and this can drain down secondary IPs or prefixes, over time, as the mode shifts one way or another.
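
As an illustration, here is a minimal Go sketch of the allocation sequence described above (a hypothetical model of WARM_PREFIX_TARGET behavior, not the ipamd implementation): the node holds enough prefixes for its running pods plus WARM_PREFIX_TARGET whole free /28s.

```go
package main

import "fmt"

// prefixesWithWarmPrefixTarget models the sequence described above: the node
// holds enough /28 prefixes for its running pods (at least one), plus
// warmPrefixTarget whole free prefixes. Hypothetical model, not ipamd code.
func prefixesWithWarmPrefixTarget(runningPods, warmPrefixTarget int) int {
	inUse := (runningPods + 15) / 16 // prefixes needed to hold the running pods
	if inUse == 0 {
		inUse = 1 // a node keeps at least one prefix even with no pods
	}
	return inUse + warmPrefixTarget
}

func main() {
	for _, pods := range []int{0, 16, 17, 33} {
		fmt.Printf("pods=%2d -> prefixes=%d\n", pods, prefixesWithWarmPrefixTarget(pods, 1))
	}
	// pods= 0 -> 2 ([A, B]), pods=17 -> 3 ([A, B, C]), pods=33 -> 4 ([A, B, C, D])
}
```

Under this model a mostly idle node still holds two prefixes (32 pod IPs), which is the concern described above.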

Contributor Author

@jayanthvn jayanthvn May 4, 2021

Hi @asheldon,

Thanks for your feedback. With this feature (ENABLE_PREFIX_DELEGATION=true) we cannot support WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET, mainly because IPs are reserved in chunks of 16 (a /28 prefix). So, say you set WARM_IP_TARGET to 7: we would still need to allocate 2 prefixes.

In the example you have given for WARM_PREFIX_TARGET == 1, yes, prefix D will be kept in the current implementation. I am planning to improve this by checking the total number of free IPs and whether it is sufficient to satisfy the WARM_PREFIX_TARGET.

Yes, it is logically equivalent to WARM_IP_TARGET == 16, but we need to allocate in chunks of /28 prefixes with this feature, hence we cannot have fine-grained controls like the existing WARM targets. So even with WARM_IP_TARGET = 8 we would still have to allocate one /28 prefix. Each secondary IP is replaced by a /28 prefix, hence we cannot support having both prefixes and secondary IPs.

Agreed on checking the number of free IPs and then allocating a prefix only if needed; I am working on that optimization, but we cannot have fine-grained WARM_IP or MINIMUM_IP controls with this feature.

Contributor

@asheldon asheldon May 5, 2021

Hi @jayanthvn. Thank you for the reply. I think nodes with low-density packing should require as few as one prefix, and nodes with high-density packing should have as many prefixes as required, more or less. WARM_PREFIX_TARGET doesn't allow for both: the minimum will be two prefixes unless you give up the ability for a node to have more than one prefix worth of pods. Having each node use a minimum of 1+32 IPs makes total IP consumption on the network grow very fast.

The way I imagine this would work is different, and basically just treats WARM_IP_TARGET as a signal to IPAMD of how many IPs to keep in reserve, like it is now. WARM_IP_TARGET:

Specifies the number of free IP addresses that the ipamd daemon should attempt to keep available for pod assignment on the node.

In combination with ENABLE_PREFIX_DELEGATION=true, I would interpret this to set a floor on how many IPs we want available. For example, if WARM_IP_TARGET were 6 and I were running 8 pods, I would consume only one prefix. This is because one prefix is 16 IPs, 8 are used and the number of warm IPs is also 8. 8 warm IPs meets the WARM_IP_TARGET of 6, so a second prefix is not required. Then 3 more pods start on this node, bringing us to 11 running pods and only 5 available IPs. Since the WARM_IP_TARGET is 6, we would then allocate a second prefix. MINIMUM_IP_TARGET can essentially work the same. MINIMUM_PREFIXES=CEILING(MINIMUM_IP_TARGET/16). Similarly, nodes that reach 17 pods might consume only two prefixes, not three.

Nodes still consume a minimum of 1+16 IPs, but this gives cluster operators the ability to hint to the CNI how much buffer they really need around at a level that is lower than a single prefix, or more than a single prefix (but not an exact multiple of 16).
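
As an illustration, here is a minimal Go sketch of the behaviour being proposed (a hypothetical model of the suggested semantics, not existing ipamd code): allocate the smallest number of /28 prefixes that keeps at least WARM_IP_TARGET IPs free beyond the running pods and also covers MINIMUM_IP_TARGET.

```go
package main

import "fmt"

// prefixesForTargets models the proposal above: the smallest number of /28
// prefixes that leaves at least warmIPTarget free IPs beyond the running pods
// and also covers minimumIPTarget. Hypothetical model, not existing ipamd code.
func prefixesForTargets(runningPods, minimumIPTarget, warmIPTarget int) int {
	need := runningPods + warmIPTarget
	if minimumIPTarget > need {
		need = minimumIPTarget
	}
	prefixes := (need + 15) / 16 // round up to whole /28 prefixes
	if prefixes == 0 {
		prefixes = 1 // a node keeps at least one prefix
	}
	return prefixes
}

func main() {
	// The example above with WARM_IP_TARGET=6: 8 pods fit in one prefix
	// (8 free >= 6), while 11 pods leave only 5 free and trigger a second prefix.
	fmt.Println(prefixesForTargets(8, 0, 6))  // 1
	fmt.Println(prefixesForTargets(11, 0, 6)) // 2
	// MINIMUM_PREFIXES = CEILING(MINIMUM_IP_TARGET/16), e.g. 36 -> 3 prefixes.
	fmt.Println(prefixesForTargets(0, 36, 0)) // 3
}
```

The comparison table later in this thread follows the same arithmetic.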

Contributor Author

Regarding this - "the minimum will be two prefixes unless you give up the ability for a node to have more than one prefix worth of pods. Having each node use a minimum of 1+32 IPs makes total IP consumption on the network grow very fast." - The number of pods that can be scheduled on the node is not changed. Do you mean something like: if we set WARM_PREFIX_TARGET = 1 and one node has 10 pods, then that node will have 1 new prefix + 6 IPs available, while another node has just one pod and even then that node will have 1 new prefix + 15 IPs available? If the pod density per node is not high, then WARM_PREFIX_TARGET can be set to 0. Allocating new prefixes is faster than allocating an ENI, so it mostly shouldn't impact pod launch time much (I still need to measure performance).

I agree this will be a good enhancement, but I don't think we can have it as WARM_IP_TARGET since it will change the definition from what we previously supported. As you mentioned, WARM_IP_TARGET is the number of free IPs in the datastore, and if we support it for PD then it becomes similar to MINIMUM_IP_TARGET, because at a minimum a node can have either WARM_IP_TARGET IPs or (WARM_IP_TARGET + X) where X can go up to a maximum of 16. In the example you mentioned, if 11 are consumed then we will have [ 15 IPs + 6 (WARM_IP_TARGET) ] in the warm pool.

Contributor

@asheldon asheldon May 5, 2021

Hi @jayanthvn. Thanks again for your response and your time.

The number of pods that can be scheduled on the node is not changed. Do you mean something like: if we set WARM_PREFIX_TARGET = 1 and one node has 10 pods, then that node will have 1 new prefix + 6 IPs available, while another node has just one pod and even then that node will have 1 new prefix + 15 IPs available? If the pod density per node is not high, then WARM_PREFIX_TARGET can be set to 0. Allocating new prefixes is faster than allocating an ENI, so it mostly shouldn't impact pod launch time much.

What I've seen in EKS is that when a pod is scheduled on a node that has 0 available IPs in IPAM, the pod gets into a bad state with no IP and has to be manually deleted. This can be triggered with scenarios like WARM_IP_TARGET=1 and scheduling just two pods at the same time. Because of this behavior, I don't feel comfortable running nodes in configurations that might temporarily reach 0 available IPs. This includes WARM_IP_TARGET=0 or WARM_PREFIX_TARGET=0.

I did not understand that the design here was to just-in-time allocate an additional prefix when all existing prefixes were exhausted and additional IPs were required, and I don't trust that behavior. I don't want to allocate the additional prefix at the last possible moment, but I also don't want to always allocate an extra 16 IPs. WARM_IP_TARGET is an in-between. I want to allocate the additional prefix when I'm running low on available IPs, like 6 IPs, not 0 or 16.

I don't think we can have it as WARM_IP_TARGET since it will change the definition from what we previously supported.

WARM_IP_TARGET has never been used before with ENABLE_PREFIX_DELEGATION, so I think it is okay for its definition to be adjusted to behave slightly differently in that context, as long as its behavior has the same essential reasoning to it. I want to tell the CNI how many IPs I want around just-in-case. If it can only deliver IPs in blocks of 16, well, I opted into that with ENABLE_PREFIX_DELEGATION, much the same way that the number of WARM_IPs can't fall below MINIMUM_IP_TARGET, which was introduced later.

When WARM_IP_TARGET is set below the maximum number of IPs on a single ENI, it balances the need to start pods rapidly with IP consumption. We can start at least WARM_IP_TARGET pods at any time, but we also don't consume an entire ENI worth of IPs (up to 50!) to make this guarantee.

When we use WARM_IP_TARGET or MINIMUM_IP_TARGET today, the CNI determines how many ENIs it needs to meet that need and allocates / deallocates ENIs on-demand. When WARM_IP_TARGET and MINIMUM_IP_TARGET are used in combination with ENABLE_PREFIX_DELEGATION, I believe it should work the same: the CNI determines how many PDs are required to meet these targets and allocates that many PDs. There will be at least WARM_IP_TARGET unused IPs in each PD, but the option may save entire prefixes worth of IP space.

As you mentioned, WARM_IP_TARGET is the number of free IPs in the datastore, and if we support it for PD then it becomes similar to MINIMUM_IP_TARGET, because at a minimum a node can have either WARM_IP_TARGET IPs or (WARM_IP_TARGET + X) where X can go up to a maximum of 16. In the example you mentioned, if 11 are consumed then we will have [ 15 IPs + 6 (WARM_IP_TARGET) ] in the warm pool.

I'm not following you here. If I enable prefix delegation and set MINIMUM_IP_TARGET=36 and WARM_IP_TARGET=4, then I expect to always have at least three prefixes (to meet MINIMUM_IP_TARGET with 3*16=48 IPs). If I run 45 pods on this node, I would fall to only 3 available IPs and would expect to allocate a fourth prefix, bringing my usage to 45 pods / 64 IPs. In the prior example, there are 11 pods running and a WARM_IP_TARGET of 6. Since a single PD contains only 16 IPs, there would be only 5 available IPs left and a second PD would be allocated. The minimum number of PDs consumed by the node would be equal to CEIL(WARM_IP_TARGET/16), so the minimum number of IPs consumed would be CEIL(WARM_IP_TARGET/16)*16.

| IPs IN USE | MINIMUM_IP_TARGET | WARM_IP_TARGET | WARM_PREFIX_TARGET | PDs | UNUSED IPs |
|-----------:|------------------:|---------------:|-------------------:|----:|-----------:|
| 33 | UNSET | UNSET | 1 | 4 | 31 |
| 33 | 36 | 6 | UNSET | 3 | 15 |
| 47 | UNSET | UNSET | 1 | 4 | 17 |
| 47 | 36 | 6 | UNSET | 4 | 17 |
| 49 | UNSET | UNSET | 1 | 5 | 31 |
| 49 | 36 | 6 | UNSET | 4 | 15 |
| 21 | 20 | 10 | UNSET | 2 | 11 |
| 8 | UNSET | 6 | UNSET | 1 | 8 |
| 16 | UNSET | 4 | UNSET | 2 | 16 |
| 16 | UNSET | UNSET | 0 | 1 | 0 |

Unless WARM_PREFIX_TARGET is set to 0, it always consumes the same or more IPs than what I am suggesting, but when it is set to 0, it can reach 0 available IPs, which is very low.

Contributor Author

@jayanthvn jayanthvn May 5, 2021

Thanks a lot, Aaron, for the detailed explanation. As I mentioned, this is a very nice enhancement and I would like to take it up. I will open a follow-up PR to add this support. This will be a good customer experience. Really appreciate your feedback and time on this :)

`ENABLE_PREFIX_DELEGATION`

Default: `false`

Enables IPv4 prefix delegation on Nitro instances. Setting the `ENABLE_PREFIX_DELEGATION` flag to `true` will start allocating a /28 prefix
instead of a secondary IP in the ENI's subnet. The total number of prefixes and private IP addresses will be less than the
limit on private IPs allowed by your instance.
Contributor

Just to check my understanding, this means that each secondary ENI will consume 16 IPs from its VPC, all of which will be available for use by pods.

Contributor Author

Yes, that's right.

Enables IPv4 prefix delegation on Nitro instances. Setting the `ENABLE_PREFIX_DELEGATION` flag to `true` will start allocating a /28 prefix
instead of a secondary IP in the ENI's subnet. The total number of prefixes and private IP addresses will be less than the
limit on private IPs allowed by your instance. The current preview will support a single /28 prefix per ENI. Knob toggle while pods are running or if
ENIs are attached is not supported. On toggling the knob, the node should be recycled to pick up the new kubelet max-pods value.
Contributor

I would suggest clearer terminology here than "Knob toggle" or "toggling the knob". Do you mean that the CNI does not support enabling or disabling this feature without replacing each node?

Contributor Author

Sure, I will reword it. What I mean here is that when the feature is enabled or disabled, max pods changes, hence a kubelet restart or a new node group with an updated max-pods value is needed.
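
As an illustration, here is a minimal Go sketch of why max pods changes when the feature is toggled, assuming the preview behaviour quoted above (a single /28 prefix per ENI) and the usual EKS-style allowance of 2 extra host-network pods; the instance limits in main are hypothetical.

```go
package main

import "fmt"

// maxPods sketches the max-pods difference between the two modes, assuming the
// preview behaviour (one /28 prefix per ENI) and the common EKS-style "+2" for
// host-network pods. eniLimit and ipsPerENI are hypothetical instance limits.
func maxPods(eniLimit, ipsPerENI int, prefixDelegation bool) int {
	if prefixDelegation {
		return eniLimit*16 + 2 // one /28 (16 addresses) per ENI in this preview
	}
	return eniLimit*(ipsPerENI-1) + 2 // secondary IPs; primary IP reserved per ENI
}

func main() {
	// Hypothetical instance with 3 ENIs and 10 IPs per ENI.
	fmt.Println("secondary IP mode:", maxPods(3, 10, false)) // 29
	fmt.Println("prefix delegation:", maxPods(3, 10, true))  // 50
}
```

Since the two modes yield different values, the kubelet's max-pods setting has to be updated (and the node recycled) when the knob is flipped.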

jayanthvn added a commit that referenced this pull request Jun 24, 2021
…ter (#1516)

* [Preview] Prefix delegation feature development (#1434)

* ENABLE_PREFIX_DELEGATION knob

WARM_PREFIX_TARGET knob

cr https://code.amazon.com/reviews/CR-40610031

* PD changes - dev only

* Cooldown prefix IP

* minor fixes to support prefix count

* Code cleanup

* Handle few corner cases

* Nitro based check

* With custom networking, do not get prefix for primary ENI

* Code refactor

* Handle graceful upgrade/enable PD from disable PD

* code refactoring

* Code refactoring

* fix computing too low IPs

* UT for prefix store

* Fix UTs and handle CR comments

* Clean up SDK code and fix model code generation

* fix format and merge induced error

* Merge broke the code

* Fix Dockerfile.test

* Added IPAMD UTs and fixed removeENI total count

* Couple more IPAMD UTs for PD

* UTs for awsutils/imds

* Handle graceful PD enable to disable knob

* get prefix list for non-pd case

* Prevent reconcile of prefix IPs in IP datastore

* Handle disable scenario

* fix formatting

* clean up comment

* Remove unnecessary debugs

* Handle PR comments

* formatting fix

* Remodelled PD datastore

* Fix up UTs and fix Prefix nil

* formatting

* PR comments - minor cosmetic changes

* removed the sdk override from makefile

* Internal repo merge added these lines

* Update config file

* Handle wrapper of DescribeNetworkInterfacesWithContext to take one eni

* RemoveUnusedENIFromStore was not accounting for prefixes deleted

* Removed hardcoding of 16

* Code refactor - merge ENI's secondary and prefix store into single store of CIDRs  (#1471)

* Code refactor - merge to single DB

* remove few debugs

* remove prefix store files

* PR comments

* Fix up CR comments

* formatting

* Updated UT cases

* UT and formatting

* Minor fixes

* Minor comments

* Updated /32 store term

* remove unused code

* Multi-prefix and WARM/MIN IP targets support with PD (#1477)

* Multi-pd and WARM targets support

* cleanup

* Updated variable names

* Default prefix count to -1

* Get stats should be computed on the fly since CIDR pool can have /32 or /28

* Support for warm prefix 0

* code review comments

* PD test cases and readme update (#1478)

* Traffic test case and readme update

* Added testcases for warm ip/min ip with PD

* Testcases for prefix count

* Testcase for warm prefix along with warm ip/min ip

* Updated traffic test case while PD mode is flipped

* Fix minor comments

* pr comments

* added pods per eni

* fix up count

* Support mixed instances with PD (#1483)

* Support mixed instances with PD

* fix up the log

* Optimization for prefixes allocation (#1500)

* optimization for prefixes

Prefix store optimization

* pr comment

* Fixup eni allocation with warm targets (#1512)

* Fixup eni allocation with warm targets

* fixup cidr count

* code comments and warm prefix 0

* Default WARM_PREFIX_TARGET to 1 (#1515)

* Handle prefix target 0

* pr comments

* Fix up UTs, was failing because of vendor

* No need to commit this

* make format

* needed for UT workflow

* IMDS code refactor

* PR comments - v1

Error with merge

* PR comments v2

* PR comments - v3

* Update logs

* PR comments - v4