[Preview] Prefix delegation feature development #1434

Merged
jayanthvn merged 41 commits into aws:prefix-delegation-preview on Apr 30, 2021

Conversation

jayanthvn
Contributor

What type of PR is this?
Feature

Which issue does this PR fix:
n/a

What does this PR do / Why do we need it:
The current implementation allocates secondary IPs per ENI, which are used as pod IPs. With the new VPC feature, each secondary IP is replaced with a /28 prefix from the ENI's subnet, and pod IPs are allocated from these prefixes.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

Yes and in progress

Automation added to e2e:

Yes

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Yes

Does this change require updates to the CNI daemonset config files to work?:

Yes

Does this PR introduce any user-facing change?:

Yes

Prefix delegation feature

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jayanthvn jayanthvn self-assigned this Apr 21, 2021
@jayanthvn jayanthvn added this to the v1.9.0 milestone Apr 21, 2021
Contributor

@M00nF1sh M00nF1sh left a comment

discussed with Jayanth offline.
This is for preview only and will only be merged into the preview branch instead of the master branch.

Approving this one so that we produce a preview version for testing.
However, this PR does need more work before merging into master.

@jayanthvn
Contributor Author

jayanthvn commented Apr 29, 2021

discussed with Jayanth offline.
This is for preview only and will only be merged into the preview branch instead of the master branch.

Approving this one so that we produce a preview version for testing.
However, this PR does need more work before merging into master.

Thanks, Yang. Since it is mostly the code-rearranging part, as discussed with you, I will take it up post preview :) :)


@fawadkhaliq fawadkhaliq left a comment

/lgtm for preview

(Jayanth will address code structure feedback as a fast follow up)

@jayanthvn jayanthvn merged commit 604d00c into aws:prefix-delegation-preview Apr 30, 2021
`WARM_PREFIX_TARGET`

Default: None

Specifies the number of free IPv4 (/28) prefixes that the `ipamd` daemon should attempt to keep available for pod assignment on the node.
This environment variable overrides `WARM_ENI_TARGET`, `WARM_IP_TARGET` and `MINIMUM_IP_TARGET`, and takes effect when `ENABLE_PREFIX_DELEGATION` is set to `true`.
Contributor

@asheldon asheldon left a comment

Thank you for writing this preview documentation!

I don't know what it means to have a WARM_PREFIX_TARGET available. Does this mean the same thing as WARM_IP_TARGET == 16 would mean, or does it mean to have an entire contiguous empty prefix? Do any of the existing variables WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET work with prefix delegation when WARM_PREFIX_TARGET is unset?

Suppose WARM_PREFIX_TARGET == 1. At startup, two prefixes will be allocated: [A, B]. When the 17th pod is started, a third prefix will be allocated, [A, B, C]. Then when a 33rd pod is started, a fourth prefix will be allocated, [A, B, C, D]. If all pods but one are terminated in [A] and in [B], leaving three running pods, will the CNI return prefix D or keep it around as an available contiguous range?

I think this should be logically equivalent to WARM_IP_TARGET == 16 and I think this should be implemented by making WARM_IP_TARGET and MINIMUM_IP_TARGET work with prefix delegation. A /28 prefix is equivalent to allocating / deallocating 16 IPs at a time, so the existing abstraction seems like it could fit. This has the added benefit of allowing a cluster owner to specify e.g. WARM_IP_TARGET = 8 to avoid allocating the additional IP range on nodes that have workload combinations that are unlikely to require the additional range, yet providing it on nodes that end up with smaller workloads.

I don't see any obvious value to having a contiguous free /28 available if there are plenty of available IPs in other existing prefixes, which is why I think it should logically behave like WARM_IP_TARGET. Additionally, having this behave like WARM_IP_TARGET could align well with switching modes later on, where prefixes * 16 + allocated secondary IPs is the current number of allocated IPs and this can drain down secondary IPs or prefixes, over time, as the mode shifts one way or another.
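
As an illustration, here is a minimal Go sketch of the allocation sequence described above (a hypothetical model of WARM_PREFIX_TARGET behavior, not the ipamd implementation): the node holds enough prefixes for its running pods plus WARM_PREFIX_TARGET whole free /28s.

```go
package main

import "fmt"

// prefixesWithWarmPrefixTarget models the sequence described above: the node
// holds enough /28 prefixes for its running pods (at least one), plus
// warmPrefixTarget whole free prefixes. Hypothetical model, not ipamd code.
func prefixesWithWarmPrefixTarget(runningPods, warmPrefixTarget int) int {
	inUse := (runningPods + 15) / 16 // prefixes needed to hold the running pods
	if inUse == 0 {
		inUse = 1 // a node keeps at least one prefix even with no pods
	}
	return inUse + warmPrefixTarget
}

func main() {
	for _, pods := range []int{0, 16, 17, 33} {
		fmt.Printf("pods=%2d -> prefixes=%d\n", pods, prefixesWithWarmPrefixTarget(pods, 1))
	}
	// pods= 0 -> 2 ([A, B]), pods=17 -> 3 ([A, B, C]), pods=33 -> 4 ([A, B, C, D])
}
```

Under this model a mostly idle node still holds two prefixes (32 pod IPs), which is the concern described above.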

Contributor Author

@jayanthvn jayanthvn May 4, 2021

Hi @asheldon,

Thanks for your feedback. With this feature (ENABLE_PREFIX_DELEGATION=true) we cannot support WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET, mainly because IPs are reserved in chunks of 16 (a /28 prefix). So, say you set WARM_IP_TARGET to 7: we would still need to allocate 2 prefixes.

In the example you have given for WARM_PREFIX_TARGET == 1, yes, prefix D will be kept in the current implementation. I am planning to improve this by checking the total number of free IPs and whether it is sufficient to satisfy the WARM_PREFIX_TARGET.

Yes, it is logically equivalent to WARM_IP_TARGET == 16, but we need to allocate in chunks of /28 prefixes with this feature, hence we cannot have fine-grained controls like the existing WARM targets. So even with WARM_IP_TARGET = 8 we would still have to allocate one /28 prefix. Each secondary IP is replaced by a /28 prefix, hence we cannot support having both prefixes and secondary IPs.

Agreed on checking the number of free IPs and then allocating a prefix only if needed; I am working on that optimization, but we cannot have fine-grained WARM_IP or MINIMUM_IP controls with this feature.

Contributor

@asheldon asheldon May 5, 2021

Hi @jayanthvn. Thank you for the reply. I think nodes with low-density packing should require as few as one prefix, and nodes with high-density packing should have as many prefixes as required, more or less. WARM_PREFIX_TARGET doesn't allow for both: the minimum will be two prefixes unless you give up the ability for a node to have more than one prefix worth of pods. Having each node use a minimum of 1+32 IPs makes total IP consumption on the network grow very fast.

The way I imagine this would work is different, and basically just treats WARM_IP_TARGET as a signal to IPAMD of how many IPs to keep in reserve, like it is now. WARM_IP_TARGET:

Specifies the number of free IP addresses that the ipamd daemon should attempt to keep available for pod assignment on the node.

In combination with ENABLE_PREFIX_DELEGATION=true, I would interpret this to set a floor on how many IPs we want available. For example, if WARM_IP_TARGET were 6 and I were running 8 pods, I would consume only one prefix. This is because one prefix is 16 IPs, 8 are used and the number of warm IPs is also 8. 8 warm IPs meets the WARM_IP_TARGET of 6, so a second prefix is not required. Then 3 more pods start on this node, bringing us to 11 running pods and only 5 available IPs. Since the WARM_IP_TARGET is 6, we would then allocate a second prefix. MINIMUM_IP_TARGET can essentially work the same. MINIMUM_PREFIXES=CEILING(MINIMUM_IP_TARGET/16). Similarly, nodes that reach 17 pods might consume only two prefixes, not three.

Nodes still consume a minimum of 1+16 IPs, but this gives cluster operators the ability to hint to the CNI how much buffer they really need around at a level that is lower than a single prefix, or more than a single prefix (but not an exact multiple of 16).
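
As an illustration, here is a minimal Go sketch of the behaviour being proposed (a hypothetical model of the suggested semantics, not existing ipamd code): allocate the smallest number of /28 prefixes that keeps at least WARM_IP_TARGET IPs free beyond the running pods and also covers MINIMUM_IP_TARGET.

```go
package main

import "fmt"

// prefixesForTargets models the proposal above: the smallest number of /28
// prefixes that leaves at least warmIPTarget free IPs beyond the running pods
// and also covers minimumIPTarget. Hypothetical model, not existing ipamd code.
func prefixesForTargets(runningPods, minimumIPTarget, warmIPTarget int) int {
	need := runningPods + warmIPTarget
	if minimumIPTarget > need {
		need = minimumIPTarget
	}
	prefixes := (need + 15) / 16 // round up to whole /28 prefixes
	if prefixes == 0 {
		prefixes = 1 // a node keeps at least one prefix
	}
	return prefixes
}

func main() {
	// The example above with WARM_IP_TARGET=6: 8 pods fit in one prefix
	// (8 free >= 6), while 11 pods leave only 5 free and trigger a second prefix.
	fmt.Println(prefixesForTargets(8, 0, 6))  // 1
	fmt.Println(prefixesForTargets(11, 0, 6)) // 2
	// MINIMUM_PREFIXES = CEILING(MINIMUM_IP_TARGET/16), e.g. 36 -> 3 prefixes.
	fmt.Println(prefixesForTargets(0, 36, 0)) // 3
}
```

The comparison table later in this thread follows the same arithmetic.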

Contributor Author

Regarding this - "the minimum will be two prefixes unless you give up the ability for a node to have more than one prefix worth of pods. Having each node use a minimum of 1+32 IPs makes total IP consumption on the network grow very fast." - The number of pods that can be scheduled on the node is not changed. Do you mean something like: if we set WARM_PREFIX_TARGET = 1 and one node has 10 pods, then that node will have 1 new prefix + 6 IPs available, while another node has just one pod and even then that node will have 1 new prefix + 15 IPs available? If the pod density per node is not high, then WARM_PREFIX_TARGET can be set to 0. Allocating new prefixes is faster than allocating an ENI, so it mostly shouldn't impact pod launch time much (I still need to measure performance).

I agree this will be a good enhancement, but I don't think we can have it as WARM_IP_TARGET since it will change the definition from what we previously supported. As you mentioned, WARM_IP_TARGET is the number of free IPs in the datastore, and if we support it for PD then it becomes similar to MINIMUM_IP_TARGET, because at a minimum a node can have either WARM_IP_TARGET IPs or (WARM_IP_TARGET + X) where X can go up to a maximum of 16. In the example you mentioned, if 11 are consumed then we will have [ 15 IPs + 6 (WARM_IP_TARGET) ] in the warm pool.

Contributor

@asheldon asheldon May 5, 2021

Hi @jayanthvn. Thanks again for your response and your time.

The number of pods that can be scheduled on the node is not changed. Do you mean something like: if we set WARM_PREFIX_TARGET = 1 and one node has 10 pods, then that node will have 1 new prefix + 6 IPs available, while another node has just one pod and even then that node will have 1 new prefix + 15 IPs available? If the pod density per node is not high, then WARM_PREFIX_TARGET can be set to 0. Allocating new prefixes is faster than allocating an ENI, so it mostly shouldn't impact pod launch time much.

What I've seen in EKS is that when a pod is scheduled on a node that has 0 available IPs in IPAM, the pod gets into a bad state with no IP and has to be manually deleted. This can be triggered with scenarios like WARM_IP_TARGET=1 and scheduling just two pods at the same time. Because of this behavior, I don't feel comfortable running nodes in configurations that might temporarily reach 0 available IPs. This includes WARM_IP_TARGET=0 or WARM_PREFIX_TARGET=0.

I did not understand that the design here was to just-in-time allocate an additional prefix when all existing prefixes were exhausted and additional IPs were required, and I don't trust that behavior. I don't want to allocate the additional prefix at the last possible moment, but I also don't want to always allocate an extra 16 IPs. WARM_IP_TARGET is an in-between. I want to allocate the additional prefix when I'm running low on available IPs, like 6 IPs, not 0 or 16.

I don't think we can have it as WARM_IP_TARGET since it will change the definition from what we previously supported.

WARM_IP_TARGET has never been used before with ENABLE_PREFIX_DELEGATION, so I think it is okay for its definition to be adjusted to behave slightly differently in that context, as long as its behavior has the same essential reasoning to it. I want to tell the CNI how many IPs I want around just-in-case. If it can only deliver IPs in blocks of 16, well, I opted into that with ENABLE_PREFIX_DELEGATION, much the same way that the number of WARM_IPs can't fall below MINIMUM_IP_TARGET, which was introduced later.

When WARM_IP_TARGET is set below the maximum number of IPs on a single ENI, it balances the need to start pods rapidly with IP consumption. We can start at least WARM_IP_TARGET pods at any time, but we also don't consume an entire ENI worth of IPs (up to 50!) to make this guarantee.

When we use WARM_IP_TARGET or MINIMUM_IP_TARGET today, the CNI determines how many ENIs it needs to meet that need and allocates / deallocates ENIs on-demand. When WARM_IP_TARGET and MINIMUM_IP_TARGET are used in combination with ENABLE_PREFIX_DELEGATION, I believe it should work the same: the CNI determines how many PDs are required to meet these targets and allocates that many PDs. There will be at least WARM_IP_TARGET unused IPs in each PD, but the option may save entire prefixes worth of IP space.

As you mentioned, WARM_IP_TARGET is the number of free IPs in the datastore, and if we support it for PD then it becomes similar to MINIMUM_IP_TARGET, because at a minimum a node can have either WARM_IP_TARGET IPs or (WARM_IP_TARGET + X) where X can go up to a maximum of 16. In the example you mentioned, if 11 are consumed then we will have [ 15 IPs + 6 (WARM_IP_TARGET) ] in the warm pool.

I'm not following you here. If I enable prefix delegation and set MINIMUM_IP_TARGET=36 and WARM_IP_TARGET=4, then I expect to always have at least three prefixes (to meet MINIMUM_IP_TARGET with 3*16=48 IPs). If I run 45 pods on this node, I would fall to only 3 available IPs and would expect to allocate a fourth prefix, bringing my usage to 45 pods / 64 IPs. In the prior example, there are 11 pods running and a WARM_IP_TARGET of 6. Since a single PD contains only 16 IPs, there would be only 5 available IPs left and a second PD would be allocated. The minimum number of PDs consumed by the node would be equal to CEIL(WARM_IP_TARGET/16), so the minimum number of IPs consumed would be CEIL(WARM_IP_TARGET/16)*16.

| IPs IN USE | MINIMUM_IP_TARGET | WARM_IP_TARGET | WARM_PREFIX_TARGET | PDs | UNUSED IPs |
|-----------:|------------------:|---------------:|-------------------:|----:|-----------:|
| 33 | UNSET | UNSET | 1 | 4 | 31 |
| 33 | 36 | 6 | UNSET | 3 | 15 |
| 47 | UNSET | UNSET | 1 | 4 | 17 |
| 47 | 36 | 6 | UNSET | 4 | 17 |
| 49 | UNSET | UNSET | 1 | 5 | 31 |
| 49 | 36 | 6 | UNSET | 4 | 15 |
| 21 | 20 | 10 | UNSET | 2 | 11 |
| 8 | UNSET | 6 | UNSET | 1 | 8 |
| 16 | UNSET | 4 | UNSET | 2 | 16 |
| 16 | UNSET | UNSET | 0 | 1 | 0 |

Unless WARM_PREFIX_TARGET is set to 0, it always consumes the same or more IPs than what I am suggesting, but when it is set to 0, it can reach 0 available IPs, which is very low.

Contributor Author

@jayanthvn jayanthvn May 5, 2021

Thanks a lot, Aaron, for the detailed explanation. As I mentioned, this is a very nice enhancement and I would like to take it up. I will open a follow-up PR to add this support. This will be a good customer experience. Really appreciate your feedback and time on this :)

`ENABLE_PREFIX_DELEGATION`

Default: `false`

Enables IPv4 prefix delegation on Nitro instances. Setting the `ENABLE_PREFIX_DELEGATION` flag to `true` will start allocating a /28 prefix
instead of a secondary IP in the ENI's subnet. The total number of prefixes and private IP addresses will be less than the
limit on private IPs allowed by your instance.
Contributor

Just to check my understanding, this means that each secondary ENI will consume 16 IPs from its VPC, all of which will be available for use by pods.

Contributor Author

Yes, that's right.

Enables IPv4 prefix delegation on Nitro instances. Setting the `ENABLE_PREFIX_DELEGATION` flag to `true` will start allocating a /28 prefix
instead of a secondary IP in the ENI's subnet. The total number of prefixes and private IP addresses will be less than the
limit on private IPs allowed by your instance. The current preview will support a single /28 prefix per ENI. Knob toggle while pods are running or if
ENIs are attached is not supported. On toggling the knob, the node should be recycled to pick up the new kubelet max-pods value.
Contributor

I would suggest clearer terminology here than "Knob toggle" or "toggling the knob". Do you mean that the CNI does not support enabling or disabling this feature without replacing each node?

Contributor Author

Sure, I will reword it. What I mean here is that when the feature is enabled or disabled, max pods changes, hence a kubelet restart or a new node group with an updated max-pods value is needed.
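
As an illustration, here is a minimal Go sketch of why max pods changes when the feature is toggled, assuming the preview behaviour quoted above (a single /28 prefix per ENI) and the usual EKS-style allowance of 2 extra host-network pods; the instance limits in main are hypothetical.

```go
package main

import "fmt"

// maxPods sketches the max-pods difference between the two modes, assuming the
// preview behaviour (one /28 prefix per ENI) and the common EKS-style "+2" for
// host-network pods. eniLimit and ipsPerENI are hypothetical instance limits.
func maxPods(eniLimit, ipsPerENI int, prefixDelegation bool) int {
	if prefixDelegation {
		return eniLimit*16 + 2 // one /28 (16 addresses) per ENI in this preview
	}
	return eniLimit*(ipsPerENI-1) + 2 // secondary IPs; primary IP reserved per ENI
}

func main() {
	// Hypothetical instance with 3 ENIs and 10 IPs per ENI.
	fmt.Println("secondary IP mode:", maxPods(3, 10, false)) // 29
	fmt.Println("prefix delegation:", maxPods(3, 10, true))  // 50
}
```

Since the two modes yield different values, the kubelet's max-pods setting has to be updated (and the node recycled) when the knob is flipped.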

jayanthvn added a commit that referenced this pull request Jun 24, 2021
…ter (#1516)

* [Preview] Prefix delegation feature development (#1434)

* ENABLE_PREFIX_DELEGATION knob

WARM_PREFIX_TARGET knob

cr https://code.amazon.com/reviews/CR-40610031

* PD changes - dev only

* Cooldown prefix IP

* minor fixes to support prefix count

* Code cleanup

* Handle few corner cases

* Nitro based check

* With custom networking, do not get prefix for primary ENI

* Code refactor

* Handle graceful upgrade/enable PD from disable PD

* code refactoring

* Code refactoring

* fix computing too low IPs

* UT for prefix store

* Fix UTs and handle CR comments

* Clean up SDK code and fix model code generation

* fix format and merge induced error

* Merge broke the code

* Fix Dockerfile.test

* Added IPAMD UTs and fixed removeENI total count

* Couple more IPAMD UTs for PD

* UTs for awsutils/imds

* Handle graceful PD enable to disable knob

* get prefix list for non-pd case

* Prevent reconcile of prefix IPs in IP datastore

* Handle disable scenario

* fix formatting

* clean up comment

* Remove unnecessary debugs

* Handle PR comments

* formatting fix

* Remodelled PD datastore

* Fix up UTs and fix Prefix nil

* formatting

* PR comments - minor cosmetic changes

* removed the sdk override from makefile

* Internal repo merge added these lines

* Update config file

* Handle wrapper of DescribeNetworkInterfacesWithContext to take one eni

* RemoveUnusedENIFromStore was not accounting for prefixes deleted

* Removed hardcoding of 16

* Code refactor - merge ENI's secondary and prefix store into single store of CIDRs  (#1471)

* Code refactor - merge to single DB

* remove few debugs

* remove prefix store files

* PR comments

* Fix up CR comments

* formatting

* Updated UT cases

* UT and formatting

* Minor fixes

* Minor comments

* Updated /32 store term

* remove unused code

* Multi-prefix and WARM/MIN IP targets support with PD (#1477)

* Multi-pd and WARM targets support

* cleanup

* Updated variable names

* Default prefix count to -1

* Get stats should be computed on the fly since CIDR pool can have /32 or /28

* Support for warm prefix 0

* code review comments

* PD test cases and readme update (#1478)

* Traffic test case and readme update

* Added testcases for warm ip/min ip with PD

* Testcases for prefix count

* Testcase for warm prefix along with warm ip/min ip

* Updated traffic test case while PD mode is flipped

* Fix minor comments

* pr comments

* added pods per eni

* fix up count

* Support mixed instances with PD (#1483)

* Support mixed instances with PD

* fix up the log

* Optimization for prefixes allocation (#1500)

* optimization for prefixes

Prefix store optimization

* pr comment

* Fixup eni allocation with warm targets (#1512)

* Fixup eni allocation with warm targets

* fixup cidr count

* code comments and warm prefix 0

* Default WARM_PREFIX_TARGET to 1 (#1515)

* Handle prefix target 0

* pr comments

* Fix up UTs, was failing because of vendor

* No need to commit this

* make format

* needed for UT workflow

* IMDS code refactor

* PR comments - v1

Error with merge

* PR comments v2

* PR comments - v3

* Update logs

* PR comments - v4