# [Preview] Prefix delegation feature development #1434
`.gitignore`:

```
@@ -11,4 +11,5 @@ portmap
 grpc-health-probe
 cni-metrics-helper
 coverage.txt
 build/
+vendor
```
@@ -450,6 +450,30 @@

You can use the below command to enable `DISABLE_TCP_EARLY_DEMUX` to `true` -

```
kubectl patch daemonset aws-node -n kube-system -p '{"spec": {"template": {"spec": {"initContainers": [{"env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}],"name":"aws-vpc-cni-init"}]}}}}'
```

---
`ENABLE_PREFIX_DELEGATION` (Since v1.9)

Type: Boolean as a String

Default: `false`

Enables IPv4 prefix delegation on Nitro instances. Setting `ENABLE_PREFIX_DELEGATION` to `true` will allocate a /28 prefix
instead of a secondary IP in the ENI's subnet. The total number of prefixes and private IP addresses will be less than the
limit on private IPs allowed by your instance. The current preview supports a single /28 prefix per ENI. Enabling or disabling this feature while pods are running or
ENIs are attached is not supported. After toggling the flag, the node should be recycled so the new kubelet max-pods value takes effect.
> **Reviewer:** I would suggest clearer terminology here than "Knob toggle" or "toggling the knob". Do you mean that the CNI does not support enabling or disabling this feature without replacing each node?
>
> **jayanthvn:** Sure, will reword it. I mean that when the feature is enabled or disabled, max pods will change, so a kubelet restart or a new node group with the updated max-pods value will be needed.
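As a concrete illustration (this command is not part of the PR; `kubectl set env` is just one convenient way to set the variable on the `aws-node` daemonset, by analogy with the `kubectl patch` example above):

```
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

Per the caveat above, nodes would still need to be recycled afterwards so that kubelet picks up the new max-pods value.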
---

`WARM_PREFIX_TARGET`

Type: Integer

Default: None
Specifies the number of free IPv4 (/28) prefixes that the `ipamd` daemon should attempt to keep available for pod assignment on the node.
This environment variable overrides `WARM_ENI_TARGET`, `WARM_IP_TARGET` and `MINIMUM_IP_TARGET`, and takes effect when `ENABLE_PREFIX_DELEGATION`
is set to `true`. The current preview release supports a single /28 prefix per ENI, hence setting this will cause additional ENIs to be allocated.

> **asheldon:** Thank you for writing this preview documentation! I don't know what it means to have a free prefix. Suppose `WARM_PREFIX_TARGET == 1`. At startup, two prefixes will be allocated: [A, B]. When the 17th pod is started, a third prefix will be allocated, [A, B, C]. Then when a 33rd pod is started, a fourth prefix will be allocated, [A, B, C, D]. If all pods but one are terminated in [A] and in [B], leaving three running pods, will the CNI return prefix D or keep it around as an available contiguous range? I don't see any obvious value to having a contiguous free /28 available if there are plenty of available IPs in other existing prefixes, which is why I think this should logically behave like `WARM_IP_TARGET`.
>
> **jayanthvn:** Hi @asheldon, thanks for your feedback. With this feature (`ENABLE_PREFIX_DELEGATION=true`), we cannot support fine-grained `WARM_IP_TARGET` or `MINIMUM_IP_TARGET`. Yes, it is logically equivalent in the way you describe. Agreed on checking the number of IPs and then allocating a prefix only if needed; I am working on that optimization, but we cannot have fine-grained WARM_IP or MINIMUM_IP with this feature.
>
> **asheldon:** Hi @jayanthvn, thank you for the reply. I think nodes with low-density packing should require as few as one prefix, and nodes with high-density packing should have as many prefixes as required, more or less. The way I imagine this would work is different, and basically just treats `WARM_IP_TARGET` as a signal to IPAMD of how many IPs to keep in reserve, like it is now. Nodes still consume a minimum of 1+16 IPs, but this gives cluster operators the ability to hint to the CNI how much buffer they really need, at a level that is lower than a single prefix, or more than a single prefix (but not an exact multiple of 16).
>
> **jayanthvn:** Regarding "the minimum will be two prefixes unless you give up the ability for a node to have more than one prefix worth of pods. Having each node use a minimum of 1+32 IPs makes total IP consumption on the network grow very fast" - the number of pods that can be scheduled on the nodes is not changed. Do you mean something like: if we set `WARM_PREFIX_TARGET = 1` and one node has 10 pods, then that node will have 1 new prefix + 6 IPs available, while another node with just one pod will still have 1 new prefix + 15 IPs available? If the pod density per node is not high, then `WARM_PREFIX_TARGET` can be set to 0. Allocating new prefixes is faster than allocating an ENI, so it mostly shouldn't impact pod launch time much (I still need to measure performance). I agree this would be a good enhancement, but I don't think we can have it as `WARM_IP_TARGET`, since that would change the definition from what we previously supported. As you mentioned, `WARM_IP_TARGET` is the number of free IPs in the datastore; if we support it for PD, it would be similar to `MINIMUM_IP_TARGET`, because at the minimum a node can have either `WARM_IP_TARGET` IPs or (`WARM_IP_TARGET` + X), where X can go up to a maximum of 16. In the example you mentioned, if 11 are consumed then we will have [15 IPs + 6 (`WARM_IP_TARGET`)] in the warm pool.
>
> **asheldon:** Hi @jayanthvn, thanks again for your response and your time. What I've seen in EKS is that when a pod is scheduled on a node that has 0 available IPs in IPAM, the pod gets into a bad state with no IP and has to be manually deleted. This can be triggered with scenarios like `WARM_IP_TARGET=1` and scheduling just two pods at the same time. Because of this behavior, I don't feel comfortable running nodes in configurations that might temporarily reach 0 available IPs. This includes `WARM_IP_TARGET=0` or `WARM_PREFIX_TARGET=0`. I did not understand that the design here was to just-in-time allocate an additional prefix when all existing prefixes were exhausted and additional IPs were required, and I don't trust that behavior. I don't want to allocate the additional prefix at the last possible moment, but I also don't want to always allocate an extra 16 IPs. `WARM_IP_TARGET` is an in-between: I want to allocate the additional prefix when I'm running low on available IPs, like 6 IPs, not 0 or 16.
>
> When `WARM_IP_TARGET` is set below the maximum number of IPs on a single ENI, it balances the need to start pods rapidly against IP consumption. We can start at least `WARM_IP_TARGET` pods at any time, but we also don't consume an entire ENI worth of IPs (up to 50!) to make this guarantee. When we use `WARM_IP_TARGET` or `MINIMUM_IP_TARGET` today, the CNI determines how many ENIs it needs to meet that need and allocates/deallocates ENIs on demand. When `WARM_IP_TARGET` and `MINIMUM_IP_TARGET` are used in combination with `ENABLE_PREFIX_DELEGATION`, I believe it should work the same: the CNI determines how many PDs are required to meet these targets and allocates that many PDs. There will be at least `WARM_IP_TARGET` unused IPs in each PD, but the option may save entire prefixes worth of IP space.
>
> I'm not following you here. If I enable prefix delegation and set `MINIMUM_IP_TARGET=36` and `WARM_IP_TARGET=4`, then I expect to always have at least three prefixes (to meet `MINIMUM_IP_TARGET` with 3*16=48 IPs). If I run 45 pods on this node, I would fall to only 3 available IPs, and would expect to allocate a fourth prefix, bringing my usage to 45 pods / 64 IPs. In the prior example, there are 11 pods running and a `WARM_IP_TARGET` of 6. Since a single PD contains only 16 IPs, there would be only 5 available IPs left and a second PD would be allocated. The minimum number of PDs consumed by the node would be equal to CEIL(WARM_IP_TARGET/16), so the minimum number of IPs consumed would be CEIL(WARM_IP_TARGET/16)*16. Unless `WARM_PREFIX_TARGET` is set to 0, it always consumes the same or more IPs than what I am suggesting, but when it is set to 0, it can reach 0 available IPs, which is very low.
>
> **jayanthvn:** Thanks a lot Aaron for the detailed explanation. As I mentioned, this is a very nice enhancement and I would like to take it up. I will open a follow-up PR to add this support. This will be a good customer experience. Really appreciate your feedback and time on this :)
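To make the arithmetic in this thread concrete, here is a minimal sketch of the reviewer's proposed `WARM_IP_TARGET`-style accounting (a hypothetical illustration, not code from this PR or the CNI): the number of /28 prefixes follows from pods in use plus the desired IP buffer, rounded up to whole prefixes of 16.

```
# Hypothetical sketch of the proposed accounting; not the CNI's implementation.
# prefixes_needed = ceil(max(pods + WARM_IP_TARGET, MINIMUM_IP_TARGET) / 16)
pods=45; warm_ip_target=4; minimum_ip_target=36
required=$(( pods + warm_ip_target ))                 # IPs needed: 49
(( required < minimum_ip_target )) && required=$minimum_ip_target
echo $(( (required + 15) / 16 ))                      # 4 prefixes -> 64 IPs for 45 pods
```

With `pods=0`, the same math yields 3 prefixes (48 IPs), matching the reviewer's startup expectation for `MINIMUM_IP_TARGET=36` and `WARM_IP_TARGET=4`.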
### ENI tags related to Allocation
> **Reviewer:** Just to check my understanding, this means that each secondary ENI will consume 16 IPs from its VPC, all of which will be available for use by pods.
>
> **jayanthvn:** Yes, that's right.
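As a back-of-the-envelope illustration (an assumption based on the preview's single-/28-per-ENI limit, not an official formula), pod-usable IPs scale with the number of ENIs carrying a delegated prefix:

```
# Assumed arithmetic for the preview (one /28 prefix per ENI): each ENI with a
# delegated prefix contributes 16 pod-usable VPC IPs.
enis_with_prefix=3
echo $(( enis_with_prefix * 16 ))   # 48 pod IPs consumed from the VPC
```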