-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Persist IPAM state to local file and use across restarts #972
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a fresh look at this!
37a5a6a
to
ac2f477
Compare
I think this is ready for review fwiw. Open questions:
|
cc #714 |
Ok, there's now zero references to the CNI Also: I've audited the entire internet (dockershim, containerd, cri-o) and verified that every implementation of k8s/CNI sets "Container ID" to the same as "K8S_POD_INFRA_CONTAINER_ID". I asked on #sig-network and found no surprises there either. I think this leaves:
|
49affc4
to
378528f
Compare
Sigh. Ok I've added back passing K8S_* from CNI to ipamd, even though they are then ignored. It's highly likely that we'll need to fetch (eg) pod annotations in the future, and this will make that future upgrade easier. Also upgrade is now done by restoring (some) of the previous CRI code. Now, if the ipamd checkpoint file is missing (ie: not-found -> upgrade or reboot) then we fallback to querying CRI to discover any in-use Pod IPs. In the next release after this we'll be able to remove that and only rely on the checkpoint file (ie: not-found -> reboot only). For future reference, the code that should be removed is: everything covered by I have also updated PR description at the top to match current PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome at a first glance, will take a closer look in the morning.
|
||
// Type is the plugin type | ||
Type string `json:"type"` | ||
types.NetConf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
resources: ["eniconfigs"], | ||
verbs: ["get", "list", "watch"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a separate PR... but I'm ok with updating it here. Will resolve #846 I presume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thumbs-up this comment if you want me to break it out into a separate PR. This whole PR basically started from being surprised by the k8s manifest, and I worked backwards :P
// AssignedIPv4Addresses is the number of IP addresses already assigned | ||
func (p *ENIPool) AssignedIPv4Addresses() int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Slightly confusing to have two functions named AssignedIPv4Addresses()
, one on ENIIPPool
and one on ENIPool
. Even the names of those structs are a bit confusing I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. There are a few function names that are duplicated - with the idea that there are three nested levels AddressInfo
< ENI (ENIIPPool
) < all ENIs (ENIPool
) and the methods sum/aggregate the similar method across the lower level.
I completely agree that ENIIPPool and ENIPool are terribly confusing names. I was trying to avoid changing the original type names, even though I knew I should do so :P I will rename ENIIPPool
to ENI
so it clearly suggests a single ENI (which is what it represents). (Edit: Done.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome and a long due clean up, thanks a lot @anguslees! Will give others a chance to take a look at this as well before merging.
// AssignedIPv4Addresses is the number of IP addresses already been assigned | ||
AssignedIPv4Addresses int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting rid of this cleans up another of my TODOs... 😄
// Assume that no file == no containers are | ||
// currently in use, eg a fresh reboot just | ||
// cleared everything out. This is ok, and a | ||
// no-op. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not totally sure that we can ever stop checking the CRI. What if a user removes the file in /var/run and then restart the CNI while pods are running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure .. but at some point we need to stop trying to outcompete malicious users. (What if we continue checking CRI, but they remove the CRI socket or replace kubelet binary with a modified one that does something odd?)
After the upgrade to this CNI, I think the only way to be out of sync is if the kubelet is behaving incorrectly (or deliberate user intervention, as you suggest). In the worst case, a reboot will reset everything (kubelet, CRI, and ipam.json) to empty.
If you're still concerned, we should just drop most of this PR (I'm ok with that). There's no point checkpointing to disk and still querying CRI every time. The entire motivation for this PR was to (eventually) remove the need to expose CRI endpoint to the pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I'd really like to get rid of having to rely on the CRI. It's definitely better not to treat worker nodes as pets. 😄
if data.Version != CheckpointFormatVersion { | ||
return fmt.Errorf("datastore: unknown backing store format (%s != %s) - wrong CNI/ipamd version? (Rebooting this node will restart local pods and probably help)", data.Version, CheckpointFormatVersion) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rebuild it from the CRI for this case instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but that means we're at war with our future selves.
The point of this format version check is so we can introduce a future incompatible change, and make sure that we don't silently ignore whatever that incompatibility is. If we do silently ignore it here, then we'll need to invent another mechanism to make sure we don't. Repeat. (Incidentally, if CRI is sufficient/acceptable, then we never needed the checkpoint in the first place).
if addr.Assigned() { | ||
panic("addr already assigned") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this shouldn't happen, but should we really panic here and restart the whole pod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can:
- return an error at runtime, which (eventually) results in the pod restarting as well.
- ignore it and overwrite an existing pod ip
- ignore it and incorrectly set up the new pod
- remove the check so we don't see a
panic()
in the code, document the expected invariant, and trust that callers never invoke this function for an already assigned ipamKey - (current version) assert the expected invariant explicitly in code, and thus know that no callers invoke this function for an already assigned ipamKey.
Which would you like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safest would be to panic for sure, since we should never be in this state. It would give a strong signal that something is very wrong in the current state of the aws-node pod.
@@ -84,7 +84,6 @@ func (c *IPAMContext) setupIntrospectionServer() *http.Server { | |||
serverFunctions := map[string]func(w http.ResponseWriter, r *http.Request){ | |||
"/v1/enis": eniV1RequestHandler(c), | |||
"/v1/eni-configs": eniConfigRequestHandler(c), | |||
"/v1/pods": podV1RequestHandler(c), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should update the log collector script as well if we are removing this introspection data:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awslabs/amazon-eks-ami#487 - in draft until this PR is finalised.
pkg/ipamd/ipamd.go
Outdated
|
||
// Specify where ipam should persist its current IP<->container allocations. | ||
envBackingStorePath = "AWS_VPC_K8S_CNI_BACKING_STORE" | ||
defaultBackingStorePath = "/var/run/aws-routed-eni/ipam.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was never a fan of the aws-routed-eni
folder name for logs, but I can see why you would like to keep them the same. That said, we already have a complete mix of names in both binaries and folders:aws-cni
for the CNI, aws-k8s-agent
for ipamd, aws-node
for the daemonset...
Is /var/run/aws-node/ipamd.json
better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Persist IPAM state to a file in /var/run (by default), and use this to recover state across a restart. Note no state needs to be preserved across a reboot, since all containers are also restarted. Removes need[*] for docker/CRI and Kubernetes API from ipamd. [*] But not "use" of docker/CRI :( - CRI is necessary for this release to handle upgrades from earlier versions - without requiring a reboot of the node. We can drop CRI for real in release after release that contains this PR. - The CNI K8S_POD_* arguments are still passed in the gRPC request(s) to ipamd. It is expected that pod name/namespace _will_ be necessary at some point (to fetch pod annotations).
L-IPAMD now checkpoints IPAM state to a file. Collect /var/run/aws-node/ipam.json file (if present) for debugging. See also aws/amazon-vpc-cni-k8s#972
discoverController := k8sapi.NewController(kubeClient) | ||
go discoverController.DiscoverLocalK8SPods() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting rid of this, and just reading it from the CRI socket will resolve #711 and replace #738 as well...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good :)
L-IPAMD now checkpoints IPAM state to a file. Collect /var/run/aws-node/ipam.json file (if present) for debugging. See also aws/amazon-vpc-cni-k8s#972
@mogren will you cherry-pick it to 1.6.x branch? |
L-IPAMD now checkpoints IPAM state to a file. Collect /var/run/aws-node/ipam.json file (if present) for debugging. See also aws/amazon-vpc-cni-k8s#972
Persist IPAM state to a file in /var/run (by default), and use this to
recover state across a restart. Note no state needs to be preserved
across a reboot, since all containers are also restarted.
Removes need[*] for docker/CRI and Kubernetes API from ipamd.
[*] But not "use" of docker/CRI :(
versions - without requiring a reboot of the node. We can
drop CRI for real in release after release that contains this PR.
to ipamd. It is expected that pod name/namespace will be necessary
at some point (to fetch pod annotations).