Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist IPAM state to local file and use across restarts #972

Merged
merged 1 commit into from
Jun 5, 2020

Conversation

anguslees
Copy link
Contributor

@anguslees anguslees commented May 12, 2020

Persist IPAM state to a file in /var/run (by default), and use this to
recover state across a restart. Note no state needs to be preserved
across a reboot, since all containers are also restarted.

Removes need[*] for docker/CRI and Kubernetes API from ipamd.

[*] But not "use" of docker/CRI :(

  • CRI is necessary for this release to handle upgrades from earlier
    versions - without requiring a reboot of the node. We can
    drop CRI for real in release after release that contains this PR.
  • The CNI K8S_POD_* arguments are still passed in the gRPC request(s)
    to ipamd. It is expected that pod name/namespace will be necessary
    at some point (to fetch pod annotations).

Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a fresh look at this!

pkg/ipamd/rpc_handler.go Outdated Show resolved Hide resolved
pkg/ipamd/rpc_handler.go Outdated Show resolved Hide resolved
pkg/ipamd/datastore/data_store.go Show resolved Hide resolved
@anguslees anguslees force-pushed the dek8s-ipamd branch 3 times, most recently from 37a5a6a to ac2f477 Compare May 19, 2020 08:19
@anguslees anguslees marked this pull request as ready for review May 19, 2020 08:20
@anguslees
Copy link
Contributor Author

I think this is ready for review fwiw.

Open questions:

  • upgrade migration. Assuming we don't want to reboot the node, then we need a way to migrate the existing containerID<->IPs allocations into the new persistent store (or some other strategy).
  • Verify what other CRIs do with CNI ContainerID arg vs K8S_POD_INFRA_CONTAINER_ID. I suspect we should switch to the former.

@anguslees
Copy link
Contributor Author

cc #714

@anguslees
Copy link
Contributor Author

anguslees commented May 22, 2020

Ok, there's now zero references to the CNI K8S_POD_* args in the codebase, and the IPAM state is stored against (network name, container ID, interface name) as recommended by CNI spec.

Also: I've audited the entire internet (dockershim, containerd, cri-o) and verified that every implementation of k8s/CNI sets "Container ID" to the same as "K8S_POD_INFRA_CONTAINER_ID". I asked on #sig-network and found no surprises there either.

I think this leaves:

  • upgrade (without node reboot). I'll have to think about this. It might mean restoring the CRI code for at least another minor release.

@anguslees anguslees force-pushed the dek8s-ipamd branch 2 times, most recently from 49affc4 to 378528f Compare June 4, 2020 02:32
@anguslees
Copy link
Contributor Author

Sigh.

Ok I've added back passing K8S_* from CNI to ipamd, even though they are then ignored. It's highly likely that we'll need to fetch (eg) pod annotations in the future, and this will make that future upgrade easier.

Also upgrade is now done by restoring (some) of the previous CRI code. Now, if the ipamd checkpoint file is missing (ie: not-found -> upgrade or reboot) then we fallback to querying CRI to discover any in-use Pod IPs. In the next release after this we'll be able to remove that and only rely on the checkpoint file (ie: not-found -> reboot only). For future reference, the code that should be removed is: everything covered by BackfillMissingCheckpointFromCRI constant, and the CRI socket references in the k8s manifest.

I have also updated PR description at the top to match current PR.

Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome at a first glance, will take a closer look in the morning.


// Type is the plugin type
Type string `json:"type"`
types.NetConf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +38 to +39
resources: ["eniconfigs"],
verbs: ["get", "list", "watch"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a separate PR... but I'm ok with updating it here. Will resolve #846 I presume.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumbs-up this comment if you want me to break it out into a separate PR. This whole PR basically started from being surprised by the k8s manifest, and I worked backwards :P

pkg/ipamd/rpc_handler.go Show resolved Hide resolved
Comment on lines +176 to +178
// AssignedIPv4Addresses is the number of IP addresses already assigned
func (p *ENIPool) AssignedIPv4Addresses() int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Slightly confusing to have two functions named AssignedIPv4Addresses(), one on ENIIPPool and one on ENIPool. Even the names of those structs are a bit confusing I think...

Copy link
Contributor Author

@anguslees anguslees Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are a few function names that are duplicated - with the idea that there are three nested levels AddressInfo < ENI (ENIIPPool) < all ENIs (ENIPool) and the methods sum/aggregate the similar method across the lower level.

I completely agree that ENIIPPool and ENIPool are terribly confusing names. I was trying to avoid changing the original type names, even though I knew I should do so :P I will rename ENIIPPool to ENI so it clearly suggests a single ENI (which is what it represents). (Edit: Done.)

pkg/ipamd/datastore/data_store.go Show resolved Hide resolved
Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome and a long due clean up, thanks a lot @anguslees! Will give others a chance to take a look at this as well before merging.

Comment on lines -105 to -106
// AssignedIPv4Addresses is the number of IP addresses already been assigned
AssignedIPv4Addresses int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of this cleans up another of my TODOs... 😄

Comment on lines +300 to +303
// Assume that no file == no containers are
// currently in use, eg a fresh reboot just
// cleared everything out. This is ok, and a
// no-op.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not totally sure that we can ever stop checking the CRI. What if a user removes the file in /var/run and then restart the CNI while pods are running?

Copy link
Contributor Author

@anguslees anguslees Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure .. but at some point we need to stop trying to outcompete malicious users. (What if we continue checking CRI, but they remove the CRI socket or replace kubelet binary with a modified one that does something odd?)

After the upgrade to this CNI, I think the only way to be out of sync is if the kubelet is behaving incorrectly (or deliberate user intervention, as you suggest). In the worst case, a reboot will reset everything (kubelet, CRI, and ipam.json) to empty.

If you're still concerned, we should just drop most of this PR (I'm ok with that). There's no point checkpointing to disk and still querying CRI every time. The entire motivation for this PR was to (eventually) remove the need to expose CRI endpoint to the pod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I'd really like to get rid of having to rely on the CRI. It's definitely better not to treat worker nodes as pets. 😄

Comment on lines +309 to +311
if data.Version != CheckpointFormatVersion {
return fmt.Errorf("datastore: unknown backing store format (%s != %s) - wrong CNI/ipamd version? (Rebooting this node will restart local pods and probably help)", data.Version, CheckpointFormatVersion)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rebuild it from the CRI for this case instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that means we're at war with our future selves.

The point of this format version check is so we can introduce a future incompatible change, and make sure that we don't silently ignore whatever that incompatibility is. If we do silently ignore it here, then we'll need to invent another mechanism to make sure we don't. Repeat. (Incidentally, if CRI is sufficient/acceptable, then we never needed the checkpoint in the first place).

Comment on lines +483 to +485
if addr.Assigned() {
panic("addr already assigned")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this shouldn't happen, but should we really panic here and restart the whole pod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can:

  • return an error at runtime, which (eventually) results in the pod restarting as well.
  • ignore it and overwrite an existing pod ip
  • ignore it and incorrectly set up the new pod
  • remove the check so we don't see a panic() in the code, document the expected invariant, and trust that callers never invoke this function for an already assigned ipamKey
  • (current version) assert the expected invariant explicitly in code, and thus know that no callers invoke this function for an already assigned ipamKey.

Which would you like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safest would be to panic for sure, since we should never be in this state. It would give a strong signal that something is very wrong in the current state of the aws-node pod.

@@ -84,7 +84,6 @@ func (c *IPAMContext) setupIntrospectionServer() *http.Server {
serverFunctions := map[string]func(w http.ResponseWriter, r *http.Request){
"/v1/enis": eniV1RequestHandler(c),
"/v1/eni-configs": eniConfigRequestHandler(c),
"/v1/pods": podV1RequestHandler(c),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should update the log collector script as well if we are removing this introspection data:

https://github.com/awslabs/amazon-eks-ami/blob/master/log-collector-script/linux/eks-log-collector.sh#L79

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awslabs/amazon-eks-ami#487 - in draft until this PR is finalised.


// Specify where ipam should persist its current IP<->container allocations.
envBackingStorePath = "AWS_VPC_K8S_CNI_BACKING_STORE"
defaultBackingStorePath = "/var/run/aws-routed-eni/ipam.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was never a fan of the aws-routed-eni folder name for logs, but I can see why you would like to keep them the same. That said, we already have a complete mix of names in both binaries and folders:aws-cni for the CNI, aws-k8s-agent for ipamd, aws-node for the daemonset...

Is /var/run/aws-node/ipamd.json better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@mogren mogren requested review from jaypipes and SaranBalaji90 June 5, 2020 04:01
Persist IPAM state to a file in /var/run (by default), and use this to
recover state across a restart.  Note no state needs to be preserved
across a reboot, since all containers are also restarted.

Removes need[*] for docker/CRI and Kubernetes API from ipamd.

[*] But not "use" of docker/CRI :(
- CRI is necessary for this release to handle upgrades from earlier
  versions - without requiring a reboot of the node.  We can
  drop CRI for real in release after release that contains this PR.
- The CNI K8S_POD_* arguments are still passed in the gRPC request(s)
  to ipamd.  It is expected that pod name/namespace _will_ be necessary
  at some point (to fetch pod annotations).
anguslees added a commit to anguslees/amazon-eks-ami that referenced this pull request Jun 5, 2020
L-IPAMD now checkpoints IPAM state to a file.  Collect
/var/run/aws-node/ipam.json file (if present) for debugging.

See also aws/amazon-vpc-cni-k8s#972
Comment on lines -48 to -50
discoverController := k8sapi.NewController(kubeClient)
go discoverController.DiscoverLocalK8SPods()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of this, and just reading it from the CRI socket will resolve #711 and replace #738 as well...

Copy link
Contributor

@jayanthvn jayanthvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :)

mogren pushed a commit to awslabs/amazon-eks-ami that referenced this pull request Jun 8, 2020
L-IPAMD now checkpoints IPAM state to a file.  Collect
/var/run/aws-node/ipam.json file (if present) for debugging.

See also aws/amazon-vpc-cni-k8s#972
@uthark
Copy link
Contributor

uthark commented Jun 8, 2020

@mogren will you cherry-pick it to 1.6.x branch?

@mogren
Copy link
Contributor

mogren commented Jun 8, 2020

@uthark I think this change is a bit too big for a patch release, so instead planning to get #903 updated and merged, then start preparing a v1.7.0-rc1 release for testing. The config files are all updated because of #955 and #986, so a lot of changes coming.

@mogren mogren added this to the v1.7.0 milestone Jun 24, 2020
@anguslees anguslees deleted the dek8s-ipamd branch July 22, 2020 08:05
cryptosguru added a commit to cryptosguru/amazon that referenced this pull request May 9, 2022
L-IPAMD now checkpoints IPAM state to a file.  Collect
/var/run/aws-node/ipam.json file (if present) for debugging.

See also aws/amazon-vpc-cni-k8s#972
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants