ETCD backup/restore & cluster upgrade #1805

jefflill · 2023-06-20T15:19:34Z

Some thoughts and links for these topics.

I was out on a drive yesterday and pulled over to research ETCD backup/restore options on my phone, with the goal of making a cluster with just a single control-plane node more resilient in the cloud. This looks very doable with the etcdctl CLI. We could take a full backup to S3 (or similar storage) every hour and log transactions in between, so the copy in S3 should be very close to up to date at all times.
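A minimal sketch of the hourly full backup, assuming etcdctl is on the PATH and ETCD uses TLS client certificates. The endpoint and certificate paths below are placeholders in the style of a kubeadm layout, not something confirmed for our clusters, and the helper names are made up for illustration. Uploading the file to S3 and logging transactions between snapshots are left out.

```python
import subprocess
from datetime import datetime, timezone

# ASSUMPTIONS: placeholder endpoint and certificate paths; adjust per cluster.
ENDPOINT = "https://127.0.0.1:2379"
CACERT = "/etc/kubernetes/pki/etcd/ca.crt"
CERT = "/etc/kubernetes/pki/etcd/server.crt"
KEY = "/etc/kubernetes/pki/etcd/server.key"


def snapshot_filename(now: datetime) -> str:
    """Return a sortable, UTC-timestamped snapshot file name."""
    return now.astimezone(timezone.utc).strftime("etcd-%Y%m%dT%H%M%SZ.db")


def snapshot_save_command(target: str) -> list:
    """Build the argv for `etcdctl snapshot save <target>`."""
    return [
        "etcdctl",
        f"--endpoints={ENDPOINT}",
        f"--cacert={CACERT}",
        f"--cert={CERT}",
        f"--key={KEY}",
        "snapshot", "save", target,
    ]


def take_snapshot(backup_dir: str, now: datetime, run=subprocess.run) -> str:
    """Save a full snapshot into backup_dir and return the file path.

    `run` is injectable so the command can be inspected without a live ETCD.
    """
    target = f"{backup_dir}/{snapshot_filename(now)}"
    run(snapshot_save_command(target), check=True)
    return target
```

Something like cron or a systemd timer could call `take_snapshot("/var/backups/etcd", datetime.now(timezone.utc))` once an hour and then push the resulting file to S3.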

Then, if the cloud relocates the VM to a new host and there's a problem with the ETCD data (or it gets corrupted some other way), we could reload the data from the backup. We'd need to stop ETCD (and probably the API server) while we do this and restart them afterwards, but that should only take a minute or two. Whatever is already running on the cluster will keep running, so most user-facing services shouldn't see much impact.
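The reload could look roughly like the sketch below. It uses `etcdctl snapshot restore` with `--data-dir`, which restores into a fresh directory (newer ETCD releases steer you toward `etcdutl snapshot restore` for the same job). How ETCD and the API server get stopped and started depends on how the node runs them, so those commands are passed in by the caller; the function name and the paths in the usage note are hypothetical.

```python
import subprocess


def snapshot_restore_command(snapshot: str, data_dir: str) -> list:
    """Build the argv for `etcdctl snapshot restore` into a new data dir."""
    return ["etcdctl", "snapshot", "restore", snapshot, f"--data-dir={data_dir}"]


def restore_from_snapshot(snapshot, data_dir, stop_cmd, start_cmd,
                          run=subprocess.run):
    """Stop services, restore the snapshot, then start services again.

    stop_cmd and start_cmd are argv lists supplied by the caller
    (ASSUMPTION: they stop/start ETCD and the API server on this node).
    With the default runner, any failing command raises and aborts the
    sequence, so services are not restarted on top of a failed restore.
    """
    run(stop_cmd, check=True)
    run(snapshot_restore_command(snapshot, data_dir), check=True)
    run(start_cmd, check=True)
```

For example, `restore_from_snapshot("/var/backups/etcd/latest.db", "/var/lib/etcd-restored", stop_cmd, start_cmd)`. ETCD would then need to be pointed at the restored data directory before it starts.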

We might need to do something similar when we need to upgrade ETCD in the future. I did some reading about that too. ETCD does support in-place upgrades, but you need to step through every minor version between what you have and where you want to end up, which is a pain. So the best approach might be to:

1. shut down the API servers on all masters
2. back up ETCD on each of the masters
3. upgrade ETCD with no data
4. restore the backup
5. restart the API servers
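The ordering above can be sketched as a small plan runner. This is only an illustration of the sequencing: the five per-master actions are hypothetical callables the caller would supply, and the sketch assumes the upgrade and restore steps are also done on each master, which the list above doesn't spell out. The runner stops at the first failure so a later step never runs on top of a broken earlier one.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def upgrade_plan(masters, stop_api, backup, upgrade, restore, start_api) -> List[Step]:
    """Expand the five phases into ordered per-master steps.

    Each phase finishes on every master before the next phase begins.
    The action arguments are hypothetical callables taking a master name.
    """
    phases = [
        ("shut down API server on", stop_api),
        ("back up ETCD on", backup),
        ("upgrade ETCD with no data on", upgrade),
        ("restore backup on", restore),
        ("restart API server on", start_api),
    ]
    return [
        (f"{label} {m}", lambda m=m, action=action: action(m))
        for label, action in phases
        for m in masters
    ]


def run_steps(steps: List[Step], log=print) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; return (completed step names, failed step or None)."""
    completed = []
    for name, action in steps:
        log(f"==> {name}")
        try:
            action()
        except Exception as exc:
            log(f"!! {name} failed: {exc}; stopping")
            return completed, name
        completed.append(name)
    return completed, None
```

Passing logging-only callables first gives a dry run of the whole sequence before anything touches a real master.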

Here are some links discussing this:

https://goteleport.com/blog/kubernetes-and-offline-etcd-upgrades/
https://github.com/etcd-io/etcd/blob/main/etcdctl/README.md

jefflill added the neon-kube label (Related to our Kubernetes distribution) on Jun 20, 2023
