Update rotating secrets docs #8948
Conversation
You need to reboot every node (using a rolling-update). You have to use `--cloudonly` because the keypair no longer matches.

```
kops rolling-update cluster --cloudonly --force --yes
```
I haven't tested these instructions, but does the entire cluster need to be rolled, or only the masters? Since the apiserver will essentially be down from the time you start deleting keys until the majority of the masters are rebooted, perhaps keeping the `--master-interval=10s` there would help minimize downtime.
If nodes do need to be replaced as well, perhaps we break it out into two rolling-update commands, with only the masters using `--cloudonly` and the nodes being rolled more gracefully.
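For illustration only (untested as part of this review, and the instance group names are placeholders), the split approach might look something like this:

```
# Sketch: roll only the masters with --cloudonly since the API is down anyway,
# then roll the nodes gracefully once the control plane is back.
# "master-us-east-1a" and "nodes" are placeholder instance group names.
kops rolling-update cluster --cloudonly --force --yes --instance-group master-us-east-1a

kops rolling-update cluster --force --yes --instance-group nodes
```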
I lost access to the API when I deleted all the secrets. I think all the nodes need to download the new certs and re-register with the cluster. I don't know whether a reboot would do that, though it could potentially be less disruptive.
Ah, so even after replacing just the masters, both you and the nodes still couldn't connect to the API. In that case it probably makes sense to rolling-update every master and node in the cluster all in one go.
Perhaps keep the `--master-interval=10s`? No point in rolling the masters slowly if they're inaccessible. I do see an advantage of keeping the default node-interval, since it allows pods to be replaced gradually rather than having all of the nodes down at once. We wouldn't respect pod disruption budgets, but losing one node's pods at a time is better than losing all of them. Maybe we keep `--master-interval=10s` and mention that users may want to set a shorter `--node-interval` depending on their workloads and tolerance for downtime. Thoughts?
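As a rough sketch of that suggestion (not tested here; the interval values are just examples), the documented command could become:

```
# Roll masters quickly since the API is unavailable until a quorum is replaced;
# pick a --node-interval that matches your tolerance for workload downtime.
kops rolling-update cluster --cloudonly --force --yes --master-interval=10s --node-interval=2m
```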
I would at least not keep the master interval. The masters are not really doing anything useful: DNS, the API server, etc. will all be broken. There is nothing on the masters that works without the API, and the API won't work until etcd is working again. I think you want to roll as fast as possible so new nodes can register (which requires quorum).
Masters typically take 5 min to roll. Rolling nodes faster than that won't really work, since the nodes won't register with the masters until the masters are all up.
Nodes typically take 2 min to roll. We could say that it may be wise to wait 2-3 min before rolling the nodes to minimize workload downtime.
Until we find a way to rotate this more automatically, I think one has to live with quite a bit of downtime here anyway.
Sorry, just to clarify: are you proposing the docs should include `--master-interval=10s`, or omit it and let it use the kops default? I think it would make sense to roll the masters as fast as possible, since none of them are functional until the majority of them have been replaced.
I think from the user's perspective it is easiest to just roll everything at once. Depending on workload one may want to do things differently, but you'd save 2-3 min out of maybe 10 min of downtime. So I would leave it as-is in the docs.
It would be a fun exercise to look at how we can automate this and do it more gracefully though.
Not rolling the CA, and keeping that key safer than the other certs (maybe not even having it in S3), would save a lot of pain, since the masters should accept the new certs immediately. I don't think you would have to roll all the tokens this way either.
```
pkill -f kube-controller-manager
kubectl delete pods --all --all-namespaces
```
Are the old pods still functional? I suppose any that depend on their service account tokens no longer work, but other pods would still be functional. I'm mostly thinking about how we can be as minimally disruptive as possible.
All pods that are not using the SA token are still functional. But we just rolled the entire cluster with `--cloudonly`, so I figured this would be easier and faster than the user having to look for pods that may be using the API or using the token for other purposes.
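If someone did want to be more selective, one rough way to narrow down candidates (only a sketch, and incomplete, since every pod mounts the default service account token unless automounting is disabled) would be to list pods by service account:

```
# List namespace, pod name, and service account for every pod, to spot
# workloads that likely rely on the (now rotated) SA tokens.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.serviceAccountName}{"\n"}{end}'
```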
OK, this looks good, and if we can figure out a less invasive way to perform the rotation we can always update this later. Thanks!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: olemarkus, rifelpet
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing