Update rotating secrets docs #8948
Conversation
You need to reboot every node (using a rolling-update). You have to use `--cloudonly` because the keypair no longer matches.

```
kops rolling-update cluster --cloudonly --force --yes
```
I haven't tested these instructions, but does the entire cluster need to be rolled, or only the masters? Since the apiserver will essentially be down from the time you start deleting keys until the majority of the masters are rebooted, perhaps keeping the `--master-interval=10s` there would help minimize downtime.
If nodes do need to be replaced as well, perhaps we break it out into two rolling-update commands, with only the masters using `--cloudonly` and the nodes being rolled more gracefully.
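For illustration only (untested as part of this review, and the instance group names are placeholders), the split approach might look something like this:

```
# Sketch: roll only the masters with --cloudonly since the API is down anyway,
# then roll the nodes gracefully once the control plane is back.
# "master-us-east-1a" and "nodes" are placeholder instance group names.
kops rolling-update cluster --cloudonly --force --yes --instance-group master-us-east-1a

kops rolling-update cluster --force --yes --instance-group nodes
```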
I lost access to the API when I deleted all the secrets. I think all the nodes need to download the new certs and re-register with the cluster. I don't know whether a reboot would do that, though it could potentially be less disruptive.
Ah, so even after replacing just the masters, both you and the nodes still couldn't connect to the API. In that case it probably makes sense to rolling-update every master and node in the cluster all in one go.
Perhaps keep the `--master-interval=10s`? No point in rolling the masters slowly if they're inaccessible. I do see an advantage of keeping the default node-interval, since it allows pods to be replaced gradually rather than having all of the nodes down at once. We wouldn't respect pod disruption budgets, but losing one node's pods at a time is better than losing all of them. Maybe we keep `--master-interval=10s` and mention that users may want to set a shorter `--node-interval` depending on their workloads and tolerance for downtime. Thoughts?
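As a rough sketch of that suggestion (not tested here; the interval values are just examples), the documented command could become:

```
# Roll masters quickly since the API is unavailable until a quorum is replaced;
# pick a --node-interval that matches your tolerance for workload downtime.
kops rolling-update cluster --cloudonly --force --yes --master-interval=10s --node-interval=2m
```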
I would at least not keep the master interval. The masters are not really doing anything useful: DNS, the API server, etc. will all be broken. There is nothing on the masters that works without the API, and the API won't work until etcd is working again. I think you want to roll as fast as possible so new nodes can register (which requires quorum).
Masters typically take 5 min to roll. Rolling nodes faster than that won't really work, since the nodes won't register with the masters until the masters are all up.
Nodes typically take 2 min to roll. We could say that it may be wise to wait 2-3 min before rolling the nodes to minimize workload downtime.
Until we find a way to rotate this more automatically, I think one has to live with quite a bit of downtime here anyway.
Sorry, just to clarify: are you proposing the docs should include `--master-interval=10s`, or omit it and let it use the kops default? I think it would make sense to roll the masters as fast as possible, since none of them are functional until the majority of them have been replaced.
I think from the user's perspective it is easiest to just roll everything at once. Depending on workload one may want to do things differently, but you'd save 2-3 min out of maybe 10 min of downtime. So I would leave it as-is in the docs.
It would be a fun exercise to look at how we can automate this and do it more gracefully though.
Not rolling the CA, and keeping that key safer than the other certs (maybe not even having it in S3), would save a lot of pain, since the masters should accept the new certs immediately. I don't think you would have to roll all the tokens this way either.
```
pkill -f kube-controller-manager
kubectl delete pods --all --all-namespaces
```
Are the old pods still functional? I suppose any that depend on their service account tokens no longer work, but other pods would still be functional. I'm mostly thinking about how we can be as minimally disruptive as possible.
All pods that are not using the SA token are still functional. But we just rolled the entire cluster with `--cloudonly`, so I figured this would be easier and faster than the user having to look for pods that may be using the API or using the token for other purposes.
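If someone did want to be more selective, one rough way to narrow down candidates (only a sketch, and incomplete, since every pod mounts the default service account token unless automounting is disabled) would be to list pods by service account:

```
# List namespace, pod name, and service account for every pod, to spot
# workloads that likely rely on the (now rotated) SA tokens.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.serviceAccountName}{"\n"}{end}'
```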
OK, this looks good, and if we can figure out a less invasive way to perform the rotation we can always update this later. Thanks!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: olemarkus, rifelpet
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing