Make PodDisruptionBudget configurable #547
The idea behind requiring at least one master via the Pod Disruption Budget is to avoid multiple failovers during rolling upgrades: you want Patroni to handle the node updates and switchover, and not the K8s cluster manager to start killing your master nodes. The need to reconnect is a lesser evil; the bigger issue to avoid is multiple failovers for one cluster throughout the rolling update (that happens if the new primary elected by Patroni after the abrupt killing of the old one ends up on an old node).
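For reference, the PDB in question looks roughly like the sketch below. The name, the cluster name, and the exact label keys are assumptions based on how Spilo pods are usually labelled, not a verbatim copy of what the operator generates:

```yaml
# Minimal sketch of a PDB that keeps the master pod from being evicted.
# Names and label values are illustrative placeholders.
apiVersion: policy/v1        # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: postgres-acid-minimal-cluster-pdb   # hypothetical cluster name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      spilo-role: master                    # assumed role label set by Spilo/Patroni
      cluster-name: acid-minimal-cluster    # assumed cluster label
```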
OK, so how would you evict the master pod, e.g. when draining the node it's running on? Would you first actively call Patroni to elect another master?
I did some tests with kind recently, and there were two ways to evict the master pod:
In both cases, it failed over to a replica on another node.
Yes, the idea when this functionality was implemented was to use the second option. You could, theoretically, just mark the nodes unschedulable, but that would make the operator move multiple pods at the same time without any guarantee that there will be corresponding new nodes for the moved master pods to land on. Instead, the node readiness label is removed first from the old nodes, forcing the Spilo pods to ignore them; next, the nodes are marked as unschedulable one by one. When the operator finds that a node is both missing the readiness label and unschedulable, it starts moving master pods off it: it first finds a replica on a 'ready' node (or picks one on an old node and terminates it, so that it ends up on a new node), and then does a switchover to that replica.
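As a rough sketch of what that flow looks like from the administrator's side, assuming the CRD-based operator configuration; the label key/value `lifecycle-status: ready` and the node name are example choices, not required names:

```yaml
# Assumed operator configuration: nodes without this label are treated as
# 'not ready', so the operator keeps Postgres pods off them.
apiVersion: acid.zalan.do/v1
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  kubernetes:
    node_readiness_label:
      lifecycle-status: ready

# Per-node upgrade flow described above, expressed as kubectl steps
# (node name is a placeholder):
#   kubectl label node worker-old-1 lifecycle-status-   # drop the readiness label
#   kubectl cordon worker-old-1                         # mark it unschedulable
#   # the operator then moves master pods to replicas on 'ready' nodes,
#   # after which the old node can be drained and replaced
```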
@alexeyklyukin How would you recommend doing this for automatic cluster updates? They tend to just drain the nodes and then update them. I've seen this behavior with AKS from Azure and with Kubespray. Is there a recommendation for what to do with these kinds of automated installers? (The AKS update process can't be influenced.)
@mkjoerg are they doing it for all nodes at once? I don't think there is a reasonable expectation of achieving what the operator update is aiming for (minimizing the downtime and the number of failovers) without cooperation from the cluster update infrastructure. In this case I'd suggest either accepting the downtime or disabling automatic updates. Perhaps in the first case we want to make PDBs optional.
@alexeyklyukin In AKS, they do it one by one. First they drain the node, then they recreate it, as far as I can see. https://github.com/kubernetes-sigs/kubespray/blob/master/docs/upgrades.md While executing this flow, I ran into issues: with a Pod Disruption Budget that targets only one Pod and requires at least one Pod to be available, such processes are (as expected) not allowed by Kubernetes, because evicting the pod would violate the PDB.
As stated above, the operator does not act just on drain, but also on the label change. If other platforms all rely on the drain alone, this would probably make a good feature toggle: either respond to a drain by triggering Patroni, or simply disable the pod disruption budget. But the budget exists for other good reasons, e.g. preferably not scaling down nodes running stateful services (Postgres).
But to add: given that you do the full worker node upgrade, the operator and the label together give you the benefit of fewer failovers, as the new replica pod is spawned on an already upgraded node. That is the idea (as also already outlined here). Otherwise you get up to n failovers, depending on where your replica is.
I think fewer failovers are always a good thing. For my Kubespray process, I would be able to adjust the process to the needs of the operator. The problem is that with my AKS clusters I would never be able to upgrade the cluster without interfering with the operator, because there it is not possible to influence the update process. So maybe a configuration toggle in the config CRD or ConfigMap of the operator that allows disabling and enabling the PDB (which leads to its removal or creation) could solve this issue. Otherwise, there would be a need for a flag to disable them entirely and to live with the failovers. In the first case it would then be possible to safely remove the PDBs during a cluster update and just switch them back on later.
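A hedged sketch of what such a toggle could look like in the operator configuration; the parameter name `enable_pod_disruption_budget` and its placement under the `kubernetes` section are assumptions (a later operator version exposes a toggle along these lines, but at the time of this discussion it should be read as a proposal):

```yaml
# Hypothetical toggle to suppress PDB creation globally; defaulting it to
# true would keep the current behaviour.
configuration:
  kubernetes:
    enable_pod_disruption_budget: false
```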
That's better; now the question is whether you can tell the old nodes from the new ones. This is necessary in order to move Postgres pods from the node that is about to be migrated to a new node, avoiding repetitions of this process. The process coded into the operator relies on labels to distinguish between the old and new nodes; perhaps it is possible to label new nodes after the upgrade (maybe the ansible playbook referenced in the link you gave can be modified for this).
Nope, that is intended: the idea is to avoid taking the node down until the operator has taken care of moving all primary Postgres pods off it. Obviously, this relies on the upgrade process being somewhat compatible with what the operator is designed for; otherwise the PDB will just become an obstacle, as in your case. I see a good reason for disabling the PDB to handle platforms that do not allow customizing the upgrade process.
In the ansible flow this would be possible, but in my opinion it should be avoided, because each time they release a new installer version, this change must be migrated. In the case of the Azure AKS flow, it is impossible, because you just order a new version and they do the update for you. So on managed clusters, you are unable to adjust anything related to these processes.
This makes perfect sense to me.
I think the best option would be to be able to disable and enable them on demand, so that you can swap them out for the time window in which you execute the update, and later enable them again via config to get back all the protection they provide.
@FxKu is this feature already in the acid/postgres-operator:v1.1.0-28-g24d412a image that is installed by the master version of the helm chart, or is there another version that allows this?
My use case: trying to manage DB resource requests with VPA (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler). The replica pod is updated successfully, but the master pod can't be modified due to the PDB. The proposed PR allows disabling the PDB completely, but that's a pretty radical solution IMO. If there was a possibility just to remove …
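For context, a minimal sketch of the kind of VPA object involved in this use case, targeting the cluster's StatefulSet; the names are placeholders, and the inability to evict the master pod under the PDB is exactly the problem described above:

```yaml
# Minimal VPA sketch for a Postgres cluster's StatefulSet; target name is
# a placeholder and the update mode is just one possible choice.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: acid-minimal-cluster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: acid-minimal-cluster
  updatePolicy:
    updateMode: Auto   # VPA evicts pods to apply new requests; evicting the
                       # master is blocked by the PDB with minAvailable: 1
```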
Unless I'm missing something here, the operator invariably installs, for every created cluster, a PodDisruptionBudget (PDB) that requires one master to be available at all times (minAvailable=1). That prevents any eviction of the master pod, e.g. during node updates or reboots. Since Patroni should handle a master outage by electing another master (assuming that clients can handle the short unavailability of write access to the cluster), I'd suggest making the PDB configurable or entirely optional.