Make PodDisruptionBudget configurable #547

Closed

multi-io opened this issue Apr 23, 2019 · 14 comments · Fixed by #583
Comments

@multi-io

multi-io commented Apr 23, 2019

Unless I'm missing something here, the operator invariably installs, for every created cluster, a PodDisruptionBudget (PDB) that requires one master to be available at all times (minAvailable=1). That prevents any eviction of the master pod, e.g. during node updates or reboots. Since Patroni should handle a master outage by electing a new one (assuming clients can handle the short unavailability of write access to the cluster), I'd suggest making the PDB configurable or entirely optional.
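For reference, the budget in question can be inspected on a running cluster. This is only an illustrative sketch: the PDB object name below is assumed, while the selector on spilo-role: master and minAvailable: 1 are the parts discussed in this thread:

```sh
# Inspect the PDB the operator created for a cluster
# (the PDB name is assumed for illustration; adjust to your cluster name).
kubectl get pdb postgres-acid-minimal-cluster-pdb -o yaml
# The relevant fields in the output look roughly like:
#   spec:
#     minAvailable: 1
#     selector:
#       matchLabels:
#         spilo-role: master
```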

@multi-io multi-io changed the title PodDisruptionBudget prevents master pod migrations PodDisruptionBudget prevents master pod evictions Apr 23, 2019
@alexeyklyukin
Contributor

The idea behind requiring at least one master via the PodDisruptionBudget is to avoid multiple failovers during rolling upgrades: you want Patroni to handle the node updates and the switchover, not the K8s cluster manager killing your master pods. The need to reconnect is a lesser evil; the bigger issue to avoid is multiple failovers for one cluster throughout the rolling update (which happens if the new primary elected by Patroni after the abrupt killing of the old one lands on another old node).

@multi-io
Author

multi-io commented Apr 23, 2019

OK, so how would you evict the master pod, e.g. when draining the node it's running on? Would you first actively call Patroni to elect another master?

@FxKu
Member

FxKu commented Apr 24, 2019

I did some tests with kind recently and there were two ways to evict the master pod:

  1. Taint-based eviction using kubectl taint nodes <node-with-master> postgres=:NoExecute
  2. Mark the node as unschedulable with kubectl cordon <node-with-master> and remove the lifecycle-status label. For this, I had to define node_readiness_label: "lifecycle-status:ready" in the configmap.yaml first and then add the label to my nodes.

In both cases, it failed over to a replica on another node.
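A rough sketch of the second approach, assuming the readiness label from the comment above (node names are placeholders):

```sh
# Prerequisite: node_readiness_label: "lifecycle-status:ready" is set in the
# operator's configmap.yaml and the label has been added to the healthy nodes, e.g.
kubectl label node <node-name> lifecycle-status=ready

# To evict the master from a specific node, drop the readiness label and cordon it:
kubectl label node <node-with-master> lifecycle-status-   # remove the label
kubectl cordon <node-with-master>                         # mark it unschedulable
# The operator then fails the master over to a replica on a node that still
# carries the readiness label.
```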

@alexeyklyukin
Contributor

Yes, the idea when this functionality was implemented was to use the second option. You could, theoretically, just mark the nodes unschedulable, but that would make the operator move multiple pods at the same time without any guarantee that there will be corresponding new nodes for the moved master pods to land on.

Instead, the node readiness label is removed first from the old nodes, forcing the spilo pods to ignore them. Next, the nodes are marked as unschedulable one by one. When the operator finds that a node is both unschedulable and missing the readiness label, it starts moving master pods off it: it first finds a replica on a 'ready' node (or picks one on an old node and terminates it, so that it ends up on a new node) and then does a switchover to that replica.
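Sketched as commands, the flow described above might look like this for a node pool rotation (label name taken from the earlier comment; node names are placeholders):

```sh
# 1. New nodes come up and get the readiness label, so Postgres pods may land there.
kubectl label node <new-node> lifecycle-status=ready

# 2. Remove the readiness label from the old nodes so spilo pods ignore them.
kubectl label node <old-node> lifecycle-status-

# 3. Cordon the old nodes one by one. For each node that is both unschedulable and
#    missing the readiness label, the operator ensures a replica exists on a ready
#    node and performs a switchover before the old master pod goes away.
kubectl cordon <old-node>
```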

@containerpope

@alexeyklyukin How would you recommend doing this for automatic cluster updates? They tend to just drain the nodes and then update them. I have seen this behavior with AKS from Azure and with Kubespray.

Is there a recommendation for what to do with this kind of automated installer? (The AKS update process can't be influenced.)

@alexeyklyukin
Contributor

@mkjoerg are they doing it for all nodes at once? I don't think one can reasonably expect to achieve what the operator update is aiming for (minimizing the downtime and the number of failovers) without cooperation from the cluster update infrastructure. In that case I'd suggest either accepting the downtime or disabling automatic updates. Perhaps in the first case we want to make PDBs optional.

@containerpope

@alexeyklyukin In AKS, they do it one by one. First, they drain the node, then they recreate it as far as I can see.
Last week I tested an upgrade of one of my on-premise Kubespray-installed clusters, and it basically follows the same procedure: first they drain the node, then they update and restart it.

https://github.com/kubernetes-sigs/kubespray/blob/master/docs/upgrades.md

While executing this flow, I ran into issues: with a PodDisruptionBudget that targets only one pod and requires at least one pod to be available, such processes won't be allowed by Kubernetes, as expected, because they would violate the PDB.
Maybe I am just overlooking something here?

@Jan-M
Member

Jan-M commented May 14, 2019

As stated above, the operator does not act on the drain alone, but also on the label change.

If other platforms all rely on the drain alone, this would probably make a good feature toggle: respond to a drain by triggering Patroni, or simply disable the PodDisruptionBudget. But the budget exists for other good reasons, e.g. it is preferable not to downscale nodes running stateful services (Postgres).

@Jan-M
Member

Jan-M commented May 14, 2019

But to add: given that you do a full worker-node upgrade, the operator and the label together give you the benefit of fewer failovers, as the new replica pod is spawned on an already upgraded node. That is the idea (as already outlined here). Otherwise you get up to n failovers, depending on where your replica is.

@containerpope

I think fewer failovers are always a good thing. For my Kubespray process, I would be able to adjust the procedure to the needs of the operator. The problem is that with my AKS clusters I would never be able to upgrade the cluster without messing with the operator, because there it is not possible to interfere with the update process.

So maybe a configuration toggle in the operator's config CRD or ConfigMap that allows disabling and enabling the PDB (leading to its removal or creation) could solve this issue? Otherwise, there would need to be a flag to disable PDBs altogether and live with the failovers.

In the first case it would then be possible to safely remove the PDBs during a cluster update and just switch them back on later.

@alexeyklyukin
Contributor

In AKS, they do it one by one. First, they drain the node, then they recreate it as far as I can see.

That's better; now the question is whether you can tell the old nodes from the new ones. This is necessary in order to move Postgres pods from a node that is about to be migrated to a new node, avoiding having to repeat the process. The procedure coded into the operator relies on labels to distinguish between old and new nodes; perhaps it is possible to label the new nodes after the upgrade (maybe the Ansible playbook referenced in the link you gave can be modified for this).

While executing this flow, I ran into issues: with a PodDisruptionBudget that targets only one pod and requires at least one pod to be available, such processes won't be allowed by Kubernetes, as expected, because they would violate the PDB.

Nope, that is intended: the idea is to avoid taking the node down until the operator has taken care of moving all primary Postgres pods off it. Obviously, this relies on the upgrade process being somewhat compatible with what the operator is designed for; otherwise the PDB just becomes an obstacle, as in your case.

I see a good reason for allowing the PDB to be disabled, to handle platforms that do not allow customizing the upgrade process.

@containerpope

That's better; now the question is whether you can tell the old nodes from the new ones. This is necessary in order to move Postgres pods from a node that is about to be migrated to a new node, avoiding having to repeat the process. The procedure coded into the operator relies on labels to distinguish between old and new nodes; perhaps it is possible to label the new nodes after the upgrade (maybe the Ansible playbook referenced in the link you gave can be modified for this).

In the Ansible flow this would be possible, but in my opinion it should be avoided, because each time they release a new installer version this change would have to be carried over. In the case of the Azure AKS flow, it is impossible, because you just order a new version and they do the update for you. So in managed clusters you are unable to adjust anything related to the process.

Nope, that is intended: the idea is to avoid taking the node down until the operator has taken care of moving all primary Postgres pods off it. Obviously, this relies on the upgrade process being somewhat compatible with what the operator is designed for; otherwise the PDB just becomes an obstacle, as in your case.

This makes perfect sense to me.

I see a good reason for allowing the PDB to be disabled, to handle platforms that do not allow customizing the upgrade process.

I think the best option would be to be able to disable and enable them on demand, so that you can swap them out for the time window in which you execute the update and re-enable them by config later on, to get back all the features they provide.
If this is too complicated to achieve, because it would require some kind of extra watchdog for them, maybe being able to disable them completely in a managed cluster would make sense.

@FxKu FxKu changed the title PodDisruptionBudget prevents master pod evictions Make PodDisruptionBudget configurable May 17, 2019
@FxKu FxKu closed this as completed in #583 Jun 18, 2019
@containerpope

@FxKu is this feature already in the acid/postgres-operator:v1.1.0-28-g24d412a image that is installed by the master version of the helm chart, or is there another version that allows this?

@bitpavel

bitpavel commented Dec 25, 2020

My use case: trying to manage DB resource requests with the VPA (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler). The replica pod is updated successfully, but the master pod can't be modified due to the PDB.

The proposed PR allows disabling the PDB completely, but that is a pretty radical solution IMO. If there were a possibility to just remove spilo-role: master from the matchLabels, it would be a better alternative: a cluster with multiple pods could then be migrated/evicted and its pods updated progressively, while minimizing complete DB service disruptions.
Currently such behaviour can be achieved by setting enable_pod_disruption_budget to false and adding a custom PDB with no spilo-role constraint.
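A hedged sketch of that workaround: the enable_pod_disruption_budget option name comes from the comment above, while the PDB name and the cluster-name pod label value below are assumptions to be adjusted to your cluster:

```sh
# 1. Turn off the operator-managed PDB in the operator ConfigMap / configuration CRD:
#      enable_pod_disruption_budget: "false"
# 2. Apply a custom PDB that keeps at least one pod of the cluster available but
#    does not pin the master (no spilo-role in matchLabels):
kubectl apply -f - <<'EOF'
apiVersion: policy/v1              # use policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: acid-minimal-cluster-custom-pdb   # assumed name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      cluster-name: acid-minimal-cluster  # assumed pod label of the Postgres cluster
EOF
```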
