Make PodDisruptionBudget configurable #547

Closed

multi-io opened this issue Apr 23, 2019 · 14 comments · Fixed by #583
Comments

@multi-io

multi-io commented Apr 23, 2019

Unless I'm missing something here, the operator invariably installs, for every created cluster, a PodDisruptionBudget (PDB) that requires one master to be available at all times (minAvailable=1). That prevents any eviction of the master pod, e.g. during node updates or reboots. Since Patroni should handle a master outage by electing a new one (assuming clients can handle the short unavailability of write access to the cluster), I'd suggest making the PDB configurable or entirely optional.
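For reference, the budget in question can be inspected on a running cluster. This is only an illustrative sketch: the PDB object name below is assumed, while the selector on spilo-role: master and minAvailable: 1 are the parts discussed in this thread:

```sh
# Inspect the PDB the operator created for a cluster
# (the PDB name is assumed for illustration; adjust to your cluster name).
kubectl get pdb postgres-acid-minimal-cluster-pdb -o yaml
# The relevant fields in the output look roughly like:
#   spec:
#     minAvailable: 1
#     selector:
#       matchLabels:
#         spilo-role: master
```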

@multi-io multi-io changed the title PodDisruptionBudget prevents master pod migrations PodDisruptionBudget prevents master pod evictions Apr 23, 2019
@alexeyklyukin
Contributor

The idea behind requiring at least one master via the PodDisruptionBudget is to avoid multiple failovers during rolling upgrades: you want Patroni to handle the node updates and the switchover, not the K8s cluster manager killing your master pods. The need to reconnect is a lesser evil; the bigger issue to avoid is multiple failovers for one cluster throughout the rolling update (which happens if the new primary elected by Patroni after the abrupt killing of the old one lands on another old node).

@multi-io
Author

multi-io commented Apr 23, 2019

OK, so how would you evict the master pod, e.g. when draining the node it's running on? Would you first actively call Patroni to elect another master?

@FxKu
Member

FxKu commented Apr 24, 2019

I did some tests with kind recently and there were two ways to evict the master pod:

  1. Taint-based eviction using kubectl taint nodes <node-with-master> postgres=:NoExecute
  2. Mark the node as unschedulable with kubectl cordon <node-with-master> and remove the lifecycle-status label. For this, I had to define node_readiness_label: "lifecycle-status:ready" in the configmap.yaml first and then add the label to my nodes.

In both cases, it failed over to a replica on another node.
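A rough sketch of the second approach, assuming the readiness label from the comment above (node names are placeholders):

```sh
# Prerequisite: node_readiness_label: "lifecycle-status:ready" is set in the
# operator's configmap.yaml and the label has been added to the healthy nodes, e.g.
kubectl label node <node-name> lifecycle-status=ready

# To evict the master from a specific node, drop the readiness label and cordon it:
kubectl label node <node-with-master> lifecycle-status-   # remove the label
kubectl cordon <node-with-master>                         # mark it unschedulable
# The operator then fails the master over to a replica on a node that still
# carries the readiness label.
```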

@alexeyklyukin
Contributor

Yes, the idea when this functionality was implemented was to use the second option. You could, theoretically, just mark the nodes unschedulable, but that would make the operator move multiple pods at the same time without any guarantee that there will be corresponding new nodes for the moved master pods to land on.

Instead, the node readiness label is removed first from the old nodes, forcing the spilo pods to ignore them. Next, the nodes are marked as unschedulable one by one. When the operator finds that a node is both unschedulable and missing the readiness label, it starts moving master pods off it: it first finds a replica on a 'ready' node (or picks one on an old node and terminates it, so that it ends up on a new node) and then does a switchover to that replica.
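Sketched as commands, the flow described above might look like this for a node pool rotation (label name taken from the earlier comment; node names are placeholders):

```sh
# 1. New nodes come up and get the readiness label, so Postgres pods may land there.
kubectl label node <new-node> lifecycle-status=ready

# 2. Remove the readiness label from the old nodes so spilo pods ignore them.
kubectl label node <old-node> lifecycle-status-

# 3. Cordon the old nodes one by one. For each node that is both unschedulable and
#    missing the readiness label, the operator ensures a replica exists on a ready
#    node and performs a switchover before the old master pod goes away.
kubectl cordon <old-node>
```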

@containerpope

@alexeyklyukin How would you recommend doing this for automatic cluster updates? They tend to just drain the nodes and then update them. I have seen this behavior with AKS from Azure and with Kubespray.

Is there a recommendation for what to do with this kind of automated installer? (The AKS update process can't be influenced.)

@alexeyklyukin
Contributor

@mkjoerg are they doing it for all nodes at once? I don't think one can reasonably expect to achieve what the operator update is aiming for (minimizing the downtime and the number of failovers) without cooperation from the cluster update infrastructure. In that case I'd suggest either accepting the downtime or disabling automatic updates. Perhaps in the first case we want to make PDBs optional.

@containerpope

@alexeyklyukin In AKS, they do it one by one. First, they drain the node, then they recreate it as far as I can see.
Last week I tested an upgrade of one of my on-premise Kubespray-installed clusters, and it basically follows the same procedure: first they drain the node, then they update and restart it.

https://github.com/kubernetes-sigs/kubespray/blob/master/docs/upgrades.md

While executing this flow, I ran into issues: with a PodDisruptionBudget that targets only one pod and requires at least one pod to be available, such processes won't be allowed by Kubernetes, as expected, because they would violate the PDB.
Maybe I am just overlooking something here?

@Jan-M
Member

Jan-M commented May 14, 2019

As stated above, the operator does not act on the drain alone, but also on the label change.

If other platforms all rely on the drain alone, this would probably make a good feature toggle: respond to a drain by triggering Patroni, or simply disable the PodDisruptionBudget. But the budget exists for other good reasons, e.g. it is preferable not to downscale nodes running stateful services (Postgres).

@Jan-M
Member

Jan-M commented May 14, 2019

But to add: given that you do a full worker-node upgrade, the operator and the label together give you the benefit of fewer failovers, as the new replica pod is spawned on an already upgraded node. That is the idea (as already outlined here). Otherwise you get up to n failovers, depending on where your replica is.

@containerpope

I think fewer failovers are always a good thing. For my Kubespray process, I would be able to adjust the procedure to the needs of the operator. The problem is that with my AKS clusters I would never be able to upgrade the cluster without messing with the operator, because there it is not possible to interfere with the update process.

So maybe a configuration toggle in the operator's config CRD or ConfigMap that allows disabling and enabling the PDB (leading to its removal or creation) could solve this issue? Otherwise, there would need to be a flag to disable PDBs altogether and live with the failovers.

In the first case it would then be possible to safely remove the PDBs during a cluster update and just switch them back on later.

@alexeyklyukin
Contributor

In AKS, they do it one by one. First, they drain the node, then they recreate it as far as I can see.

That's better; now the question is whether you can tell the old nodes from the new ones. This is necessary in order to move Postgres pods from a node that is about to be migrated to a new node, avoiding having to repeat the process. The procedure coded into the operator relies on labels to distinguish between old and new nodes; perhaps it is possible to label the new nodes after the upgrade (maybe the Ansible playbook referenced in the link you gave can be modified for this).

While executing this flow, I ran into issues: with a PodDisruptionBudget that targets only one pod and requires at least one pod to be available, such processes won't be allowed by Kubernetes, as expected, because they would violate the PDB.

Nope, that is intended: the idea is to avoid taking the node down until the operator has taken care of moving all primary Postgres pods off it. Obviously, this relies on the upgrade process being somewhat compatible with what the operator is designed for; otherwise the PDB just becomes an obstacle, as in your case.

I see a good reason for allowing the PDB to be disabled, to handle platforms that do not allow customizing the upgrade process.

@containerpope

That's better; now the question is whether you can tell the old nodes from the new ones. This is necessary in order to move Postgres pods from a node that is about to be migrated to a new node, avoiding having to repeat the process. The procedure coded into the operator relies on labels to distinguish between old and new nodes; perhaps it is possible to label the new nodes after the upgrade (maybe the Ansible playbook referenced in the link you gave can be modified for this).

In the Ansible flow this would be possible, but in my opinion it should be avoided, because each time they release a new installer version this change would have to be carried over. In the case of the Azure AKS flow, it is impossible, because you just order a new version and they do the update for you. So in managed clusters you are unable to adjust anything related to the process.

Nope, that is intended: the idea is to avoid taking the node down until the operator has taken care of moving all primary Postgres pods off it. Obviously, this relies on the upgrade process being somewhat compatible with what the operator is designed for; otherwise the PDB just becomes an obstacle, as in your case.

This makes perfect sense to me.

I see a good reason for allowing the PDB to be disabled, to handle platforms that do not allow customizing the upgrade process.

I think the best option would be to be able to disable and enable them on demand, so that you can swap them out for the time window in which you execute the update and re-enable them by config later on, to get back all the features they provide.
If this is too complicated to achieve, because it would require some kind of extra watchdog for them, maybe being able to disable them completely in a managed cluster would make sense.

@FxKu FxKu changed the title PodDisruptionBudget prevents master pod evictions Make PodDisruptionBudget configurable May 17, 2019
@FxKu FxKu closed this as completed in #583 Jun 18, 2019
@containerpope

@FxKu is this feature already in the acid/postgres-operator:v1.1.0-28-g24d412a image that is installed by the master version of the helm chart, or is there another version that allows this?

@bitpavel

bitpavel commented Dec 25, 2020

My use case: trying to manage DB resource requests with the VPA (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler). The replica pod is updated successfully, but the master pod can't be modified due to the PDB.

The proposed PR allows disabling the PDB completely, but that is a pretty radical solution IMO. If there were a possibility to just remove spilo-role: master from the matchLabels, it would be a better alternative: a cluster with multiple pods could then be migrated/evicted and its pods updated progressively, while minimizing complete DB service disruptions.
Currently such behaviour can be achieved by setting enable_pod_disruption_budget to false and adding a custom PDB with no spilo-role constraint.
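A hedged sketch of that workaround: the enable_pod_disruption_budget option name comes from the comment above, while the PDB name and the cluster-name pod label value below are assumptions to be adjusted to your cluster:

```sh
# 1. Turn off the operator-managed PDB in the operator ConfigMap / configuration CRD:
#      enable_pod_disruption_budget: "false"
# 2. Apply a custom PDB that keeps at least one pod of the cluster available but
#    does not pin the master (no spilo-role in matchLabels):
kubectl apply -f - <<'EOF'
apiVersion: policy/v1              # use policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: acid-minimal-cluster-custom-pdb   # assumed name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      cluster-name: acid-minimal-cluster  # assumed pod label of the Postgres cluster
EOF
```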
