Bug Report

We find that after a restart, the controller can accidentally invoke DeletePodDisruptionBudget and then CreatePodDisruptionBudget inside ensureCassandraPodDisruptionBudget, which leads to an unexpected decrease of the PDB's MaxPodUnavailable. When the user has increased MaxPodUnavailable to force the operator to keep acting even while multiple disruptions are ongoing, this unexpected decrease causes the operator to stop taking any action, including any creation or update of the cluster's statefulset.
We find that the root cause is a stale read from the apiserver: the controller believes the "desired" MaxPodUnavailable (taken from the stale CassandraCluster spec) differs from the MaxPodUnavailable in the current PodDisruptionBudget, so it recreates the PodDisruptionBudget and sets MaxPodUnavailable back to the stale value.
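For illustration, here is a minimal Go sketch of that delete-then-recreate pattern. It is not the actual casskop code: ensurePDB, cassandraClusterView, and the PDB naming are placeholders, and the creation path for a missing PDB is omitted. The point is that whatever MaxPodUnavailable value the controller reads, stale or not, ends up in the recreated PDB.

```go
// Hypothetical sketch of the vulnerable reconcile pattern; not the real
// ensureCassandraPodDisruptionBudget. cassandraClusterView stands in for the
// relevant part of the CassandraCluster spec.
package pdbsketch

import (
	"context"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type cassandraClusterView struct {
	Namespace, Name   string
	MaxPodUnavailable int32 // value read from the (possibly stale) apiserver
}

func ensurePDB(ctx context.Context, c client.Client, cc cassandraClusterView) error {
	current := &policyv1beta1.PodDisruptionBudget{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: cc.Namespace, Name: cc.Name}, current); err != nil {
		return err // (creating the PDB when it does not exist yet is omitted here)
	}

	desired := intstr.FromInt(int(cc.MaxPodUnavailable))
	if current.Spec.MaxUnavailable != nil && current.Spec.MaxUnavailable.IntValue() == desired.IntValue() {
		return nil // values already match, nothing to do
	}

	// Problematic step: cc may come from a stale apiserver view, yet the PDB is
	// deleted and recreated with whatever value was read, shrinking it back to 1
	// in the scenario described below.
	if err := c.Delete(ctx, current); err != nil {
		return err
	}
	recreated := &policyv1beta1.PodDisruptionBudget{}
	recreated.Namespace, recreated.Name = cc.Namespace, cc.Name
	recreated.Spec.MaxUnavailable = &desired
	return c.Create(ctx, recreated)
}
```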
One potential approach to fix this is to save a copy of the CassandraCluster's ResourceVersion inside the PDB when the PDB is created. Each time we are about to recreate the PDB, we compare the current CassandraCluster's ResourceVersion with the one stored in the PDB to check whether the former is newer than the latter. If it is not, the read of the CassandraCluster is stale and we should not update the PDB.
What did you do?
Concrete steps that led to this issue (in an HA k8s cluster):
1. Create a CassandraCluster cassandra-cluster with MaxPodUnavailable set to 1. The controller is talking to apiserver1. At this point, the corresponding PDB is also created by the operator with MaxPodUnavailable set to 1.
2. Resize the MaxPodUnavailable of cassandra-cluster to 4. The controller finds that the desired MaxPodUnavailable differs from the current one, so it deletes the current PDB and recreates it with MaxPodUnavailable set to 4. Meanwhile, apiserver2 is stuck with a view in which cassandra-cluster's MaxPodUnavailable is still 1 (because it is running slowly or has network issues).
3. The controller restarts after a node failure and talks to the stale apiserver2. Reading the stale cassandra-cluster from apiserver2, the controller thinks the "desired" MaxPodUnavailable is still 1, which differs from the current value (4) in the PDB, so it deletes the PDB and recreates it with MaxPodUnavailable set to 1.
What did you expect to see?
PDB should not be affected when the controller reads stale information about cassandra-cluster.
What did you see instead? Under which circumstances?
The PDB gets recreated with a smaller MaxPodUnavailable, which can block user operations if a disruption is ongoing, as described above.

casskop version: f87c8e0 (master branch)
Kubernetes version information: 1.18.9
Cassandra version: 3.11
Possible Solution
We are willing to help fix this bug by issuing a PR.
As mentioned above, we can attach the CassandraCluster's ResourceVersion to the PDB. The operator can then tell that the "desired" MaxPodUnavailable shown in the CassandraCluster spec is stale by checking whether the ResourceVersion (an incremental ID) of the CassandraCluster it just read is older than the one attached to the PDB. In the stale-read case, the operator should not recreate the PDB but return an error instead.
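A minimal sketch of that guard, assuming the CassandraCluster's ResourceVersion is recorded in an annotation on the PDB at creation time. The annotation key and helper names below are illustrative, not existing casskop identifiers.

```go
// Illustrative guard for the proposal above; the annotation key and function
// names are made up for this sketch.
package pdbsketch

import (
	"fmt"
	"strconv"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
)

const lastCCResourceVersionAnnotation = "db.orange.com/last-cassandracluster-resource-version"

// stampResourceVersion records the CassandraCluster's ResourceVersion on the
// PDB when the PDB is created or recreated.
func stampResourceVersion(pdb *policyv1beta1.PodDisruptionBudget, ccResourceVersion string) {
	if pdb.Annotations == nil {
		pdb.Annotations = map[string]string{}
	}
	pdb.Annotations[lastCCResourceVersionAnnotation] = ccResourceVersion
}

// allowRecreate returns an error when the CassandraCluster we just read is not
// newer than the one the existing PDB was built from, i.e. a stale read.
// Kubernetes formally treats ResourceVersion as opaque; the numeric comparison
// relies on the etcd-backed, increasing counter this proposal assumes.
func allowRecreate(pdb *policyv1beta1.PodDisruptionBudget, ccResourceVersion string) error {
	stored, ok := pdb.Annotations[lastCCResourceVersionAnnotation]
	if !ok {
		return nil // no recorded version (e.g. existing clusters): keep current behaviour
	}
	storedRV, err1 := strconv.ParseUint(stored, 10, 64)
	currentRV, err2 := strconv.ParseUint(ccResourceVersion, 10, 64)
	if err1 != nil || err2 != nil {
		return nil // unparsable versions: do not block the reconcile
	}
	if currentRV <= storedRV {
		return fmt.Errorf("stale CassandraCluster read (resourceVersion %s is not newer than %s), refusing to recreate PDB", ccResourceVersion, stored)
	}
	return nil
}
```

In a reconcile loop like the ensurePDB sketch earlier, allowRecreate would be called before the Delete, and stampResourceVersion would be called on the recreated PDB.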
Sorry for the delay. We sent a fix for this problem here: #355
It seems that the fix breaks some unit tests. We will try to find a better way to gracefully solve the problem without introducing test failures.
@srteam2020 it also must keep working for existing clusters. So if it could prevent an existing cluster from working, it needs to be activated via a parameter or another non-disruptive solution.
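One non-disruptive option, sketched under the same assumptions as above: gate the new check behind a hypothetical operator flag that defaults to off, so existing clusters keep today's behaviour unless the guard is explicitly enabled (the flag name is made up for illustration).

```go
// Illustrative only: an opt-in flag so the stale-read guard does not change
// behaviour for existing clusters unless explicitly enabled.
package pdbsketch

import "flag"

var checkStaleCassandraCluster = flag.Bool(
	"check-stale-cassandracluster",
	false, // off by default: existing clusters keep the current behaviour
	"refuse to recreate the PDB when the CassandraCluster read is older than the one the PDB was built from",
)

// guardEnabled would be checked before calling allowRecreate in the sketch above.
func guardEnabled() bool { return *checkStaleCassandraCluster }
```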