cluster: delete PVCs after decom if flag is set #206
Conversation
if err := c.Status().Update(ctx, pandaCluster); err != nil {
	return fmt.Errorf("could not scale cluster %s to %d replicas: %w", pandaCluster.Name, replicas, err)

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
I'm often seeing conflicts on this update; now using retry, as we should.
Rather than using retry, why not use a patch here? Patches, by default, won't check the resource version and will only change the values you've changed since the last read (assuming you're using a client-side patch).
.Status is the one exception to my preference for relying on resource version checking. You've already done the work, so you're the one that knows the actual state of the status. Statuses, despite being a subresource, still rely on checking the resource version of the whole object. So you might be getting conflicts because someone made a change to the Spec that doesn't affect your change to the status.
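
A minimal sketch of what that could look like at this call site, assuming a controller-runtime client imported as client; the status field updated here is illustrative and not taken from the PR:

// Snapshot before mutating, then patch only the status subresource.
// client.MergeFrom builds a merge patch that, by default, does not check
// the resource version, so a concurrent Spec update will not conflict.
base := pandaCluster.DeepCopy()
pandaCluster.Status.CurrentReplicas = replicas // illustrative field
if err := c.Status().Patch(ctx, pandaCluster, client.MergeFrom(base)); err != nil {
	return fmt.Errorf("could not scale cluster %s to %d replicas: %w", pandaCluster.Name, replicas, err)
}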
LGTM though I'd encourage you to use a patch instead of a retry loop for the status update before merging.
var pvcReclaimRetentionPolicy appsv1.StatefulSetPersistentVolumeClaimRetentionPolicy
if r.autoDeletePVCs {
	pvcReclaimRetentionPolicy.WhenScaled = appsv1.DeletePersistentVolumeClaimRetentionPolicyType
	pvcReclaimRetentionPolicy.WhenDeleted = appsv1.DeletePersistentVolumeClaimRetentionPolicyType
The paranoia in me says we may want to keep WhenDeleted as Retain? I can't really imagine someone scaling the cluster manually, but I could see cases where the STS needs to be deleted with orphan = true to work around its immutability.
Oh, you're so right. GOOD CALL!
Delete on orphan DOES happen, and is actively used by the operator. Will change WhenDeleted to RETAIN.
Actually, I need to test whether orphan delete triggers this behavior.
It looks like deleting with orphan does not get PVCs deleted; obviously, since the Pods are retained as well. I can add a test. Then it's fine to keep using Delete on delete, what do you think?
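
For context, a minimal sketch of an orphan delete issued through a controller-runtime client (names here are illustrative, not the operator's actual code); it removes only the StatefulSet object, and per the observation above the Pods and PVCs stay in place:

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteStatefulSetOrphan removes only the StatefulSet object; its Pods are
// orphaned and, per the discussion above, the PVCs are kept as well.
func deleteStatefulSetOrphan(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	return c.Delete(ctx, sts, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}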
Switched to a Patch call. Moved the patch helper func from cluster_controller to pkg/patch (pkg/utils does not work because of an import conflict with the types package).
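
The moved helper itself isn't quoted in this thread; a minimal sketch of what a status-patch helper in pkg/patch could look like, with a hypothetical name and signature:

package patch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PatchStatus snapshots obj, applies mutate, and writes only the status
// subresource with a client-side merge patch (hypothetical helper shape).
func PatchStatus(ctx context.Context, c client.Client, obj client.Object, mutate func()) error {
	base := obj.DeepCopyObject().(client.Object)
	mutate()
	return c.Status().Patch(ctx, obj, client.MergeFrom(base))
}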
OK, I will change it. I followed patterns existing in the operator already, where Update was used. But happy to use Patch.
If auto-delete-pvcs is set, use PersistentVolumeClaimRetentionPolicy to auto-delete PVCs at scale-down and delete. This allows us to scale down and up; PVCs are worthless after scaling down because we decommission before scaling down. A decommissioned broker cannot rejoin the cluster.
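
A short sketch of how the flag could end up on the generated StatefulSet, reusing the appsv1 types from the diff above; the sts variable and the surrounding builder are assumed, not taken from the PR:

if r.autoDeletePVCs {
	sts.Spec.PersistentVolumeClaimRetentionPolicy = &appsv1.StatefulSetPersistentVolumeClaimRetentionPolicy{
		// Drop a decommissioned broker's PVC on scale-down; it cannot rejoin anyway.
		WhenScaled: appsv1.DeletePersistentVolumeClaimRetentionPolicyType,
		// Deleting the StatefulSet with cascade=orphan keeps Pods and PVCs
		// either way (see the discussion above), so Delete stays here too.
		WhenDeleted: appsv1.DeletePersistentVolumeClaimRetentionPolicyType,
	}
}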
Since this feature requires k8s 1.28, I made the following additional changes:
Ready for re-review @RafalKorepta @chrisseto - thank you!
It's not possible to toggle feature flags on a per-test-case basis. Therefore, as per advice from Rafal, we're adding a new test suite. It also uses k8s 1.28, as this is what cloud uses and what the new code under test requires.
LGTM
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
  - timeout: 300
    script: |
      kubectl wait --for=delete pvc/datadir-decommission-2 --timeout 0s -n redpanda --namespace $NAMESPACE
Nice! I would have defaulted to a TestStep, but then it's hard to assert. Your approach is better.
To scale down and up again, we need to delete PVCs after a broker/pod
has been removed.
If auto-delete-pvcs is set, use PersistentVolumeClaimRetentionPolicy to
auto-delete PVCs at scale-down and delete.
This allows us to scale down and up; PVCs are worthless after scaling
down because we decommission before scaling down. A decommissioned broker
cannot rejoin the cluster.