
maxConcurrentReplacements causing deletion update strategy #1918

Closed
simenl opened this issue Jan 12, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@simenl
Collaborator

simenl commented Jan 12, 2024

What happened?

As a mitigation for storage roles being recruited as log processes [forum post], we set maxConcurrentReplacements to reduce the number of concurrent exclusions.
However, this caused the Deletion strategy to be incorrectly applied to the remaining processes for updates that require the Replacement strategy. Consequently, this resulted in unschedulable pods, because we had updated the node selector to an availability zone incompatible with the process's existing persistent volume.

What did you expect to happen?

Processes that require replacement should not be eligible for the delete update strategy, even if they have not (yet) been selected for replacement due to maxConcurrentReplacements.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a FoundationDB cluster with maxConcurrentReplacements set and multiple storage pods/processes:
spec:
  automationOptions:
    maxConcurrentReplacements: 1
  2. Make an update to the CRD that requires a replacement, e.g. changing the node selector.

  3. Observe that some of the storage pods are updated through the delete update strategy.
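For reference, a minimal manifest fragment that triggers the scenario might look roughly like the following sketch (the cluster name and the zone label value are placeholders; field paths follow the v1beta2 FoundationDBCluster CRD):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster          # placeholder name
spec:
  version: 7.1.43
  automationOptions:
    maxConcurrentReplacements: 1 # throttle concurrent replacements/exclusions
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            # Changing this zone to one incompatible with the existing
            # persistent volumes is an update that requires replacement.
            topology.kubernetes.io/zone: europe-west1-b
```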

Anything else we need to know?

No response

FDB Kubernetes operator

We run FoundationDB on kubernetes through the fdb-kubernetes-operator: v1.28.1.
FoundationDB version: 7.1.43

Kubernetes version

version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7-gke.500"}

Cloud provider

GCP

@simenl simenl added the bug Something isn't working label Jan 12, 2024
@johscheuer
Member

Thanks for the report! Do you think you would be able to work on a fix and a test for this?

@simenl
Collaborator Author

simenl commented Jan 19, 2024

Yes 👍
I'm thinking of reusing the checks from ReplaceMisconfiguredProcessGroups in getPodsToUpdate. Does that seem sound?
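For illustration, the proposed filtering could be sketched as below. This is hypothetical Go, not the operator's actual code: ProcessGroup and podsEligibleForDeleteUpdate are stand-ins, not the real types or the signatures of ReplaceMisconfiguredProcessGroups/getPodsToUpdate. The idea is simply that any process group flagged as needing replacement is excluded from the delete update path, regardless of whether maxConcurrentReplacements has selected it yet.

```go
package main

import "fmt"

// ProcessGroup is a simplified, hypothetical stand-in for the operator's
// process group status.
type ProcessGroup struct {
	ID               string
	NeedsReplacement bool // e.g. a node selector change incompatible with the PV
}

// podsEligibleForDeleteUpdate returns the IDs of process groups that may be
// updated via the delete strategy. Groups that need replacement are skipped,
// even if maxConcurrentReplacements has not picked them up yet.
func podsEligibleForDeleteUpdate(groups []ProcessGroup) []string {
	var eligible []string
	for _, g := range groups {
		if g.NeedsReplacement {
			continue // must wait for the replacement strategy instead
		}
		eligible = append(eligible, g.ID)
	}
	return eligible
}

func main() {
	groups := []ProcessGroup{
		{ID: "storage-1", NeedsReplacement: true},
		{ID: "storage-2", NeedsReplacement: false},
	}
	fmt.Println(podsEligibleForDeleteUpdate(groups)) // prints [storage-2]
}
```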

@johscheuer
Member

That sounds reasonable 👍

@johscheuer
Member

Hello @simenl, just wanted to check how things are going and whether you're still able to work on a fix for this?

@simenl
Collaborator Author

simenl commented Mar 4, 2024

Hi @johscheuer,
sorry, I got caught up with other things.

I've put up a draft: #1954
