
maxConcurrentReplacements causing deletion update strategy #1918

Closed
simenl opened this issue Jan 12, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@simenl
Collaborator

simenl commented Jan 12, 2024

What happened?

As a mitigation for storage roles being recruited as log processes [forum post], we set maxConcurrentReplacements to reduce the number of concurrent exclusions.
However, this caused the Deletion strategy to be incorrectly applied to the remaining processes for updates that require the Replacement strategy. Consequently, this resulted in unschedulable pods, because we had updated the node selector to an availability zone incompatible with the process's existing persistent volume.

What did you expect to happen?

Processes that require replacement should not be eligible for the delete update strategy, even if they have not (yet) been selected for replacement due to maxConcurrentReplacements.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a FoundationDB cluster with maxConcurrentReplacements set and multiple storage pods/processes:
spec:
  automationOptions:
    maxConcurrentReplacements: 1
  2. Make an update to the CRD that requires a replacement, e.g. changing the node selector.

  3. Observe that some of the storage pods are updated through the delete update strategy.
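For reference, a minimal manifest fragment that triggers the scenario might look roughly like the following sketch (the cluster name and the zone label value are placeholders; field paths follow the v1beta2 FoundationDBCluster CRD):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster          # placeholder name
spec:
  version: 7.1.43
  automationOptions:
    maxConcurrentReplacements: 1 # throttle concurrent replacements/exclusions
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            # Changing this zone to one incompatible with the existing
            # persistent volumes is an update that requires replacement.
            topology.kubernetes.io/zone: europe-west1-b
```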

Anything else we need to know?

No response

FDB Kubernetes operator

We run FoundationDB on kubernetes through the fdb-kubernetes-operator: v1.28.1.
FoundationDB version: 7.1.43

Kubernetes version

version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7-gke.500"}

Cloud provider

GCP

@simenl simenl added the bug Something isn't working label Jan 12, 2024
@johscheuer
Member

Thanks for the report! Do you think you would be able to work on a fix and a test for this?

@simenl
Collaborator Author

simenl commented Jan 19, 2024

Yes 👍
I'm thinking of reusing the checks from ReplaceMisconfiguredProcessGroups in getPodsToUpdate. Does that seem sound?
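For illustration, the proposed filtering could be sketched as below. This is hypothetical Go, not the operator's actual code: ProcessGroup and podsEligibleForDeleteUpdate are stand-ins, not the real types or the signatures of ReplaceMisconfiguredProcessGroups/getPodsToUpdate. The idea is simply that any process group flagged as needing replacement is excluded from the delete update path, regardless of whether maxConcurrentReplacements has selected it yet.

```go
package main

import "fmt"

// ProcessGroup is a simplified, hypothetical stand-in for the operator's
// process group status.
type ProcessGroup struct {
	ID               string
	NeedsReplacement bool // e.g. a node selector change incompatible with the PV
}

// podsEligibleForDeleteUpdate returns the IDs of process groups that may be
// updated via the delete strategy. Groups that need replacement are skipped,
// even if maxConcurrentReplacements has not picked them up yet.
func podsEligibleForDeleteUpdate(groups []ProcessGroup) []string {
	var eligible []string
	for _, g := range groups {
		if g.NeedsReplacement {
			continue // must wait for the replacement strategy instead
		}
		eligible = append(eligible, g.ID)
	}
	return eligible
}

func main() {
	groups := []ProcessGroup{
		{ID: "storage-1", NeedsReplacement: true},
		{ID: "storage-2", NeedsReplacement: false},
	}
	fmt.Println(podsEligibleForDeleteUpdate(groups)) // prints [storage-2]
}
```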

@johscheuer
Member

That sounds reasonable 👍

@johscheuer
Member

Hello @simenl, just wanted to check how things are going and whether you're still able to work on a fix for this?

@simenl
Collaborator Author

simenl commented Mar 4, 2024

Hi @johscheuer,
sorry, I got caught up with other things.

I've put up a draft: #1954
