Using a restart_delay override creates a cluster with a persistent IncorrectProcess reconciliation problem #1976
Comments
Forgot to mention: I haven't yet looked into the code for a possible bugfix, but if I find the cause I'll submit a PR.
I suspect the issue is that restart_delay is an fdbmonitor parameter rather than an fdbserver parameter, so it gets absorbed by fdbmonitor and is not propagated to the fdbserver process. It might be best to add a dedicated configuration option for it in the CRD.
It is probably as you write. I made this PR for an unrelated trivial improvement: #1977. But I could not find anything obviously wrong in
Instead of having another setting in the CRD, it might be better to filter out those fdbmonitor parameters in GetStartCommandWithSubstitutions.
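A minimal sketch of the filtering idea, not the operator's actual code. The helper name `splitMonitorParameters` and the `monitorOnlyKeys` set are hypothetical; only `restart_delay` comes from this issue, and the `key=value` form of custom parameters is assumed. It partitions the parameters into those fdbmonitor would absorb and those that would reach the fdbserver command line:

```go
package main

import (
	"fmt"
	"strings"
)

// monitorOnlyKeys is a hypothetical set of parameters consumed by fdbmonitor
// itself. Only restart_delay is taken from this issue.
var monitorOnlyKeys = map[string]bool{
	"restart_delay": true,
}

// splitMonitorParameters is a hypothetical helper. It partitions custom
// parameters of the form "key=value" into fdbserver parameters and
// fdbmonitor-only parameters, preserving their original order.
func splitMonitorParameters(params []string) (server, monitor []string) {
	for _, p := range params {
		key, _, _ := strings.Cut(p, "=")
		if monitorOnlyKeys[strings.TrimSpace(key)] {
			monitor = append(monitor, p)
			continue
		}
		server = append(server, p)
	}
	return server, monitor
}

func main() {
	server, monitor := splitMonitorParameters([]string{
		"restart_delay=10",
		"knob_disable_posix_kernel_aio=1",
	})
	fmt.Println("fdbserver:", server)
	fmt.Println("fdbmonitor:", monitor)
}
```

Only the `server` slice would then be used when building the expected command line, so an fdbmonitor-only override would no longer cause a mismatch.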
I looked a bit more into this and we have to add a dedicated setting for this: https://apple.github.io/foundationdb/configuration.html#general-section:
Is there any specific use case for changing those parameters? I would like to better understand the need before adding another setting to our already complex
It affects how quickly the operator can recover a cluster, and interacts with the operator's own
And only killing all
It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest making it a validation error when someone tries to override it, so that the issue is easier and faster to identify.
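A rough sketch of what such a validation check could look like. The function name `validateCustomParameters` and the error wording are hypothetical and not part of the operator; the `key=value` parameter form is assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// validateCustomParameters is a hypothetical check. It returns an error if
// any custom parameter ("key=value") tries to override restart_delay, and
// nil otherwise.
func validateCustomParameters(params []string) error {
	for _, p := range params {
		key, _, _ := strings.Cut(p, "=")
		if strings.TrimSpace(key) == "restart_delay" {
			return fmt.Errorf("custom parameter %q is not allowed: restart_delay is an fdbmonitor setting", p)
		}
	}
	return nil
}

func main() {
	// Rejected: attempts to override restart_delay.
	fmt.Println(validateCustomParameters([]string{"restart_delay=10"}))
	// Accepted: an ordinary fdbserver knob.
	fmt.Println(validateCustomParameters([]string{"knob_disable_posix_kernel_aio=1"}))
}
```

Failing early like this would surface the problem when the cluster spec is submitted, rather than as a persistent reconciliation error in the operator logs.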
Sorry for the delay.
Do you have a reproducible test case for this? I'm not sure I understand the issue. Is the case you're describing one where the cluster was previously running? I'm a bit surprised that the fdbserver processes would be crashing the whole time.
If you have some tests/data for those cases and you are able to share them, that would be nice.
Right now I would tend to keep it as it is and instead add a check to prevent a user from setting it. If you have some data to show that a lower
No crashes. This is the case of a previously running cluster where all pods are deleted (thus similar to #1984). The operator is able to bring them back online; however (due to k8s reasons) they do not all schedule at the same time, which I suspect is what leads to the cluster being in an
I will try to reproduce it starting from a blank cluster, with steps; if I succeed I will open an issue.
Thanks, it's fine. I suspect that this setting is irrelevant for what I am testing.
What happened?
Creating a cluster with a restart_delay custom parameter override leads to the very first pods being created with the IncorrectCommandline/IncorrectProcess problem.
This is what the logs for this newly created cluster contain (filtered for IncorrectProcess):
The issue does not get resolved because even deleting the pod leads to the operator creating the pod with the still-incorrect command line.
If you look at the log lines you can see that this is the difference:
What did you expect to happen?
A new cluster is created without IncorrectProcess problems.
(Unless I am doing something wrong that leads to this and am completely missing it.)
How can we reproduce it (as minimally and precisely as possible)?
Create this cluster, then observe the operator logs:
Anything else we need to know?
No response
FDB Kubernetes operator
v1.33.0
Kubernetes version
Cloud provider