Using a restart_delay override creates a cluster with a persistent IncorrectProcess reconciliation problem #1976

Closed
gm42 opened this issue Mar 20, 2024 · 8 comments · Fixed by #1988
Labels
bug Something isn't working

Comments

@gm42
Contributor

gm42 commented Mar 20, 2024

What happened?

Creating a cluster with a restart_delay custom parameter override leads to the very first pods being created with the IncorrectCommandline / IncorrectProcess problem.

This is what the logs for this newly created cluster contain (filtered for IncorrectProcess):

{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.180.127:4501 --locality_instance_id=coordinator-1 --locality_machineid=ip-10-20-186-38.us-east-1.compute.internal --locality_zoneid=ip-10-20-186-38.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.180.127:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.180.127:4501 --locality_instance_id=coordinator-1 --locality_machineid=ip-10-20-186-38.us-east-1.compute.internal --locality_zoneid=ip-10-20-186-38.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.180.127:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-2","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.120.70:4501 --locality_instance_id=coordinator-3 --locality_machineid=ip-10-20-161-74.us-east-1.compute.internal --locality_zoneid=ip-10-20-161-74.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.120.70:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.120.70:4501 --locality_instance_id=coordinator-3 --locality_machineid=ip-10-20-161-74.us-east-1.compute.internal --locality_zoneid=ip-10-20-161-74.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.120.70:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-3","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=log --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.40.39:4501 --locality_instance_id=log-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.40.39:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=log --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.40.39:4501 --locality_instance_id=log-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.40.39:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"log-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=stateless --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.177.132:4501 --locality_instance_id=stateless-1 --locality_machineid=ip-10-20-72-89.us-east-1.compute.internal --locality_zoneid=ip-10-20-72-89.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.177.132:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=stateless --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.177.132:4501 --locality_instance_id=stateless-1 --locality_machineid=ip-10-20-72-89.us-east-1.compute.internal --locality_zoneid=ip-10-20-72-89.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.177.132:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"stateless-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=storage --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.190.145:4501 --locality_instance_id=storage-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.190.145:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=storage --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.190.145:4501 --locality_instance_id=storage-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.190.145:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"storage-1","emptyMonitorConf":false}

The issue does not resolve itself: even deleting the pod leads to the operator recreating it with the still-incorrect command line.
Looking at the log lines, this is the difference:

-/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json
+/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json

What did you expect to happen?

A new cluster is created without IncorrectProcess problems.

(Unless I am doing something wrong which is leading to this, and completely missing it)

How can we reproduce it (as minimally and precisely as possible)?

Create this cluster, then observe the operator logs:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: mycluster
  namespace: mynamespace
spec:
  version: 7.1.15
  coordinatorSelection:
  - priority: 100
    processClass: coordinator
  databaseConfiguration:
    logs: 1
    storage: 1
    redundancy_mode: single
    storage_engine: ssd-2
  processCounts:
    coordinator: 3
    stateless: 1
    storage: 1
    log: 1
  processes:
    general:
      customParameters:
        - trace_format=json
        - restart_delay=15
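
To see the mismatch, the operator log can be filtered for IncorrectProcess and compared against the command line actually running inside a pod. A sketch of the commands, assuming the operator is deployed as fdb-kubernetes-operator-controller-manager and using an illustrative pod name:

# Operator side: filter the reconciliation log for the mismatch
kubectl logs -n mynamespace deploy/fdb-kubernetes-operator-controller-manager | grep IncorrectProcess

# Pod side: inspect the fdbserver command line that is actually running
# (assumes ps is available in the container image)
kubectl exec -n mynamespace mycluster-coordinator-1 -- ps -C fdbserver -o args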

Anything else we need to know?

No response

FDB Kubernetes operator

v1.33.0

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.16-eks-508b6b3", GitCommit:"ea12af9a627fef6922a0aa5fff59660ec8657d13", GitTreeState:"clean", BuildDate:"2024-01-29T21:00:01Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS
@gm42 gm42 added the bug Something isn't working label Mar 20, 2024
@gm42
Contributor Author

gm42 commented Mar 20, 2024

Forgot to mention: I haven't yet looked into the code for a possible bugfix, but if I find the cause I'll submit a PR.

@brownleej
Member

brownleej commented Mar 20, 2024

I suspect the issue is that restart_delay is an fdbmonitor parameter, rather than an fdbserver parameter, so it gets absorbed by fdbmonitor and doesn't get propagated to the fdbserver process. It might be best to add a dedicated configuration option for it in the CRD.
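
For illustration, in an fdbmonitor configuration file the restart parameters live in the [general] section and are consumed by fdbmonitor itself, while only the entries under the fdbserver sections become fdbserver command-line arguments. A rough sketch, not the exact file the operator renders:

[general]
# read by fdbmonitor itself, never passed to fdbserver
restart_delay = 15

[fdbserver]
command = /usr/bin/fdbserver
logdir = /var/log/fdb-trace-logs
# entries in this section become fdbserver arguments, e.g. --trace_format=json
trace_format = json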

@gm42
Contributor Author

gm42 commented Mar 20, 2024

It is probably as you say; I made this PR for an unrelated trivial improvement: #1977

But I could not find anything obviously wrong in GetMonitorProcessConfiguration() and its usage.

@johscheuer
Member

I suspect the issue is that restart_delay is an fdbmonitor parameter, rather than an fdbserver parameter, so it gets absorbed by fdbmonitor and doesn't get propagated to the fdbserver process. It might be best to add a dedicated configuration option for it in the CRD.

Instead of having another setting in the CRD it might be better to filter out those fdbmonitor parameters in GetStartCommandWithSubstitutions.
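
A rough sketch of what such a filter could look like, as standalone Go with an illustrative parameter list (the function and variable names are not the operator's actual API):

package main

import (
    "fmt"
    "strings"
)

// fdbmonitor-only parameters that should not be compared against the
// fdbserver command line (illustrative list).
var monitorOnlyParameters = map[string]bool{
    "restart_delay":                true,
    "initial_restart_delay":        true,
    "restart_backoff":              true,
    "restart_delay_reset_interval": true,
}

// filterMonitorParameters drops fdbmonitor-only entries from a list of
// custom parameters of the form "name=value".
func filterMonitorParameters(params []string) []string {
    filtered := make([]string, 0, len(params))
    for _, param := range params {
        name := strings.SplitN(param, "=", 2)[0]
        if monitorOnlyParameters[name] {
            continue
        }
        filtered = append(filtered, param)
    }
    return filtered
}

func main() {
    fmt.Println(filterMonitorParameters([]string{"trace_format=json", "restart_delay=15"}))
    // Output: [trace_format=json]
}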

@johscheuer
Member

I looked a bit more into this and we have to add a dedicated setting for this: https://apple.github.io/foundationdb/configuration.html#general-section:

These restart- parameters are not applicable to the fdbmonitor process itself. See Configuring autorestart of fdbmonitor for details.

Is there any specific use-case for you to change those parameters? I would like to better understand the need before adding another setting to our already complex FoundationDBCluster resource.

@gm42
Contributor Author

gm42 commented Mar 23, 2024

Is there any specific use-case for you to change those parameters? I would like to better understand the need before adding another setting to our already complex FoundationDBCluster resource.

It affects how quickly the operator can recover a cluster, and it interacts with the operator's own minimumUptimeSecondsForBounce. I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/var/fdb/data/fdb.cluster'.

Timed out fetching cluster status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - unknown
  Zones                  - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 02/06/24 11:00:12

Only killing all stateless processes, or sometimes the stateless plus storage processes, can kick it out of this state. The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

@johscheuer
Member

Sorry for the delay.

I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

Do you have a reproducible test case for this? I'm not sure I understand the issue. Is the case you're describing one where the cluster was previously running? I'm a bit surprised that the fdbserver processes would be crashing the whole time.

The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

If you have some tests/data for those cases and you are able to share them, that would be nice.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

Right now I would tend to keep it as it is and rather add a check to prevent a user from setting it. If you have data showing that a lower restart_delay helps in some cases, I'm not against changing the default, even though that would mean we have to be careful about how to propagate this change.
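
A minimal sketch of what such a check could look like, again as standalone Go with illustrative names (not the operator's actual validation code):

package main

import (
    "fmt"
    "strings"
)

// validateCustomParameters rejects parameters that fdbmonitor consumes
// itself and that therefore never reach the fdbserver command line
// (illustrative check, not the operator's actual implementation).
func validateCustomParameters(params []string) error {
    protected := map[string]bool{
        "restart_delay":                true,
        "initial_restart_delay":        true,
        "restart_backoff":              true,
        "restart_delay_reset_interval": true,
    }
    for _, param := range params {
        name := strings.SplitN(param, "=", 2)[0]
        if protected[name] {
            return fmt.Errorf("custom parameter %q is an fdbmonitor setting and cannot be overridden here", name)
        }
    }
    return nil
}

func main() {
    fmt.Println(validateCustomParameters([]string{"trace_format=json", "restart_delay=15"}))
}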

@gm42
Contributor Author

gm42 commented Apr 16, 2024

I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

Do you have a reproducible test case for this? I'm not sure I understand the issue. Is the case you're describing one where the cluster was previously running? I'm a bit surprised that the fdbserver processes would be crashing the whole time.

No crashes; this is the case of a previously running cluster where all pods are deleted (thus similar to #1984). The operator is able to bring them back online, however (for Kubernetes-side reasons) they do not all schedule at the same time, which I suspect is what leads to the cluster being in an unknown state. I can also check what the individual processes are logging and report back; IIRC they are stuck with some networking problem (I am using service as the routing mode).

The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

If you have some tests/data for those cases and you are able to share them, that would be nice.

I will try to reproduce it starting from a blank cluster, with steps; if I succeed I will open an issue.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

Right now I would tend to keep it as it is and rather add a check to prevent a user from setting it. If you have data showing that a lower restart_delay helps in some cases, I'm not against changing the default, even though that would mean we have to be careful about how to propagate this change.

Thanks, it's fine. I suspect that this setting is irrelevant for what I am testing.
