Using a restart_delay override creates a cluster with a persistent IncorrectProcess reconciliation problem #1976

Closed
gm42 opened this issue Mar 20, 2024 · 8 comments · Fixed by #1988
Labels
bug Something isn't working

Comments

@gm42
Contributor

gm42 commented Mar 20, 2024

What happened?

Creating a cluster with a restart_delay custom parameter override leads to the very first pods being created with the IncorrectCommandline / IncorrectProcess problem.

This is what the logs for this newly created cluster contain (filtered for IncorrectProcess):

{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.180.127:4501 --locality_instance_id=coordinator-1 --locality_machineid=ip-10-20-186-38.us-east-1.compute.internal --locality_zoneid=ip-10-20-186-38.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.180.127:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.180.127:4501 --locality_instance_id=coordinator-1 --locality_machineid=ip-10-20-186-38.us-east-1.compute.internal --locality_zoneid=ip-10-20-186-38.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.180.127:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-2","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.120.70:4501 --locality_instance_id=coordinator-3 --locality_machineid=ip-10-20-161-74.us-east-1.compute.internal --locality_zoneid=ip-10-20-161-74.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.120.70:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.120.70:4501 --locality_instance_id=coordinator-3 --locality_machineid=ip-10-20-161-74.us-east-1.compute.internal --locality_zoneid=ip-10-20-161-74.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.120.70:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"coordinator-3","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=log --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.40.39:4501 --locality_instance_id=log-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.40.39:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=log --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.40.39:4501 --locality_instance_id=log-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.40.39:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"log-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=stateless --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.177.132:4501 --locality_instance_id=stateless-1 --locality_machineid=ip-10-20-72-89.us-east-1.compute.internal --locality_zoneid=ip-10-20-72-89.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.177.132:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=stateless --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.177.132:4501 --locality_instance_id=stateless-1 --locality_machineid=ip-10-20-72-89.us-east-1.compute.internal --locality_zoneid=ip-10-20-72-89.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.177.132:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"stateless-1","emptyMonitorConf":false}
{"level":"info","ts":"2024-03-20T14:16:01Z","logger":"controller","msg":"IncorrectProcess","namespace":"mynamespace","cluster":"mycluster","reconciler":"controllers.updateStatus","expected":"/usr/bin/fdbserver --class=storage --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.190.145:4501 --locality_instance_id=storage-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.190.145:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","got":"/usr/bin/fdbserver --class=storage --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.190.145:4501 --locality_instance_id=storage-1 --locality_machineid=ip-10-20-185-12.us-east-1.compute.internal --locality_zoneid=ip-10-20-185-12.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.190.145:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json","expectedVersion":"7.1.15","version":"7.1.15","processGroupID":"storage-1","emptyMonitorConf":false}

The issue does not resolve itself: even deleting the pod leads to the operator recreating it with the still-incorrect command line.
Looking at the log lines, this is the difference:

-/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --restart_delay=15 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json
+/usr/bin/fdbserver --class=coordinator --cluster_file=/var/fdb/data/fdb.cluster --datadir=/var/fdb/data --listen_address=10.20.144.241:4501 --locality_instance_id=coordinator-2 --locality_machineid=ip-10-20-153-189.us-east-1.compute.internal --locality_zoneid=ip-10-20-153-189.us-east-1.compute.internal --logdir=/var/log/fdb-trace-logs --loggroup=mycluster --public_address=10.20.144.241:4501 --seed_cluster_file=/var/dynamic-conf/fdb.cluster --trace_format=json

What did you expect to happen?

A new cluster is created without IncorrectProcess problems.

(Unless I am doing something wrong which is leading to this, and completely missing it)

How can we reproduce it (as minimally and precisely as possible)?

Create this cluster, then observe the operator logs:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: mycluster
  namespace: mynamespace
spec:
  version: 7.1.15
  coordinatorSelection:
  - priority: 100
    processClass: coordinator
  databaseConfiguration:
    logs: 1
    storage: 1
    redundancy_mode: single
    storage_engine: ssd-2
  processCounts:
    coordinator: 3
    stateless: 1
    storage: 1
    log: 1
  processes:
    general:
      customParameters:
        - trace_format=json
        - restart_delay=15
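
To see the mismatch, the operator log can be filtered for IncorrectProcess and compared against the command line actually running inside a pod. A sketch of the commands, assuming the operator is deployed as fdb-kubernetes-operator-controller-manager and using an illustrative pod name:

# Operator side: filter the reconciliation log for the mismatch
kubectl logs -n mynamespace deploy/fdb-kubernetes-operator-controller-manager | grep IncorrectProcess

# Pod side: inspect the fdbserver command line that is actually running
# (assumes ps is available in the container image)
kubectl exec -n mynamespace mycluster-coordinator-1 -- ps -C fdbserver -o args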

Anything else we need to know?

No response

FDB Kubernetes operator

v1.33.0

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.16-eks-508b6b3", GitCommit:"ea12af9a627fef6922a0aa5fff59660ec8657d13", GitTreeState:"clean", BuildDate:"2024-01-29T21:00:01Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS
@gm42 gm42 added the bug Something isn't working label Mar 20, 2024
@gm42
Contributor Author

gm42 commented Mar 20, 2024

Forgot to mention: I haven't yet looked into the code for a possible bugfix, but if I find the cause I'll submit a PR.

@brownleej
Member

brownleej commented Mar 20, 2024

I suspect the issue is that restart_delay is an fdbmonitor parameter, rather than an fdbserver parameter, so it gets absorbed by fdbmonitor and doesn't get propagated to the fdbserver process. It might be best to add a dedicated configuration option for it in the CRD.
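
For illustration, in an fdbmonitor configuration file the restart parameters live in the [general] section and are consumed by fdbmonitor itself, while only the entries under the fdbserver sections become fdbserver command-line arguments. A rough sketch, not the exact file the operator renders:

[general]
# read by fdbmonitor itself, never passed to fdbserver
restart_delay = 15

[fdbserver]
command = /usr/bin/fdbserver
logdir = /var/log/fdb-trace-logs
# entries in this section become fdbserver arguments, e.g. --trace_format=json
trace_format = json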

@gm42
Contributor Author

gm42 commented Mar 20, 2024

It is probably as you say; I made this PR for an unrelated trivial improvement: #1977

But I could not find anything obviously wrong in GetMonitorProcessConfiguration() and its usage.

@johscheuer
Member

I suspect the issue is that restart_delay is an fdbmonitor parameter, rather than an fdbserver parameter, so it gets absorbed by fdbmonitor and doesn't get propagated to the fdbserver process. It might be best to add a dedicated configuration option for it in the CRD.

Instead of having another setting in the CRD it might be better to filter out those fdbmonitor parameters in GetStartCommandWithSubstitutions.
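
A rough sketch of what such a filter could look like, as standalone Go with an illustrative parameter list (the function and variable names are not the operator's actual API):

package main

import (
    "fmt"
    "strings"
)

// fdbmonitor-only parameters that should not be compared against the
// fdbserver command line (illustrative list).
var monitorOnlyParameters = map[string]bool{
    "restart_delay":                true,
    "initial_restart_delay":        true,
    "restart_backoff":              true,
    "restart_delay_reset_interval": true,
}

// filterMonitorParameters drops fdbmonitor-only entries from a list of
// custom parameters of the form "name=value".
func filterMonitorParameters(params []string) []string {
    filtered := make([]string, 0, len(params))
    for _, param := range params {
        name := strings.SplitN(param, "=", 2)[0]
        if monitorOnlyParameters[name] {
            continue
        }
        filtered = append(filtered, param)
    }
    return filtered
}

func main() {
    fmt.Println(filterMonitorParameters([]string{"trace_format=json", "restart_delay=15"}))
    // Output: [trace_format=json]
}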

@johscheuer
Member

I looked a bit more into this and we have to add a dedicated setting for this: https://apple.github.io/foundationdb/configuration.html#general-section:

These restart- parameters are not applicable to the fdbmonitor process itself. See Configuring autorestart of fdbmonitor for details.

Is there any specific use-case for you to change those parameters? I would like to better understand the need before adding another setting to our already complex FoundationDBCluster resource.

@gm42
Contributor Author

gm42 commented Mar 23, 2024

Is there any specific use-case for you to change those parameters? I would like to better understand the need before adding another setting to our already complex FoundationDBCluster resource.

It affects how quickly the operator can recover a cluster, and it interacts with the operator's own minimumUptimeSecondsForBounce. I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/var/fdb/data/fdb.cluster'.

Timed out fetching cluster status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - unknown
  Zones                  - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 02/06/24 11:00:12

Only killing all stateless processes, or sometimes the stateless plus storage processes, can kick it out of this state. The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

@johscheuer
Member

Sorry for the delay.

I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

Do you have a reproducible test case for this? I'm not sure I understand the issue. Is the case you're describing one where the cluster was previously running? I'm a bit surprised that the fdbserver processes would be crashing the whole time.

The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

If you have some tests/data for those cases and you are able to share them, that would be nice.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

Right now I would tend to keep it as it is and rather add a check to prevent a user from setting it. If you have data showing that a lower restart_delay helps in some cases, I'm not against changing the default, even though that would mean we have to be careful about how to propagate this change.
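
A minimal sketch of what such a check could look like, again as standalone Go with illustrative names (not the operator's actual validation code):

package main

import (
    "fmt"
    "strings"
)

// validateCustomParameters rejects parameters that fdbmonitor consumes
// itself and that therefore never reach the fdbserver command line
// (illustrative check, not the operator's actual implementation).
func validateCustomParameters(params []string) error {
    protected := map[string]bool{
        "restart_delay":                true,
        "initial_restart_delay":        true,
        "restart_backoff":              true,
        "restart_delay_reset_interval": true,
    }
    for _, param := range params {
        name := strings.SplitN(param, "=", 2)[0]
        if protected[name] {
            return fmt.Errorf("custom parameter %q is an fdbmonitor setting and cannot be overridden here", name)
        }
    }
    return nil
}

func main() {
    fmt.Println(validateCustomParameters([]string{"trace_format=json", "restart_delay=15"}))
}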

@gm42
Contributor Author

gm42 commented Apr 16, 2024

I was changing these restart parameters while troubleshooting an issue where clusters come up as non-available depending on the order in which pods start up: if all pods start at roughly the same time, the cluster usually comes up healthy and available without problems, but when the issue presents itself the cluster is in this status:

Do you have a reproducible test case for this? I'm not sure I understand the issue. Is the case you're describing one where the cluster was previously running? I'm a bit surprised that the fdbserver processes would be crashing the whole time.

No crashes; this is the case of a previously running cluster where all pods are deleted (thus similar to #1984). The operator is able to bring them back online, however (for Kubernetes-side reasons) they do not all schedule at the same time, which I suspect is what leads to the cluster being in an unknown state. I can also check what the individual processes are logging and report back; IIRC they are stuck with some networking problem (I am using service as the routing mode).

The operator also does not seem able to recover it without manually killing these processes, or perhaps not quickly enough (I am still testing), which is why I was tuning those parameters.

If you have some tests/data for those cases and you are able to share them, that would be nice.

I will try to reproduce it starting from a blank cluster, with steps; if I succeed I will open an issue.

It is not particularly important for me to tune this now; I can run a forked version anyway if I need to. But if you decide against making it configurable, I suggest turning an override of this parameter into a validation error, so that the issue is easier and faster to identify.

Right now I would tend to keep it as it is and rather add a check to prevent a user from setting it. If you have data showing that a lower restart_delay helps in some cases, I'm not against changing the default, even though that would mean we have to be careful about how to propagate this change.

Thanks, it's fine. I suspect that this setting is irrelevant for what I am testing.
