
Do not exec into pods during rolling update #308

Merged

merged 3 commits into main from fix-error-after-config-map-change on Sep 15, 2020

Conversation

@ansd (Member) commented Sep 2, 2020

This closes #304

Exec into pods only if StatefulSet is ready and up to date.

Before this commit, we observed in #304 that the controller tried to exec into pods at the same time as the pods were being updated due to a StatefulSet restart, resulting in connection errors.

The main change is to not set plugins if

sts.Status.ReadyReplicas < desiredReplicas || sts.Status.UpdatedReplicas < desiredReplicas

because during a StatefulSet restart, it happened multiple times in #304 that sts.Status.ReadyReplicas == desiredReplicas held, so the exec commands were executed and the connection then got interrupted because the pod was being updated.

The main finding is that sts.Status.ReadyReplicas == desiredReplicas can be true although the StatefulSet rolling update is still ongoing:

$ while true; do  kubectl get statefulsets.apps definition-rabbitmq-server --template={{.status}}; echo "" ; sleep 1; done

map[collisionCount:0 currentReplicas:2 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:3 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d]
…
map[collisionCount:0 currentReplicas:2 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:2 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d]
…
map[collisionCount:0 currentReplicas:2 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:2 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d updatedReplicas:1]
…
map[collisionCount:0 currentReplicas:1 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:3 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d updatedReplicas:1]
…
map[collisionCount:0 currentReplicas:1 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:2 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d updatedReplicas:1]
…
map[collisionCount:0 currentReplicas:1 currentRevision:definition-rabbitmq-server-56b65fdc4f observedGeneration:5 readyReplicas:2 replicas:3 updateRevision:definition-rabbitmq-server-6f667f5f7d updatedReplicas:2]
…
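
For illustration, a minimal sketch of this guard in Go (hypothetical function and package names; the actual operator code may differ):

package controllers

import appsv1 "k8s.io/api/apps/v1"

// allReplicasReadyAndUpdated reports whether it is safe to exec into pods:
// the StatefulSet must be both fully ready and fully updated, because
// ReadyReplicas alone can equal the desired count while a rolling update
// is still replacing pods, as the status output above shows.
func allReplicasReadyAndUpdated(sts *appsv1.StatefulSet) bool {
	desiredReplicas := int32(1)
	if sts.Spec.Replicas != nil {
		desiredReplicas = *sts.Spec.Replicas
	}
	return sts.Status.ReadyReplicas == desiredReplicas &&
		sts.Status.UpdatedReplicas == desiredReplicas
}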

Thank you @Gsantomaggio for helping troubleshoot 🙂

@mkuratczyk (Collaborator)

Why do we even try to run rabbitmq-plugins set if the cluster is restarting? Since #224 is done, we should only exec into the container to enable/disable plugins if the plugins ConfigMap is changed while the cluster is running. If the cluster configuration is changed (a different ConfigMap), we should restart the cluster and it should load plugins on startup based on the plugins ConfigMap - no need to run rabbitmq-plugins set at all. What am I missing? :)
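
(For illustration only: one way to detect "the plugins ConfigMap changed" is to compare a deterministic hash of its data against the value recorded at the previous reconcile. This is a hypothetical helper, not the operator's actual implementation.)

package controllers

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// configMapDataHash hashes a ConfigMap's data in a stable key order, so a
// caller can compare it with the previously stored hash and only run
// 'rabbitmq-plugins set' when the plugins ConfigMap actually changed.
func configMapDataHash(cm *corev1.ConfigMap) string {
	keys := make([]string, 0, len(cm.Data))
	for k := range cm.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order keeps the hash deterministic
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(cm.Data[k]))
	}
	return hex.EncodeToString(h.Sum(nil))
}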

@Gsantomaggio (Member) commented Sep 3, 2020

It doesn't crash anymore, but I think there is still a problem.

I tried to change the config map by changing some values:
[Screenshot from 2020-09-03 18-40-27]

Changed:

queue_master_locator = random
cluster_formation.node_cleanup.interval = 50

Then if you check the config map, the values haven't changed:

$ kubectl get configmaps definition-rabbitmq-server-conf -o yaml | grep queue_master_locator
queue_master_locator                            = min-masters

@ansd (Member, Author) commented Sep 3, 2020

@mkuratczyk as we discussed this morning, you're right, we should only exec into the container to enable/disable plugins if the plugins ConfigMap is changed. We can and should still use the same guards as in this PR to only exec if all replicas are ready and not in the middle of a rolling update.

@ansd (Member, Author) commented Sep 4, 2020

@Gsantomaggio that's desired behaviour.

You changed the configmap directly (e.g. via kubectl edit configmap definition-rabbitmq-server-conf).
Your configmap did change, but your changes were overwritten shortly thereafter by the controller's reconcile loop.

You should instead update the RabbitmqCluster custom resource, setting additionalConfig like this:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
...
spec:
  rabbitmq:
    additionalConfig: |
      queue_master_locator = random
      cluster_formation.node_cleanup.interval = 50

@ansd ansd force-pushed the fix-error-after-config-map-change branch from 37200e7 to 5a28d29 Compare September 8, 2020 08:04
@Gsantomaggio (Member)

Ok, thank you @ansd

So why does the operator reboot the pods when the config map is changed?
I mean, if the original configuration is restored, the operator should not reboot the pods, right? Or am I missing something? :)

@ferozjilla ferozjilla self-requested a review September 9, 2020 14:36
@ansd (Member, Author) commented Sep 9, 2020

Hey @Gsantomaggio 🙂

Whenever the configuration is updated, the operator restarts the StatefulSet.

That's desired behaviour because https://www.rabbitmq.com/configure.html states that

rabbitmq.conf and advanced.config changes take effect after a node restart

This logic happens in:

if builder.UpdateRequiresStsRestart() && operationResult == controllerutil.OperationResultUpdated {

In your specific case, you're right that the operator unnecessarily restarts the StatefulSet. So that's a side effect of changing the configuration in the wrong way.
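
(As an aside, a common pattern for triggering such a rolling restart from an operator is to bump a pod-template annotation. A minimal sketch with hypothetical names, not necessarily how this operator implements it:)

package controllers

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// triggerRollingRestart modifies the StatefulSet's pod template, which makes
// the StatefulSet controller replace the pods one by one, so RabbitMQ nodes
// pick up the new rabbitmq.conf on startup.
func triggerRollingRestart(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	if sts.Spec.Template.Annotations == nil {
		sts.Spec.Template.Annotations = map[string]string{}
	}
	sts.Spec.Template.Annotations["example.com/restartedAt"] = time.Now().Format(time.RFC3339)
	return c.Update(ctx, sts)
}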

@ansd ansd force-pushed the fix-error-after-config-map-change branch from 5a28d29 to 6e8f8b1 Compare September 11, 2020 12:48
@Zerpet Zerpet self-assigned this Sep 15, 2020
ansd and others added 3 commits September 15, 2020 14:32
Fixes #304

Exec on pods only if StatefulSet is ready and up to date.

Before this commit, we observed in #304 that the controller tried to
exec into pods at the same time as the pods got updated due to a StatefulSet restart
resulting in connection errors.
Relates to #304

Before this commit, the controller exec'ed into every RabbitMQ cluster pod in
every reconcile loop to idempotently run 'rabbitmq-plugins set'.
Although simple and correct, these were unnecessary and expensive operations.

After this commit, the controller only execs into the pods if the
plugins config map got updated.
Importing an internal library into the tests guarantees that the test will
always pass, which weakens the test slightly: we want this test to become red
if we change the resource name, because that would be a user-facing change,
i.e. a change in the expectation of what our product creates.
@Zerpet Zerpet force-pushed the fix-error-after-config-map-change branch from 6e8f8b1 to f29f6bb Compare September 15, 2020 14:32
@Zerpet (Collaborator) commented Sep 15, 2020

I made the change in the system tests to not import the internal library and use a test constant. Given that we are 3 people at the moment and 2/3 have reviewed this, I will merge the PR when the checks pass.

@Zerpet Zerpet merged commit 8b92838 into main Sep 15, 2020
@Zerpet Zerpet deleted the fix-error-after-config-map-change branch September 15, 2020 16:14
Successfully merging this pull request may close these issues:

Reconcile Error after change the Config Map (#304)