Not quite sure this is a bug, but it's certainly something we need to document and keep in mind. Right now, the operator has a lock and there's only one reconciliation loop at a time. Each loop will handle exactly one CR, meaning that a CR for instance-a will block CR updates related to instance-b. It will also block further updates to instance-a, which is probably the reason why the lock exists in the first place.
Typically, this isn't a big problem, as CR updates are quick. Most of the time, it will block only until the deployment/dependencies are handled, which happens almost instantly once the images are pulled. However, when there's a configuration problem (see #670 for an example), this will block until the reconciliation times out after 5 minutes.
The lock code comes from the operator-sdk, and we might be able to change it to a per-instance lock, so that instance-b won't wait for instance-a to be applied. Updates to instance-a would still be blocked until a previous loop for instance-a has finished or timed out.
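For illustration, here's a minimal sketch of what a per-instance lock could look like, assuming a controller-runtime style reconciler; the `ReconcileJaeger` type, the `lockFor` helper, and the mutex map are all illustrative, not the operator's actual code:

```go
package controller

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// instanceLocks holds one mutex per custom resource, keyed by namespace/name,
// so reconciliations of different instances no longer serialize on one lock.
var instanceLocks sync.Map

// ReconcileJaeger stands in for the operator's reconciler type; the name and
// (empty) fields are assumptions for this sketch.
type ReconcileJaeger struct{}

// lockFor returns the mutex for the instance named in the request, creating
// it on first use.
func lockFor(req reconcile.Request) *sync.Mutex {
	mu, _ := instanceLocks.LoadOrStore(req.NamespacedName.String(), &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func (r *ReconcileJaeger) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	// Serialize per instance: instance-b proceeds while instance-a is still
	// being reconciled, but two loops for instance-a never overlap.
	mu := lockFor(req)
	mu.Lock()
	defer mu.Unlock()

	// ... actual reconciliation of the CR would go here ...
	return reconcile.Result{}, nil
}
```

As an aside (not from the original report): controller-runtime's `controller.Options` also exposes `MaxConcurrentReconciles`, which lets different keys reconcile in parallel while the workqueue still guarantees at most one in-flight loop per key, so a similar effect may be achievable without a hand-rolled lock.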
Right now, this is only a problem under error conditions, so it might not be that critical, but we certainly want to experiment with locking only the resources we need, and perhaps with sending a signal to the loop to cancel an in-flight reconciliation when a newer one is already waiting in the queue.
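One possible shape for the cancellation idea, again as a hedged sketch: keep a `context.CancelFunc` per instance and cancel the in-flight loop when a newer event for the same key arrives. All names here are hypothetical, and the reconciler would have to check `ctx.Done()` at safe points for the cancellation to take effect:

```go
package controller

import (
	"context"
	"sync"
)

// inflight tracks the cancel function for the reconciliation currently
// running for each instance, keyed by namespace/name.
var (
	inflightMu sync.Mutex
	inflight   = map[string]context.CancelFunc{}
)

// startReconcile cancels any loop already running for key and returns a
// fresh context (plus its cancel func) for the new one.
func startReconcile(key string) (context.Context, context.CancelFunc) {
	inflightMu.Lock()
	defer inflightMu.Unlock()
	if cancel, ok := inflight[key]; ok {
		cancel() // a newer event supersedes the in-flight loop
	}
	ctx, cancel := context.WithCancel(context.Background())
	inflight[key] = cancel
	return ctx, cancel
}
```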
Good question. I don't remember seeing this in the scaffold for the OpenTelemetry operator, but I do remember reading about different leader election options.
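For reference, these are likely the options in question: the operator-sdk ships a leader-for-life helper (`leader.Become`), while controller-runtime offers lease-based leader election configured on the manager. A minimal sketch of the latter, with placeholder ID/namespace values:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Lease-based leader election from controller-runtime; the ID and
	// namespace below are placeholders, not the operator's real values.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "jaeger-operator-lock",
		LeaderElectionNamespace: "observability",
	})
	if err != nil {
		panic(err)
	}

	// Start blocks; reconcilers registered with this manager only run
	// while this replica holds the lease.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```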