Not quite sure this is a bug, but it's certainly something we need to document and keep in mind. Right now, the operator has a lock and there's only one reconciliation loop at a time. Each loop will handle exactly one CR, meaning that a CR for instance-a will block CR updates related to instance-b. It will also block further updates to instance-a, which is probably the reason why the lock exists in the first place.
Typically, this isn't a big problem, as CR updates are quick. Most of the time, it will block only until the deployment/dependencies are handled, which happens almost instantly once the images are pulled. However, when there's a configuration problem (see #670 for an example), this will block until the reconciliation times out after 5 minutes.
The lock code comes from the operator-sdk, and we might be able to change it to a per-instance lock, so that instance-b won't wait for instance-a to be applied. Updates to instance-a would still be blocked until a previous loop for instance-a has finished or timed out.
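For illustration, here's a minimal sketch of what a per-instance lock could look like, assuming a controller-runtime style reconciler; the `ReconcileJaeger` type, the `lockFor` helper, and the mutex map are all illustrative, not the operator's actual code:

```go
package controller

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// instanceLocks holds one mutex per custom resource, keyed by namespace/name,
// so reconciliations of different instances no longer serialize on one lock.
var instanceLocks sync.Map

// ReconcileJaeger stands in for the operator's reconciler type; the name and
// (empty) fields are assumptions for this sketch.
type ReconcileJaeger struct{}

// lockFor returns the mutex for the instance named in the request, creating
// it on first use.
func lockFor(req reconcile.Request) *sync.Mutex {
	mu, _ := instanceLocks.LoadOrStore(req.NamespacedName.String(), &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func (r *ReconcileJaeger) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	// Serialize per instance: instance-b proceeds while instance-a is still
	// being reconciled, but two loops for instance-a never overlap.
	mu := lockFor(req)
	mu.Lock()
	defer mu.Unlock()

	// ... actual reconciliation of the CR would go here ...
	return reconcile.Result{}, nil
}
```

As an aside (not from the original report): controller-runtime's `controller.Options` also exposes `MaxConcurrentReconciles`, which lets different keys reconcile in parallel while the workqueue still guarantees at most one in-flight loop per key, so a similar effect may be achievable without a hand-rolled lock.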
Right now, this is only a problem under error conditions, so it might not be that critical, but we certainly want to experiment with locking only the resources we need, and perhaps with sending a signal to the loop to cancel an in-flight reconciliation when a newer one is already waiting in the queue.
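One possible shape for the cancellation idea, again as a hedged sketch: keep a `context.CancelFunc` per instance and cancel the in-flight loop when a newer event for the same key arrives. All names here are hypothetical, and the reconciler would have to check `ctx.Done()` at safe points for the cancellation to take effect:

```go
package controller

import (
	"context"
	"sync"
)

// inflight tracks the cancel function for the reconciliation currently
// running for each instance, keyed by namespace/name.
var (
	inflightMu sync.Mutex
	inflight   = map[string]context.CancelFunc{}
)

// startReconcile cancels any loop already running for key and returns a
// fresh context (plus its cancel func) for the new one.
func startReconcile(key string) (context.Context, context.CancelFunc) {
	inflightMu.Lock()
	defer inflightMu.Unlock()
	if cancel, ok := inflight[key]; ok {
		cancel() // a newer event supersedes the in-flight loop
	}
	ctx, cancel := context.WithCancel(context.Background())
	inflight[key] = cancel
	return ctx, cancel
}
```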
Good question. I don't remember seeing this in the scaffold for the OpenTelemetry operator, but I do remember reading about different leader election options.
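For reference, these are likely the options in question: the operator-sdk ships a leader-for-life helper (`leader.Become`), while controller-runtime offers lease-based leader election configured on the manager. A minimal sketch of the latter, with placeholder ID/namespace values:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Lease-based leader election from controller-runtime; the ID and
	// namespace below are placeholders, not the operator's real values.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "jaeger-operator-lock",
		LeaderElectionNamespace: "observability",
	})
	if err != nil {
		panic(err)
	}

	// Start blocks; reconcilers registered with this manager only run
	// while this replica holds the lease.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```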