You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When a pod crashes, it takes approximately 30 seconds to restart and begin a new cycle. In our environment, we observed that when Cloudbeat crashes, another pod takes over and becomes the new leader. However, since the new leader has already started its cycle, we are missing data in the findings index during the transition period.
By increasing the LeaseDuration, we can create a stickiness effect, ensuring that the lease remains assigned to the same pod after it crashes and is restarted by the Kubernetes infrastructure.
Definition of done
Consider changing the LeaseDuration into a number that will survive the crash.
Tradeoff
In high-availability systems where rapid failover is critical short LeaseDuration is recommended (probably not our case).
Motivation
When a pod crashes, it takes approximately 30 seconds to restart and begin a new cycle. In our environment, we observed that when Cloudbeat crashes, another pod takes over and becomes the new leader. However, since the new leader has already started its cycle, we are missing data in the findings index during the transition period.
By increasing the
LeaseDuration
, we can create a stickiness effect, ensuring that the lease remains assigned to the same pod after it crashes and is restarted by the Kubernetes infrastructure.Definition of done
LeaseDuration
into a number that will survive the crash.Tradeoff
In high-availability systems where rapid failover is critical short
LeaseDuration
is recommended (probably not our case).Issues
The text was updated successfully, but these errors were encountered: