Description
Currently, the Prometheus rule manager restores the `for` state of rule groups only after a restart. This is fine for Prometheus. In Cortex, however, a rule group can move from one ruler instance (r1) to another (r2) because of resharding. If r2 is already evaluating rule groups for that tenant, its manager does not restore the `for` state, and alerts end up in an incorrect state. For example, an alert can go from `FIRING` back to `PENDING`.
To Reproduce
1. Create rules for a tenant with a shard size greater than 1. For ease of testing, all ruler instances were running rules for the tenant.
2. Wait for an alerting rule to go into `FIRING`.
3. Restart the instance that was evaluating the alerting rule. This assumes the ruler takes long enough to restart that another ruler gets a chance to evaluate the alerting rule at least once.
4. The alerting rule goes to `PENDING`.
Expected behavior
The alerting rule should stay in the `FIRING` state.
Additional Context
There is an open Prometheus PR that addresses this issue. Until that PR is approved, this issue is difficult to fix.
rajagopalanand changed the title from "Ruler do not consistently restore for state" to "Ruler does not consistently restore for state" on Dec 29, 2024