We have temporarily updated the alarm configuration to prevent it triggering too often. We're going to investigate the underlying problem further in #10392.
We've been seeing several 5XX alarms on the `rendering` Guardian app/CDK stack.
Why?
We believe these are triggered when:
$\mathrm{ObservedFailureRate} > \mathrm{Threshold}$
where:
$\mathrm{5XX_{Instance}}$ means the count of `5XX` responses from the instances over a given time period, and the other variables likewise for other response codes and the ELB.
Most traffic has moved over to the `article-rendering` app, which means several counts like $\mathrm{2XX_{Instance}}$ have dropped. However, the count of $\mathrm{5XX_{ELB}}$ specifically doesn't seem to have changed noticeably, so the $\mathrm{ObservedFailureRate}$ is now higher, and over the threshold, which triggers the alarm.
Background
We now have two Guardian apps/CDK stacks as of #9955 and guardian/frontend#26821:
- `rendering`
- `article-rendering`
`article-rendering` is serving article traffic[1] and `rendering` is serving everything else (fronts, interactives etc.).
Possible Solutions
Adding the request path to the error messages here:
https://github.com/guardian/frontend/blob/0eb925c9e181825bb5da25ac500494d07c92a254/common/app/renderers/DotcomRenderingService.scala#L127
and/or here:
https://github.com/guardian/frontend/blob/0eb925c9e181825bb5da25ac500494d07c92a254/common/app/renderers/DotcomRenderingService.scala#L252
may help us to narrow down the pages causing the errors.
We could also temporarily change the $\mathrm{Threshold}$ to take us out of the alarm state as a mitigation.
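
To make the reasoning concrete, here is a rough sketch of how a roughly constant ELB 5XX count can push the failure rate over the threshold once overall traffic drops, and how raising the threshold would clear the alarm. It is illustrative only: the formula in `observedFailureRate` (all 5XX responses divided by all responses) is an assumption rather than the actual alarm expression, and every count and threshold value below is made up.

```typescript
// Hypothetical response counts for one alarm evaluation period.
interface ResponseCounts {
  elb5xx: number; // 5XX_ELB: errors generated by the load balancer itself
  instance2xx: number; // 2XX_Instance
  instance3xx: number; // 3XX_Instance
  instance4xx: number; // 4XX_Instance
  instance5xx: number; // 5XX_Instance
}

// ASSUMPTION: failure rate = all 5XX responses / all responses.
// The real alarm expression may be defined differently.
function observedFailureRate(c: ResponseCounts): number {
  const failures = c.elb5xx + c.instance5xx;
  const total =
    c.elb5xx + c.instance2xx + c.instance3xx + c.instance4xx + c.instance5xx;
  return total === 0 ? 0 : failures / total;
}

// The alarm fires when the observed rate is strictly over the threshold.
function isInAlarm(c: ResponseCounts, threshold: number): boolean {
  return observedFailureRate(c) > threshold;
}

// Made-up numbers: before the move, `rendering` serves all traffic.
const beforeMove: ResponseCounts = {
  elb5xx: 20,
  instance2xx: 99000,
  instance3xx: 500,
  instance4xx: 470,
  instance5xx: 10,
};

// After the move, instance counts drop about tenfold but 5XX_ELB does not.
const afterMove: ResponseCounts = {
  elb5xx: 20,
  instance2xx: 9800,
  instance3xx: 100,
  instance4xx: 75,
  instance5xx: 5,
};

const originalThreshold = 0.001; // hypothetical value
const raisedThreshold = 0.005; // hypothetical temporary mitigation

console.log(isInAlarm(beforeMove, originalThreshold)); // false (30 / 100000)
console.log(isInAlarm(afterMove, originalThreshold)); // true (25 / 10000)
console.log(isInAlarm(afterMove, raisedThreshold)); // false
```

In this sketch the absolute number of failures actually falls slightly after the move, yet the rate rises because the denominator shrinks much faster. That is why a threshold change only masks the alarm and the constant ELB 5XX count still needs investigating.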
Footnotes
[1] From `frontend`'s `article` microservice, plus some content from `applications` like `ArticleDesign.Picture`.