Rendering App Stuck In Alarm Since Traffic Move To Article Rendering App #10265

Closed
JamieB-gu opened this issue Jan 19, 2024 · 1 comment

JamieB-gu commented Jan 19, 2024

We've been seeing several 5XX alarms on the rendering Guardian app/CDK stack.

Why?

We believe these are triggered when:

$$ \mathrm{ObservedFailureRate} > \mathrm{Threshold} $$

where:

$$ \mathrm{ObservedFailureRate} = \frac{\mathrm{5XX_{Instance}} + \mathrm{5XX_{ELB}}}{\mathrm{2XX_{Instance}} + \mathrm{3XX_{Instance}} + \mathrm{4XX_{Instance}} + \mathrm{5XX_{Instance}} + \mathrm{4XX_{ELB}} + \mathrm{5XX_{ELB}}} $$

$\mathrm{5XX_{Instance}}$ means the count of 5XX responses from the instances over a given time period; the other variables are defined likewise for the other response codes and for the ELB.

Most traffic has moved over to the article-rendering app, so several counts, such as $\mathrm{2XX_{Instance}}$, have dropped. However, the count of $\mathrm{5XX_{ELB}}$ specifically doesn't seem to have changed noticeably, so a roughly constant number of failures is now divided by a much smaller total. The $\mathrm{ObservedFailureRate}$ is therefore higher and over the threshold, which triggers the alarm.
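
As a rough check on this reasoning, suppose (as a simplifying assumption, not something we have measured) that every count except $\mathrm{5XX_{ELB}}$ scales by the same factor $k$, with $0 < k \le 1$, when traffic moves away. The symbols here are introduced only for illustration: $e = \mathrm{5XX_{Instance}}$, $c = \mathrm{5XX_{ELB}}$, and $n$ is the sum of all the denominator terms other than $\mathrm{5XX_{ELB}}$ (so $n$ includes $e$). Then:

$$ r(k) = \frac{k e + c}{k n + c}, \qquad \frac{\mathrm{d}r}{\mathrm{d}k} = \frac{e(kn + c) - n(ke + c)}{(kn + c)^2} = \frac{c\,(e - n)}{(kn + c)^2} $$

Since $e < n$ whenever any response other than an instance 5XX is counted, the derivative is negative for $c > 0$: the rate rises as $k$ falls, and tends towards $1$ as $k \to 0$. Real traffic will not scale this uniformly, so this only indicates the direction of the effect.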

Background

We now have two Guardian apps/CDK stacks as of #9955 and guardian/frontend#26821:

  • rendering
  • article-rendering

article-rendering is serving article traffic[1] and rendering is serving everything else (fronts, interactives, etc.).

Possible Solutions

Adding the request path to the error messages here:

https://github.com/guardian/frontend/blob/0eb925c9e181825bb5da25ac500494d07c92a254/common/app/renderers/DotcomRenderingService.scala#L127

and/or here:

https://github.com/guardian/frontend/blob/0eb925c9e181825bb5da25ac500494d07c92a254/common/app/renderers/DotcomRenderingService.scala#L252

may help us to narrow down the pages causing the errors.
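
The linked code is Scala; the TypeScript sketch below only illustrates the idea and is not the actual change. It assumes a hypothetical `RenderFailure` shape, a hypothetical message format, and made-up paths, and shows how having the path available lets failures be grouped by page.

```typescript
// Hypothetical shape of a failed render request; not a real frontend type.
type RenderFailure = { status: number; path: string };

// Include the request path in the message so failures can be traced to a page.
// The wording of the message is an assumption for illustration.
const describeFailure = (failure: RenderFailure): string =>
	`Rendering request failed: status ${failure.status}, path ${failure.path}`;

// Count failures per path and return [path, count] pairs, most frequent first.
const failuresByPath = (
	failures: RenderFailure[],
): Array<[string, number]> => {
	const counts: Record<string, number> = {};
	for (const failure of failures) {
		counts[failure.path] = (counts[failure.path] || 0) + 1;
	}
	return Object.keys(counts)
		.map((path): [string, number] => [path, counts[path] || 0])
		.sort((a, b) => b[1] - a[1]);
};

// Made-up sample data, for illustration only.
const sampleFailures: RenderFailure[] = [
	{ status: 500, path: '/uk' },
	{ status: 502, path: '/us' },
	{ status: 500, path: '/uk' },
];
const rankedPaths = failuresByPath(sampleFailures);
```

With the path in each message, a log query can do the same grouping and point at the pages responsible for most of the 5XX responses.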

We could also temporarily change the $\mathrm{Threshold}$ to take us out of the alarm state as a mitigation.
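
To get a feel for how the threshold interacts with the shifted traffic, here is a minimal sketch of the formula from this issue. The type and function names, the counts, and the threshold value are all assumptions made up for illustration; none of them come from the real alarm configuration or real metrics.

```typescript
// Hypothetical container for the six counts used in the formula above.
type ResponseCounts = {
	instance2XX: number;
	instance3XX: number;
	instance4XX: number;
	instance5XX: number;
	elb4XX: number;
	elb5XX: number;
};

// ObservedFailureRate as defined in this issue:
// 5XX responses divided by all counted responses.
const observedFailureRate = (c: ResponseCounts): number => {
	const failures = c.instance5XX + c.elb5XX;
	const total =
		c.instance2XX +
		c.instance3XX +
		c.instance4XX +
		c.instance5XX +
		c.elb4XX +
		c.elb5XX;
	// With no responses at all, report 0 rather than NaN.
	return total === 0 ? 0 : failures / total;
};

const isInAlarm = (c: ResponseCounts, threshold: number): boolean =>
	observedFailureRate(c) > threshold;

// Made-up numbers: most counts drop tenfold, elb5XX stays the same.
const beforeMove: ResponseCounts = {
	instance2XX: 9000,
	instance3XX: 500,
	instance4XX: 400,
	instance5XX: 50,
	elb4XX: 30,
	elb5XX: 20,
};
const afterMove: ResponseCounts = {
	instance2XX: 900,
	instance3XX: 50,
	instance4XX: 40,
	instance5XX: 5,
	elb4XX: 3,
	elb5XX: 20,
};
const assumedThreshold = 0.01; // assumed value, not the real setting

console.log(
	isInAlarm(beforeMove, assumedThreshold),
	isInAlarm(afterMove, assumedThreshold),
);
```

With these made-up numbers the rate goes from 0.7% to roughly 2.5%, so a threshold would need to sit above the post-move rate to clear the alarm. That only hides the constant ELB 5XX count rather than explaining it, which is why this is a mitigation and not a fix.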

Footnotes

  1. From frontend's article microservice, plus some content from frontend's applications microservice, such as ArticleDesign.Picture.

@JamieB-gu JamieB-gu added this to the Health milestone Jan 19, 2024
@github-project-automation github-project-automation bot moved this to Triage in WebX Team Jan 19, 2024
@JamieB-gu JamieB-gu moved this from Triage to In Progress in WebX Team Jan 19, 2024
@JamieB-gu JamieB-gu moved this from In Progress to This Sprint in WebX Team Jan 19, 2024
@JamieB-gu JamieB-gu removed their assignment Jan 22, 2024
@JamieB-gu JamieB-gu moved this from This Sprint to In Progress in WebX Team Jan 25, 2024
@JamieB-gu
Contributor Author

We have temporarily updated the alarm configuration to prevent it triggering too often. We're going to investigate the underlying problem further in #10392.

@github-project-automation github-project-automation bot moved this from In Progress to Done in WebX Team Jan 30, 2024