
Spin up multiple AWS stacks to stop DCR being a single source of failure in our request path #8351

Open · 4 of 6 tasks
jamesgorrie opened this issue Jul 24, 2023 · 4 comments

@jamesgorrie
Contributor

jamesgorrie commented Jul 24, 2023

Depends on #7614
Depends on #9310

The issue

DCR is a single point of failure in our infrastructure design (AWS): it serves content to multiple swimlaned micro-services (frontend/article, frontend/facia, frontend/applications), all of which are rendered via DCR's single service.

This creates a host of issues, of which we are starting to see real-world examples (illustrated below). We should address them before they become more impactful, as we start to serve more traffic to the service via the apps rendering work.

Current request flow

```mermaid
---
title: Current request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp-->DcrLB
    FEFaciaApp-->DcrLB
    FEApplicationsApp-->DcrLB
    DcrLB-->DCR
```

Solutions

Co-Authored-By: @AshCorr
Co-Authored-By: @arelra

While we could create completely new apps for each service, that would mean a lot of upfront work before addressing the immediate issue of DCR being a bottleneck.

We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:

Tasklist


As mentioned above, @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.

The ongoing work on making apps webviews available via DCR is already uncovering how we might architect the application itself towards a more micro-frontend structure, but that is out of scope for this issue.

Suggested request flow

Option 1: One LB per app

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrArticleLB-->DcrApp1
    FEFaciaApp--/Front-->DcrFaciaLB-->DcrApp2
    FEApplicationsApp--/Interactives-->DcrApplicationLB-->DcrApp3
```
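For illustration only, here is a rough sketch of what Option 1 might look like in CDK. It uses plain aws-cdk-lib constructs rather than our actual patterns, and every name, instance type and port below is a placeholder:

```ts
// Rough sketch of Option 1: one load balancer + ASG pair per rendering
// concern, so article, facia and applications traffic scale and fail
// independently. Placeholder names/values throughout.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

export class DcrRenderingStack extends Stack {
  constructor(scope: Construct, id: string, vpc: ec2.IVpc, props?: StackProps) {
    super(scope, id, props);

    for (const app of ['article', 'facia', 'applications']) {
      const asg = new autoscaling.AutoScalingGroup(this, `dcr-${app}-asg`, {
        vpc,
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.LARGE),
        machineImage: ec2.MachineImage.latestAmazonLinux2(),
        minCapacity: 3, // each app gets its own scaling floor
      });

      const lb = new elbv2.ApplicationLoadBalancer(this, `dcr-${app}-lb`, {
        vpc,
        internetFacing: false, // only the frontend apps talk to DCR
      });

      lb.addListener(`dcr-${app}-listener`, { port: 80 })
        .addTargets(`dcr-${app}-targets`, { port: 9000, targets: [asg] });
    }
  }
}
```

Splitting per concern like this means a slow /Interactives render can exhaust its own fleet without touching articles or fronts.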

Option 2: One LB for all DCR apps

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrLB-->DcrArticleApp1
    FEFaciaApp--/Front-->DcrLB-->DcrFaciaApp2
    FEApplicationsApp--/Interactives-->DcrLB-->DcrApplicationsApp3
```
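And a similarly rough sketch of Option 2: one shared ALB with path-based listener rules forwarding /Article, /Front and /Interactives to separate target groups. Again this is plain aws-cdk-lib with placeholder names and ports, not our real setup, and it assumes the per-app ASGs are defined elsewhere (e.g. as in the Option 1 sketch above):

```ts
// Rough sketch of Option 2: a single shared ALB, with listener rules routing
// by path to per-app target groups backed by separate ASGs.
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

export function addSharedDcrLoadBalancer(
  scope: Construct,
  vpc: ec2.IVpc,
  asgs: {
    article: autoscaling.AutoScalingGroup;
    facia: autoscaling.AutoScalingGroup;
    applications: autoscaling.AutoScalingGroup;
  },
): elbv2.ApplicationLoadBalancer {
  const lb = new elbv2.ApplicationLoadBalancer(scope, 'dcr-lb', { vpc, internetFacing: false });

  const listener = lb.addListener('dcr-listener', {
    port: 80,
    // Requests matching no rule get a fixed 404 rather than hitting an app.
    defaultAction: elbv2.ListenerAction.fixedResponse(404),
  });

  const routes: Array<[string, string, autoscaling.AutoScalingGroup]> = [
    ['dcr-article', '/Article*', asgs.article],
    ['dcr-facia', '/Front*', asgs.facia],
    ['dcr-applications', '/Interactives*', asgs.applications],
  ];

  routes.forEach(([id, path, asg], i) => {
    listener.addTargets(id, {
      priority: (i + 1) * 10, // lower number = evaluated first
      conditions: [elbv2.ListenerCondition.pathPatterns([path])],
      port: 9000, // placeholder app port
      targets: [asg],
    });
  });

  return lb;
}
```

The trade-off, roughly, is fewer pieces of infrastructure to run versus the load balancer itself remaining shared across all three routes.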

An example of where we already do this is MAPI, via the microservice CDK pattern (thanks @JamieB-gu).

We had a chat with @akash1810, and they suggested it would be good to talk to someone from AWS about which option is better suited to the amount of traffic the load balancer(s) would receive.



Examples

Performance of one app affects another

  • We receive a blast of traffic to interactives e.g.
  • This locks up the threads in DCR due to the size of the JSON being parsed in those articles
  • DCR slows down
  • ⚠️ Point of failure: DCR articles and fronts are also served slowly

Traffic to one app affects another

  • We receive a blast of traffic to fronts
  • This goes through router ➡️ frontend/facia ➡️ DCR
  • DCR slows down
  • ⚠️ Point of failure: DCR articles are also served slowly

Unnecessary scaling of services

  • Traffic to articles increases
  • frontend/article scales up
  • DCR in turn scales up
  • ⚠️ Point of failure: frontend/facia is now served by the scaled-up DCR fleet, even though it doesn't need the extra capacity

Error handling

  • We bork something on the /Article endpoint
  • This pushes 500s to frontend/article
  • This bubbles through our request pipeline
  • Cache will catch a lot of this, but we will see a larger % of traffic to origin trying to get a valid response
  • ⚠️ Point of failure: the /Front endpoint slows down due to the massive increase in traffic

@georgeblahblah
Contributor

This looks great, thanks so much for putting it together and making the problem easy to understand.

I was wondering about the "Unnecessary scaling of services" example -- my understanding here is that each service in Frontend has its own scaling group, so if the article service scales to handle increased load, that wouldn't necessarily mean facia scales in tandem.

@jamesgorrie
Contributor Author

> each service in Frontend has its own scaling group, so if the article service scales to handle increased load, that wouldn't necessarily mean facia scales in tandem.

This is correct. I was speaking mostly about DCR. e.g.

  • frontend/facia scales due to traffic
  • more traffic is sent to DCR
  • DCR as a whole scales - so frontend/article is now being served by 100 instances, where that isn't needed

It's a tenuous example, as DCR should scale to the overall load, but it means we can't scale each concern specifically to its needs.

Maybe it's a bad example. If it's still unclear, I might remove the example.
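To make that concrete, a hypothetical sketch (plain aws-cdk-lib, not our actual config): once each concern has its own ASG, each one carries its own scaling policy, so a fronts spike no longer scales the article fleet.

```ts
// Hypothetical: per-app scaling policies once the DCR stacks are split.
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

declare const articleAsg: autoscaling.AutoScalingGroup;
declare const faciaAsg: autoscaling.AutoScalingGroup;

// Each fleet tracks its own CPU target and scales independently.
articleAsg.scaleOnCpuUtilization('article-cpu-scaling', { targetUtilizationPercent: 50 });
faciaAsg.scaleOnCpuUtilization('facia-cpu-scaling', { targetUtilizationPercent: 50 });
```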

@cemms1
Contributor

cemms1 commented Aug 25, 2023

Now that the main CDK migration is done, it would be worth considering whether we want to address some of these improvements before splitting the stacks as described in this issue.

In particular, we should be using an Application Load Balancer (ALB) rather than a Classic ELB.
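Roughly, in plain aws-cdk-lib terms (placeholder names and ports; the real change would go through our CDK stack), the swap looks like this:

```ts
// Illustrative only: moving from a Classic ELB to an ALB, which is what
// enables the path-based routing discussed in this issue.
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elb from 'aws-cdk-lib/aws-elasticloadbalancing';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

declare const scope: Construct;
declare const vpc: ec2.IVpc;

// Before: Classic ELB, no listener rules or path-based routing.
const classic = new elb.LoadBalancer(scope, 'dcr-classic-elb', { vpc });
classic.addListener({ externalPort: 80 });

// After: ALB, which supports per-path listener rules and target groups.
const alb = new elbv2.ApplicationLoadBalancer(scope, 'dcr-alb', {
  vpc,
  internetFacing: false,
});
alb.addListener('http', {
  port: 80,
  defaultAction: elbv2.ListenerAction.fixedResponse(404),
});
```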

@rhiannareechaye
Contributor

Moving this ticket to the backlog (as discussed in planning we have a lot going on so we're deprioritising this for now)

@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Oct 26, 2023
ioannakok added a commit that referenced this issue Oct 31, 2023
`minCapacity` was bumped from 15 to 30 in #8724. After the fronts migration was completed, it seemed that 15 instances weren't holding up with the new traffic we were sending. That was a hotfix and there is upcoming work to split our stacks (#8351) but until we reach this point we would like to see whether we could handle the traffic with 25 instances. This change is part of the research we're doing to optimise how we scale: #9322.

Co-authored-by: George B <[email protected]>
Co-authored-by: Ravi <[email protected]>
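For context, this is the kind of setting being tuned, sketched with plain aws-cdk-lib and placeholder values (the real construct lives in the DCR CDK stack):

```ts
// Illustrative only: the ASG capacity floor being trialled at 25 instances.
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const scope: Construct;
declare const vpc: ec2.IVpc;

const dcrAsg = new autoscaling.AutoScalingGroup(scope, 'dcr-asg', {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.LARGE),
  machineImage: ec2.MachineImage.latestAmazonLinux2(),
  minCapacity: 25, // was 15, hot-fixed to 30 in #8724
  maxCapacity: 60, // placeholder
});
```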
@ioannakok ioannakok moved this from Planned to Backlog in WebX Team Nov 21, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Dec 13, 2023
@alinaboghiu alinaboghiu moved this from This Sprint to Backlog in WebX Team Jan 12, 2024
@JamieB-gu JamieB-gu removed the Epic label Oct 16, 2024