
Spin up multiple AWS stacks to stop DCR being a single source of failure in our request path #8351

Open · 4 of 6 tasks
jamesgorrie opened this issue Jul 24, 2023 · 4 comments

@jamesgorrie
Contributor

jamesgorrie commented Jul 24, 2023

Depends on #7614
Depends on #9310

The issue

DCR is a single point of failure in our infrastructure design (AWS): it serves content to multiple swimlaned micro-services (frontend/article, frontend/facia, frontend/applications), all of which are rendered via DCR's single service.

This creates a host of issues, of which we are starting to see real-world examples (illustrated below). We should address them before they become more impactful, as we start to serve more traffic to the service via the apps rendering work.

Current request flow

```mermaid
---
title: Current request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp-->DcrLB
    FEFaciaApp-->DcrLB
    FEApplicationsApp-->DcrLB
    DcrLB-->DCR
```

Solutions

Co-Authored-By: @AshCorr
Co-Authored-By: @arelra

While we could create completely new apps for each service, that would mean a lot of upfront work before addressing the immediate issue of DCR being a bottleneck.

We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:

Tasklist


As mentioned above, @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.

The ongoing work on making apps webviews available via DCR is already uncovering how we might architect the application itself towards a more micro-frontend structure, but that is out of scope for this issue.

Suggested request flow

Option 1: One LB per app

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrArticleLB-->DcrApp1
    FEFaciaApp--/Front-->DcrFaciaLB-->DcrApp2
    FEApplicationsApp--/Interactives-->DcrApplicationLB-->DcrApp3
```
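For illustration only, here is a rough sketch of what Option 1 might look like in CDK. It uses plain aws-cdk-lib constructs rather than our actual patterns, and every name, instance type and port below is a placeholder:

```ts
// Rough sketch of Option 1: one load balancer + ASG pair per rendering
// concern, so article, facia and applications traffic scale and fail
// independently. Placeholder names/values throughout.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

export class DcrRenderingStack extends Stack {
  constructor(scope: Construct, id: string, vpc: ec2.IVpc, props?: StackProps) {
    super(scope, id, props);

    for (const app of ['article', 'facia', 'applications']) {
      const asg = new autoscaling.AutoScalingGroup(this, `dcr-${app}-asg`, {
        vpc,
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.LARGE),
        machineImage: ec2.MachineImage.latestAmazonLinux2(),
        minCapacity: 3, // each app gets its own scaling floor
      });

      const lb = new elbv2.ApplicationLoadBalancer(this, `dcr-${app}-lb`, {
        vpc,
        internetFacing: false, // only the frontend apps talk to DCR
      });

      lb.addListener(`dcr-${app}-listener`, { port: 80 })
        .addTargets(`dcr-${app}-targets`, { port: 9000, targets: [asg] });
    }
  }
}
```

Splitting per concern like this means a slow /Interactives render can exhaust its own fleet without touching articles or fronts.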

Option 2: One LB for all DCR apps

```mermaid
---
title: Suggested request flow
---
graph LR
    Request-->Fastly
    Fastly-->Router
    Router-->FEArticleLB
    Router-->FEFaciaLB
    Router-->FEApplicationsLB
    subgraph Frontend
        FEArticleLB-->FEArticleApp
        FEFaciaLB-->FEFaciaApp
        FEApplicationsLB-->FEApplicationsApp
    end
    FEArticleApp--/Article-->DcrLB-->DcrArticleApp1
    FEFaciaApp--/Front-->DcrLB-->DcrFaciaApp2
    FEApplicationsApp--/Interactives-->DcrLB-->DcrApplicationsApp3
```
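And a similarly rough sketch of Option 2: one shared ALB with path-based listener rules forwarding /Article, /Front and /Interactives to separate target groups. Again this is plain aws-cdk-lib with placeholder names and ports, not our real setup, and it assumes the per-app ASGs are defined elsewhere (e.g. as in the Option 1 sketch above):

```ts
// Rough sketch of Option 2: a single shared ALB, with listener rules routing
// by path to per-app target groups backed by separate ASGs.
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

export function addSharedDcrLoadBalancer(
  scope: Construct,
  vpc: ec2.IVpc,
  asgs: {
    article: autoscaling.AutoScalingGroup;
    facia: autoscaling.AutoScalingGroup;
    applications: autoscaling.AutoScalingGroup;
  },
): elbv2.ApplicationLoadBalancer {
  const lb = new elbv2.ApplicationLoadBalancer(scope, 'dcr-lb', { vpc, internetFacing: false });

  const listener = lb.addListener('dcr-listener', {
    port: 80,
    // Requests matching no rule get a fixed 404 rather than hitting an app.
    defaultAction: elbv2.ListenerAction.fixedResponse(404),
  });

  const routes: Array<[string, string, autoscaling.AutoScalingGroup]> = [
    ['dcr-article', '/Article*', asgs.article],
    ['dcr-facia', '/Front*', asgs.facia],
    ['dcr-applications', '/Interactives*', asgs.applications],
  ];

  routes.forEach(([id, path, asg], i) => {
    listener.addTargets(id, {
      priority: (i + 1) * 10, // lower number = evaluated first
      conditions: [elbv2.ListenerCondition.pathPatterns([path])],
      port: 9000, // placeholder app port
      targets: [asg],
    });
  });

  return lb;
}
```

The trade-off, roughly, is fewer pieces of infrastructure to run versus the load balancer itself remaining shared across all three routes.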

An example of where we already do this is MAPI, via the microservice CDK pattern (thanks @JamieB-gu).

We had a chat with @akash1810, and they suggested it would be good to talk to someone from AWS about which option is better suited to the amount of traffic the load balancer(s) would receive.



Examples

Performance of one app affects another

  • We receive a blast of traffic to interactives e.g.
  • This locks up the threads in DCR due to the size of the JSON being parsed in those articles
  • DCR slows down
  • ⚠️ Point of failure: DCR articles and fronts are also served slowly

Traffic to one app affects another

  • We receive a blast of traffic to fronts
  • This goes through router ➡️ frontend/facia ➡️ DCR
  • DCR slows down
  • ⚠️ Point of failure: DCR articles are also served slowly

Unnecessary scaling of services

  • Traffic to articles increases
  • frontend/article scales up
  • DCR in turn scales up
  • ⚠️ Point of failure: frontend/facia is now served by the scaled-up DCR fleet, even though it doesn't need the extra capacity

Error handling

  • We bork something on the /Article endpoint
  • This pushes 500s to frontend/article
  • This bubbles through our request pipeline
  • Cache will catch a lot of this, but we will see a larger % of traffic to origin trying to get a valid response
  • ⚠️ Point of failure: the /Front endpoint slows down due to the massive increase in traffic

@georgeblahblah
Contributor

This looks great, thanks so much for putting it together and making the problem easy to understand.

I was wondering about the "Unnecessary scaling of services" example -- my understanding here is that each service in Frontend has its own scaling group, so if the article service scales to handle increased load, that wouldn't necessarily mean facia scales in tandem.

@jamesgorrie
Contributor Author

> each service in Frontend has its own scaling group, so if the article service scales to handle increased load, that wouldn't necessarily mean facia scales in tandem.

This is correct. I was speaking mostly about DCR. e.g.

  • frontend/facia scales due to traffic
  • more traffic is sent to DCR
  • DCR as a whole scales - so frontend/article is now being served by 100 instances, where that isn't needed

It's a tenuous example, as DCR should scale to the overall load, but it means we can't scale each concern specifically to its needs.

Maybe it's a bad example. If it's still unclear, I might remove the example.
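To make that concrete, a hypothetical sketch (plain aws-cdk-lib, not our actual config): once each concern has its own ASG, each one carries its own scaling policy, so a fronts spike no longer scales the article fleet.

```ts
// Hypothetical: per-app scaling policies once the DCR stacks are split.
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

declare const articleAsg: autoscaling.AutoScalingGroup;
declare const faciaAsg: autoscaling.AutoScalingGroup;

// Each fleet tracks its own CPU target and scales independently.
articleAsg.scaleOnCpuUtilization('article-cpu-scaling', { targetUtilizationPercent: 50 });
faciaAsg.scaleOnCpuUtilization('facia-cpu-scaling', { targetUtilizationPercent: 50 });
```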

@cemms1
Contributor

cemms1 commented Aug 25, 2023

Now that the main CDK migration is done, it would be worth considering whether we want to address some of these improvements before splitting the stacks as described in this issue.

In particular, we should be using an Application Load Balancer (ALB) rather than a Classic ELB.
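Roughly, in plain aws-cdk-lib terms (placeholder names and ports; the real change would go through our CDK stack), the swap looks like this:

```ts
// Illustrative only: moving from a Classic ELB to an ALB, which is what
// enables the path-based routing discussed in this issue.
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elb from 'aws-cdk-lib/aws-elasticloadbalancing';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

declare const scope: Construct;
declare const vpc: ec2.IVpc;

// Before: Classic ELB, no listener rules or path-based routing.
const classic = new elb.LoadBalancer(scope, 'dcr-classic-elb', { vpc });
classic.addListener({ externalPort: 80 });

// After: ALB, which supports per-path listener rules and target groups.
const alb = new elbv2.ApplicationLoadBalancer(scope, 'dcr-alb', {
  vpc,
  internetFacing: false,
});
alb.addListener('http', {
  port: 80,
  defaultAction: elbv2.ListenerAction.fixedResponse(404),
});
```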

@rhiannareechaye
Contributor

Moving this ticket to the backlog (as discussed in planning we have a lot going on so we're deprioritising this for now)

@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Oct 26, 2023
ioannakok added a commit that referenced this issue Oct 31, 2023
`minCapacity` was bumped from 15 to 30 in #8724. After the fronts migration was completed, it seemed that 15 instances weren't holding up with the new traffic we were sending. That was a hotfix and there is upcoming work to split our stacks (#8351) but until we reach this point we would like to see whether we could handle the traffic with 25 instances. This change is part of the research we're doing to optimise how we scale: #9322.

Co-authored-by: George B <[email protected]>
Co-authored-by: Ravi <[email protected]>
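For context, this is the kind of setting being tuned, sketched with plain aws-cdk-lib and placeholder values (the real construct lives in the DCR CDK stack):

```ts
// Illustrative only: the ASG capacity floor being trialled at 25 instances.
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const scope: Construct;
declare const vpc: ec2.IVpc;

const dcrAsg = new autoscaling.AutoScalingGroup(scope, 'dcr-asg', {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.LARGE),
  machineImage: ec2.MachineImage.latestAmazonLinux2(),
  minCapacity: 25, // was 15, hot-fixed to 30 in #8724
  maxCapacity: 60, // placeholder
});
```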
@ioannakok ioannakok moved this from Planned to Backlog in WebX Team Nov 21, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Dec 13, 2023
@alinaboghiu alinaboghiu moved this from This Sprint to Backlog in WebX Team Jan 12, 2024
@JamieB-gu JamieB-gu removed the Epic label Oct 16, 2024