Spin up multiple AWS stacks to stop DCR being a single source of failure in our request path #8351
Comments
This looks great, thanks so much for putting it together and making the problem easy to understand. I was wondering about the "Unnecessary scaling of services" example -- my understanding here is that each service in Frontend has its own scaling group, so if the …
This is correct. I was speaking mostly about DCR. e.g.
It's tenuous, as it should try to scale to the entire load, but it means we can't scale specifically to each service's needs. Maybe it's a bad example; if it's still unclear, I might remove it.
Now that the main CDK migration is done, it would be worth considering whether we want to address some of these improvements before splitting the stacks as described in this issue. In particular, we should be using an ALB rather than an ELB.
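For illustration, a minimal sketch of the ALB swap in plain `aws-cdk-lib` (the real migration would go through @guardian/cdk constructs; the port and health-check path below are assumptions, not the actual stack's values):

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Assumed to exist elsewhere in the stack.
declare const vpc: ec2.IVpc;
declare const renderingAsg: autoscaling.AutoScalingGroup;

// Inside the Stack's constructor: an Application Load Balancer (layer 7)
// in place of the classic ELB.
const alb = new elbv2.ApplicationLoadBalancer(this, 'RenderingAlb', {
  vpc,
  internetFacing: false, // DCR sits behind frontend, not on the public edge
});
alb.addListener('Http', { port: 80 }).addTargets('Rendering', {
  port: 9000, // assumed DCR application port
  targets: [renderingAsg],
  healthCheck: { path: '/_healthcheck' }, // hypothetical health-check path
});
```

Beyond parity with the ELB, the ALB's listener rules are what make Option 2 below possible, since they can route by path to different target groups.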
Moving this ticket to the backlog (as discussed in planning, we have a lot going on, so we're deprioritising this for now).
`minCapacity` was bumped from 15 to 30 in #8724. After the fronts migration was completed, it seemed that 15 instances weren't holding up with the new traffic we were sending. That was a hotfix, and there is upcoming work to split our stacks (#8351), but until we reach that point we would like to see whether we could handle the traffic with 25 instances. This change is part of the research we're doing to optimise how we scale: #9322.

Co-authored-by: George B <[email protected]>
Co-authored-by: Ravi <[email protected]>
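As a sketch, the capacity experiment above comes down to a single property on the auto scaling group. Plain `aws-cdk-lib` shown; the instance type, machine image, and `maxCapacity` below are assumptions for illustration, not the real stack's values:

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

declare const vpc: ec2.IVpc; // assumed to exist in the stack

// minCapacity was 15, hotfixed to 30 in #8724; this trials 25 as part of
// the scaling research in #9322.
const renderingAsg = new autoscaling.AutoScalingGroup(this, 'RenderingAsg', {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.LARGE), // assumed
  machineImage: ec2.MachineImage.latestAmazonLinux2(), // assumed
  minCapacity: 25,
  maxCapacity: 60, // assumed ceiling
});
```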
Depends on #7614
Depends on #9310
The issue
DCR as a service in our infrastructure design (AWS) is a single point of failure, as it serves content to multiple other swimlaned micro-services. That is:

- `frontend/article` (and LiveBlogs)
- `frontend/facia`
- `frontend/applications`
- Interactives

are all served via DCR's single service.
This creates a host of issues, real-world examples of which we are already starting to see (illustrated below). We should address them before they become more impactful as the apps-rendering work sends more traffic to the service.
Current request flow
Solutions
Co-Authored-By: @AshCorr
Co-Authored-By: @arelra
While we could create completely new apps for each service, that would mean a lot of upfront work just to address the immediate issue of DCR being a bottleneck.
We have suggested that we stick to splitting the infrastructure first to mitigate the risks there. This would include:
Tasklist
As mentioned above, @AshCorr has suggested that we move to CDK (#7614) first to make this easier and more seamless.
The ongoing work to make apps webviews available via DCR is already uncovering how we might architect the application itself towards a more micro-frontend structure, but that is out of scope for this issue.
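Conceptually, the infrastructure split boils down to instantiating one rendering stack per app from a single CDK entry point. A rough sketch with plain `aws-cdk-lib` (stack names, prop names, and the set of apps are entirely hypothetical; the real code would likely build on @guardian/cdk):

```ts
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Hypothetical: one reusable stack definition, instantiated per rendering
// app, so each app gets its own load balancer, ASG, and scaling policies.
interface RenderingStackProps extends StackProps {
  renderingApp: 'article' | 'facia' | 'interactive';
}

class RenderingStack extends Stack {
  constructor(scope: Construct, id: string, props: RenderingStackProps) {
    super(scope, id, props);
    // ...provision the LB/ASG/alarms for props.renderingApp here...
  }
}

const app = new App();
new RenderingStack(app, 'ArticleRendering', { renderingApp: 'article' });
new RenderingStack(app, 'FaciaRendering', { renderingApp: 'facia' });
new RenderingStack(app, 'InteractiveRendering', { renderingApp: 'interactive' });
```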
Suggested request flow
Option 1: One LB per app
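A minimal sketch of what Option 1 could look like inside each per-app stack, again in plain `aws-cdk-lib` (all names and ports are assumptions):

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Assumed to be defined elsewhere in the per-app (e.g. article) stack.
declare const vpc: ec2.IVpc;
declare const articleAsg: autoscaling.AutoScalingGroup;

// Option 1: each rendering app's stack owns a dedicated ALB, so
// frontend/article only ever talks to the article load balancer.
const articleAlb = new elbv2.ApplicationLoadBalancer(this, 'ArticleAlb', {
  vpc,
  internetFacing: false,
});
articleAlb.addListener('Http', { port: 80 }).addTargets('Article', {
  port: 9000, // assumed DCR application port
  targets: [articleAsg],
});
```

This gives the strongest isolation: an overloaded article LB cannot affect facia's request path at all, at the cost of more infrastructure to run and monitor.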
Option 2: One LB for all DCR apps
An example of where else we do this is in MAPI via microservice CDK (thanks @JamieB-gu)
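Sketching Option 2 under the same assumptions: a single shared ALB whose listener rules route by path to one target group (and scaling group) per app. The path patterns below are illustrative:

```ts
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Assumed: one shared listener and one ASG per rendering app.
declare const listener: elbv2.ApplicationListener;
declare const articleAsg: autoscaling.AutoScalingGroup;
declare const faciaAsg: autoscaling.AutoScalingGroup;

// Option 2: one LB for all DCR apps; listener rules route by path
// to a separate target group per app.
listener.addTargets('Article', {
  priority: 10,
  conditions: [elbv2.ListenerCondition.pathPatterns(['/Article*'])], // assumed path shape
  port: 9000,
  targets: [articleAsg],
});
listener.addTargets('Facia', {
  priority: 20,
  conditions: [elbv2.ListenerCondition.pathPatterns(['/Front*'])], // assumed path shape
  port: 9000,
  targets: [faciaAsg],
});
```

The trade-off is fewer moving parts in exchange for a shared blast radius at the load-balancer layer.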
We had a chat with @akash1810 and they suggested it would be good to talk to someone from AWS to discuss which option would be better to go with, considering the amount of traffic the load balancer(s) would receive.
Depends on #9310
Examples
Performance of one app affects another

Traffic to one app affects another

- `router` ➡️ `frontend/facia` ➡️ DCR

Unnecessary scaling of services

- `frontend/article` scales up
- `frontend/facia` is now running at the new scaled-up version

Error handling

- `/Article` endpoint
- `frontend/article`
- `/Front` endpoint slows down due to a massive increase in traffic
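To make the "Unnecessary scaling of services" example concrete: once the stacks are split, each app's ASG can track its own signal, so a spike on articles no longer drags facia's capacity along with it. A sketch with illustrative thresholds:

```ts
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

// Assumed: one ASG per rendering app after the split.
declare const articleAsg: autoscaling.AutoScalingGroup;
declare const faciaAsg: autoscaling.AutoScalingGroup;

// Independent target-tracking policies: a traffic spike on /Article
// scales only articleAsg; faciaAsg keeps its own policy and capacity.
articleAsg.scaleOnCpuUtilization('ArticleCpu', { targetUtilizationPercent: 50 });
faciaAsg.scaleOnCpuUtilization('FaciaCpu', { targetUtilizationPercent: 50 });
```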