Review DCR's ASG Configuration #8345

Closed
6 tasks done
JamieB-gu opened this issue Jul 21, 2023 · 10 comments

@JamieB-gu
Contributor

JamieB-gu commented Jul 21, 2023

Now that we're serving fronts traffic, it might be a good time to review our ASG configuration to establish whether we're:

  1. Using the best instance types for our use-case
  2. Scaled to an appropriate count, with adequate min and max counts that allow for RiffRaff deploys
  3. Applying the best scaling policy for our needs

Suggestions derived from conversations with @alinaboghiu and @jacobwinch.

Tasks

[Task list not preserved: six completed sub-tasks, assigned among @alinaboghiu, @arelra, @georgeblahblah, @ioannakok and @cemms1.]

Instance Types

We're currently using t4g.small in production:

CODE:
  InstanceType: t4g.micro
PROD:
  InstanceType: t4g.small

Do we want to consider other instance types? Frontend uses a different type, as mentioned in #7440.

ASG Size

We're currently using a minimum of 15 and a maximum of 60 instances in production:

StageMap:
  PROD:
    MinCapacity: 15
    MaxCapacity: 60
  CODE:
    MinCapacity: 1
    MaxCapacity: 4

Do these limits allow enough headroom for our additional traffic, scaling requirements, and RiffRaff deploys?

Scaling Policy

We're currently scaling based on latency:

LatencyScalingAlarm:
  Condition: HasLatencyScalingAlarm
  Properties:
    AlarmDescription: !Sub |
      Scale-Up if latency is greater than 0.2 seconds over 1 period(s) of 60 seconds
    Dimensions:
      - Name: LoadBalancerName
        Value: !Ref InternalLoadBalancer
    EvaluationPeriods: '1'
    MetricName: Latency
    Namespace: AWS/ELB
    Period: '60'
    Statistic: Average
    Threshold: '0.2'
    ComparisonOperator: GreaterThanOrEqualToThreshold
    OKActions:
      - !Ref ScaleDownPolicy
    AlarmActions:
      - !Ref ScaleUpPolicy
  Type: AWS::CloudWatch::Alarm

and scaling up by doubling our capacity every 10 minutes:

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: PercentChangeInCapacity
    AutoScalingGroupName: !Ref AutoscalingGroup
    Cooldown: '600'
    ScalingAdjustment: '100'

whilst scaling down by removing an instance once every 2 minutes:

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref AutoscalingGroup
    Cooldown: '120'
    ScalingAdjustment: '-1'

Do we want to consider other scaling strategies? Apps-rendering, for example, via guardian/cdk, scales based on a target CPU utilisation:

asg.scaleOnCpuUtilization('CpuScalingPolicy', {
  targetUtilizationPercent: scalingTargetCpuUtilisation,
});
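For comparison, here's a rough, self-contained sketch of the same approach using plain aws-cdk-lib (which guardian/cdk wraps). The stack, VPC, AMI, capacities and the 50% target below are illustrative placeholders, not DCR's actual configuration:

// A minimal sketch of target-tracking CPU scaling with aws-cdk-lib.
// Everything below is a placeholder rather than DCR's real setup.
import { App, Stack } from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const app = new App();
const stack = new Stack(app, 'ExampleRenderingStack');
const vpc = new ec2.Vpc(stack, 'Vpc');

const asg = new autoscaling.AutoScalingGroup(stack, 'RenderingAsg', {
  vpc,
  instanceType: new ec2.InstanceType('t4g.small'),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 15,
  maxCapacity: 60,
});

// Target tracking: AWS creates the alarms and adds or removes instances to
// keep average CPU utilisation near the target.
asg.scaleOnCpuUtilization('CpuScalingPolicy', {
  targetUtilizationPercent: 50,
});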

@akash1810 recommends completing a test before the Christmas holidays and deciding what to do based on the results.

@ParisaTork
Contributor

ParisaTork commented Jul 24, 2023

Instance Types
T4G instances are suitable for workloads that don't use the full CPU often or consistently. Given the recent spikes in traffic (I don't know what DCR's CPU utilisation was at the time of the incidents, but Facia's reached 100% (guardian/frontend#26336 (comment))), would it be worth moving to C7G instances?

ASG Size
Do our instances scale down quickly once the cache is populated, to reduce costs?

Scaling Policy
Do we have an SLA that explicitly sets out expected performance metrics? This would help us determine whether our scaling strategy should be based on latency, CPU utilisation, network I/O, etc. Also, is it worth having some scheduled scaling between 12 and 1, when our traffic tends to peak (https://docs.google.com/document/d/1heaaDCNJ45uaZRsIUgJxdOc9vz3z5eIzf4sIMGJ4X44/edit?usp=sharing), or is it more cost-effective and performant to have dynamic scaling all the time?
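On the scheduled-scaling question, a rough sketch of what that could look like with aws-cdk-lib's scaleOnSchedule, assuming an existing AutoScalingGroup; the times (UTC) and capacities below are placeholders rather than agreed values:

// Hypothetical helper: raise the ASG's floor ahead of the midday peak and
// lower it again afterwards, leaving dynamic scaling to handle the rest.
import { Schedule } from 'aws-cdk-lib/aws-autoscaling';
import type { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';

function addLunchPeakSchedule(asg: AutoScalingGroup): void {
  // Placeholder: raise the minimum capacity shortly before the peak (UTC).
  asg.scaleOnSchedule('ScaleUpBeforeLunchPeak', {
    schedule: Schedule.cron({ hour: '11', minute: '45' }),
    minCapacity: 30,
  });

  // Placeholder: drop the floor back down once the peak has passed.
  asg.scaleOnSchedule('ScaleDownAfterLunchPeak', {
    schedule: Schedule.cron({ hour: '13', minute: '15' }),
    minCapacity: 15,
  });
}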

@jamesgorrie
Contributor

Something that I think is also worth adding to this discussion is how we might have scaling work in lockstep with frontend.

Our current setup, with the code not colocated, makes it seem as though there is no coupling between this AWS configuration and frontend's AWS configuration.

I'd suggest there is strong coupling. If we scale up the frontend/facia app, we will send more traffic to dcr. We should account for this, as failing at the dcr level will cause just as many issues, especially given its current "I serve HTML for all the things" setup.

We should probably also consider that we're planning to be able to scale up separately across endpoints in dcr soon, to avoid that single point of failure.

@jacobwinch
Contributor

Thanks for writing this up @JamieB-gu!

> Applying the best scaling policy for our needs

IIUC this app (and other Dotcom apps) is currently using simple scaling [1]. AWS have essentially deprecated this style of scaling and are promoting their alternative solutions as best practice:

[Screenshot of AWS documentation describing the alternative scaling policy types.]

I would recommend experimenting with these alternative strategies (regardless of which metric you decide to base your scaling on) if you have some time.

Footnotes

  1. It's a bit tricky to see this, but simple scaling is the default, so if PolicyType is unspecified, you're using SimpleScaling.
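For illustration, here's a sketch of one of those alternatives (step scaling) driven by the same latency metric, again via aws-cdk-lib. The thresholds, adjustment sizes and the omission of the load balancer dimension are placeholders for discussion, not a proposed configuration:

// Hypothetical step-scaling sketch: the adjustment size depends on how far
// latency has drifted from the comfortable range, and CDK generates the
// CloudWatch alarms (PolicyType: StepScaling) for each interval.
import { Duration } from 'aws-cdk-lib';
import { AdjustmentType } from 'aws-cdk-lib/aws-autoscaling';
import type { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';

function addLatencyStepScaling(asg: AutoScalingGroup): void {
  const latency = new Metric({
    namespace: 'AWS/ELB',
    metricName: 'Latency',
    statistic: 'Average',
    period: Duration.minutes(1),
  });

  asg.scaleOnMetric('LatencyStepScaling', {
    metric: latency,
    adjustmentType: AdjustmentType.PERCENT_CHANGE_IN_CAPACITY,
    scalingSteps: [
      { upper: 0.2, change: -10 },  // comfortably below target: trim 10%
      { lower: 0.3, change: +30 },  // slightly above: add 30%
      { lower: 0.6, change: +100 }, // well above: double
    ],
  });
}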

@AshCorr
Member

AshCorr commented Jul 24, 2023

Worth noting that even though we're paying for t4g.small instances with 2 vCPUs in PROD, I'm not convinced we actually use the 2nd CPU for DCR. Due to Node's single-threaded nature it will only use one of the CPUs; usually something like PM2 would create multiple Node instances, one per CPU, to fix that, but I don't think we've configured it to do that in our case.

@arelra
Member

arelra commented Jul 24, 2023

> Worth noting that even though we're paying for t4g.small instances with 2 vCPUs in PROD, I'm not convinced we actually use the 2nd CPU for DCR. Due to Node's single-threaded nature it will only use one of the CPUs; usually something like PM2 would create multiple Node instances, one per CPU, to fix that, but I don't think we've configured it to do that in our case.

@AshCorr Snap. I'm looking at configuring clustering
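For anyone following along, a minimal sketch of what clustering could look like using Node's built-in cluster module (PM2's cluster mode does the equivalent); the HTTP handler and port are placeholders for DCR's real Express server:

// Hypothetical clustering sketch: fork one worker per vCPU so a t4g.small's
// second core is actually used. Handler and port are placeholders.
import cluster from 'node:cluster';
import http from 'node:http';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  for (let i = 0; i < cpus().length; i += 1) {
    cluster.fork();
  }

  // Replace any worker that dies so capacity isn't silently lost.
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} exited; forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs its own server; the primary distributes connections.
  http
    .createServer((_req, res) => {
      res.end(`Rendered by worker ${process.pid}\n`);
    })
    .listen(9000);
}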

@jamesgorrie
Contributor

> We should probably also consider that we're planning to be able to scale up separately across endpoints in dcr soon, to avoid that single point of failure.

Talking to myself: there is an issue that speaks to this, and it might be considered in conjunction with this one, as we can think of scaling as domain- and traffic-specific to those parts of the site.

@github-project-automation github-project-automation bot moved this to Backlog in WebX Team Aug 3, 2023
@jamesgorrie jamesgorrie moved this from Backlog to Planned in WebX Team Aug 3, 2023
@VDuczekW VDuczekW moved this from Planned to Backlog in WebX Team Aug 14, 2023
@rhiannareechaye rhiannareechaye moved this from Backlog to Triage in WebX Team Aug 21, 2023
@rhiannareechaye
Contributor

Hey @jamesgorrie, I've moved this into the triage column. What do you think the next step is for this piece of work? Do you think it should be prioritised as 'high impact'?

@rhiannareechaye
Contributor

Moving to the backlog. Please note this isn't in our list for planning for the near future; please shout if you disagree, though (and feel free to add it to the list to be discussed in planning: https://docs.google.com/document/d/1-ls95KamOB-lvwKzTUfqpd3gSwcvgszC7fsMIOhXacM/edit).

@rhiannareechaye rhiannareechaye moved this from Triage to Backlog in WebX Team Aug 29, 2023
@ioannakok ioannakok moved this from Backlog to Triage in WebX Team Sep 25, 2023
@rhiannareechaye
Contributor

We'll be prioritising PM2/Vulnerabilities at the moment, and working out a way to surface this on a backlog (we'll try to trial a new approach to health in the new quarter).

@rhiannareechaye rhiannareechaye moved this from Triage to Backlog in WebX Team Sep 26, 2023
@JamieB-gu JamieB-gu added the Epic label Sep 26, 2023
@cemms1 cemms1 added this to the Health milestone Oct 4, 2023
@cemms1 cemms1 removed the Health label Oct 4, 2023
@alinaboghiu alinaboghiu removed this from the Health milestone Oct 26, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Oct 26, 2023
@ioannakok ioannakok moved this from Planned to In progress in WebX Team Oct 30, 2023

@ioannakok ioannakok moved this from In progress to Planned in WebX Team Nov 7, 2023
@ioannakok ioannakok moved this from Planned to Backlog in WebX Team Nov 21, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Dec 13, 2023
@alinaboghiu alinaboghiu moved this from This Sprint to Backlog in WebX Team Jan 12, 2024
@JamieB-gu JamieB-gu moved this from Backlog to Next Sprint in WebX Team Jul 9, 2024
@JamieB-gu JamieB-gu moved this from Next Sprint to This Sprint in WebX Team Jul 9, 2024
@JamieB-gu JamieB-gu moved this from This Sprint to In Progress in WebX Team Jul 12, 2024
@arelra arelra closed this as completed Jul 22, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in WebX Team Jul 22, 2024