Review DCR's ASG Configuration #8345

Closed
6 tasks done
JamieB-gu opened this issue Jul 21, 2023 · 10 comments

@JamieB-gu
Contributor

JamieB-gu commented Jul 21, 2023

Now that we're serving fronts traffic, it might be a good time to review our ASG configuration to establish whether we're:

  1. Using the best instance types for our use-case
  2. Scaled to an appropriate count, with adequate min and max counts that allow for RiffRaff deploys
  3. Applying the best scaling policy for our needs

Suggestions derived from conversations with @alinaboghiu and @jacobwinch.

Tasks

[Task list not preserved: six completed sub-tasks, assigned among @alinaboghiu, @arelra, @georgeblahblah, @ioannakok and @cemms1.]

Instance Types

We're currently using t4g.small in production:

CODE:
  InstanceType: t4g.micro
PROD:
  InstanceType: t4g.small

Do we want to consider other instance types? Frontend uses a different type, as mentioned in #7440.

ASG Size

We're currently using a minimum of 15 and a maximum of 60 instances in production:

StageMap:
  PROD:
    MinCapacity: 15
    MaxCapacity: 60
  CODE:
    MinCapacity: 1
    MaxCapacity: 4

Do these limits allow enough headroom for our additional traffic, scaling requirements, and RiffRaff deploys?

Scaling Policy

We're currently scaling based on latency:

LatencyScalingAlarm:
  Condition: HasLatencyScalingAlarm
  Properties:
    AlarmDescription: !Sub |
      Scale-Up if latency is greater than 0.2 seconds over 1 period(s) of 60 seconds
    Dimensions:
      - Name: LoadBalancerName
        Value: !Ref InternalLoadBalancer
    EvaluationPeriods: '1'
    MetricName: Latency
    Namespace: AWS/ELB
    Period: '60'
    Statistic: Average
    Threshold: '0.2'
    ComparisonOperator: GreaterThanOrEqualToThreshold
    OKActions:
      - !Ref ScaleDownPolicy
    AlarmActions:
      - !Ref ScaleUpPolicy
  Type: AWS::CloudWatch::Alarm

and scaling up by doubling our capacity every 10 minutes:

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: PercentChangeInCapacity
    AutoScalingGroupName: !Ref AutoscalingGroup
    Cooldown: '600'
    ScalingAdjustment: '100'

whilst scaling down by removing an instance once every 2 minutes:

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref AutoscalingGroup
    Cooldown: '120'
    ScalingAdjustment: '-1'

Do we want to consider other scaling strategies? Apps-rendering, for example, via guardian/cdk, scales based on a target CPU utilisation:

asg.scaleOnCpuUtilization('CpuScalingPolicy', {
  targetUtilizationPercent: scalingTargetCpuUtilisation,
});
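For comparison, here's a rough, self-contained sketch of the same approach using plain aws-cdk-lib (which guardian/cdk wraps). The stack, VPC, AMI, capacities and the 50% target below are illustrative placeholders, not DCR's actual configuration:

// A minimal sketch of target-tracking CPU scaling with aws-cdk-lib.
// Everything below is a placeholder rather than DCR's real setup.
import { App, Stack } from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const app = new App();
const stack = new Stack(app, 'ExampleRenderingStack');
const vpc = new ec2.Vpc(stack, 'Vpc');

const asg = new autoscaling.AutoScalingGroup(stack, 'RenderingAsg', {
  vpc,
  instanceType: new ec2.InstanceType('t4g.small'),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 15,
  maxCapacity: 60,
});

// Target tracking: AWS creates the alarms and adds or removes instances to
// keep average CPU utilisation near the target.
asg.scaleOnCpuUtilization('CpuScalingPolicy', {
  targetUtilizationPercent: 50,
});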

@akash1810 recommends completing a test before the Christmas holidays and deciding what to do based on the results.

@ParisaTork
Contributor

ParisaTork commented Jul 24, 2023

Instance Types
T4G instances are suitable for workloads that don't use the full CPU often or consistently. Given the recent spikes in traffic (I don't know what DCR's CPU utilisation was at the time of the incidents, but Facia's reached 100% (guardian/frontend#26336 (comment))), would it be worth moving to C7G instances?

ASG Size
Do our instances scale down quickly once the cache is populated, to reduce costs?

Scaling Policy
Do we have an SLA that explicitly sets out expected performance metrics? This would help us determine whether our scaling strategy should be based on latency, CPU utilisation, network I/O, etc. Also, is it worth having some scheduled scaling between 12 and 1, when our traffic tends to peak (https://docs.google.com/document/d/1heaaDCNJ45uaZRsIUgJxdOc9vz3z5eIzf4sIMGJ4X44/edit?usp=sharing), or is it more cost-effective and performant to have dynamic scaling all the time?
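On the scheduled-scaling question, a rough sketch of what that could look like with aws-cdk-lib's scaleOnSchedule, assuming an existing AutoScalingGroup; the times (UTC) and capacities below are placeholders rather than agreed values:

// Hypothetical helper: raise the ASG's floor ahead of the midday peak and
// lower it again afterwards, leaving dynamic scaling to handle the rest.
import { Schedule } from 'aws-cdk-lib/aws-autoscaling';
import type { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';

function addLunchPeakSchedule(asg: AutoScalingGroup): void {
  // Placeholder: raise the minimum capacity shortly before the peak (UTC).
  asg.scaleOnSchedule('ScaleUpBeforeLunchPeak', {
    schedule: Schedule.cron({ hour: '11', minute: '45' }),
    minCapacity: 30,
  });

  // Placeholder: drop the floor back down once the peak has passed.
  asg.scaleOnSchedule('ScaleDownAfterLunchPeak', {
    schedule: Schedule.cron({ hour: '13', minute: '15' }),
    minCapacity: 15,
  });
}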

@jamesgorrie
Contributor

Something that I think is also worth adding to this discussion is how we might have scaling work in lockstep with frontend.

Our current setup, with the code not colocated, makes it seem as though there is no coupling between this AWS configuration and frontend's AWS configuration.

I'd suggest there is strong coupling. If we scale up the frontend/facia app, we will send more traffic to dcr. We should account for this, as failing at the dcr level will cause just as many issues, especially given its current "I serve HTML for all the things" setup.

We should probably also consider that we're planning to be able to scale up separately across endpoints in dcr soon, to avoid that single point of failure.

@jacobwinch
Contributor

Thanks for writing this up @JamieB-gu!

> Applying the best scaling policy for our needs

IIUC this app (and other Dotcom apps) is currently using simple scaling [1]. AWS have essentially deprecated this style of scaling and are promoting their alternative solutions as best practice:

[Screenshot of AWS documentation describing the alternative scaling policy types.]

I would recommend experimenting with these alternative strategies (regardless of which metric you decide to base your scaling on) if you have some time.

Footnotes

  1. It's a bit tricky to see this, but simple scaling is the default, so if PolicyType is unspecified, you're using SimpleScaling.
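For illustration, here's a sketch of one of those alternatives (step scaling) driven by the same latency metric, again via aws-cdk-lib. The thresholds, adjustment sizes and the omission of the load balancer dimension are placeholders for discussion, not a proposed configuration:

// Hypothetical step-scaling sketch: the adjustment size depends on how far
// latency has drifted from the comfortable range, and CDK generates the
// CloudWatch alarms (PolicyType: StepScaling) for each interval.
import { Duration } from 'aws-cdk-lib';
import { AdjustmentType } from 'aws-cdk-lib/aws-autoscaling';
import type { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';

function addLatencyStepScaling(asg: AutoScalingGroup): void {
  const latency = new Metric({
    namespace: 'AWS/ELB',
    metricName: 'Latency',
    statistic: 'Average',
    period: Duration.minutes(1),
  });

  asg.scaleOnMetric('LatencyStepScaling', {
    metric: latency,
    adjustmentType: AdjustmentType.PERCENT_CHANGE_IN_CAPACITY,
    scalingSteps: [
      { upper: 0.2, change: -10 },  // comfortably below target: trim 10%
      { lower: 0.3, change: +30 },  // slightly above: add 30%
      { lower: 0.6, change: +100 }, // well above: double
    ],
  });
}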

@AshCorr
Member

AshCorr commented Jul 24, 2023

Worth noting that even though we're paying for t4g.small instances with 2 vCPUs in PROD, I'm not convinced we actually use the 2nd CPU for DCR. Due to Node's single-threaded nature it will only use one of the CPUs; usually something like PM2 would create multiple Node instances, one per CPU, to fix that, but I don't think we've configured it to do that in our case.

@arelra
Member

arelra commented Jul 24, 2023

> Worth noting that even though we're paying for t4g.small instances with 2 vCPUs in PROD, I'm not convinced we actually use the 2nd CPU for DCR. Due to Node's single-threaded nature it will only use one of the CPUs; usually something like PM2 would create multiple Node instances, one per CPU, to fix that, but I don't think we've configured it to do that in our case.

@AshCorr Snap. I'm looking at configuring clustering
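For anyone following along, a minimal sketch of what clustering could look like using Node's built-in cluster module (PM2's cluster mode does the equivalent); the HTTP handler and port are placeholders for DCR's real Express server:

// Hypothetical clustering sketch: fork one worker per vCPU so a t4g.small's
// second core is actually used. Handler and port are placeholders.
import cluster from 'node:cluster';
import http from 'node:http';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  for (let i = 0; i < cpus().length; i += 1) {
    cluster.fork();
  }

  // Replace any worker that dies so capacity isn't silently lost.
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} exited; forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs its own server; the primary distributes connections.
  http
    .createServer((_req, res) => {
      res.end(`Rendered by worker ${process.pid}\n`);
    })
    .listen(9000);
}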

@jamesgorrie
Contributor

> We should probably also consider that we're planning to be able to scale up separately across endpoints in dcr soon, to avoid that single point of failure.

Talking to myself: there is an issue that speaks to this, and it might be considered in conjunction with this one, as we can think of scaling as domain- and traffic-specific to those parts of the site.

@github-project-automation github-project-automation bot moved this to Backlog in WebX Team Aug 3, 2023
@jamesgorrie jamesgorrie moved this from Backlog to Planned in WebX Team Aug 3, 2023
@VDuczekW VDuczekW moved this from Planned to Backlog in WebX Team Aug 14, 2023
@rhiannareechaye rhiannareechaye moved this from Backlog to Triage in WebX Team Aug 21, 2023
@rhiannareechaye
Contributor

Hey @jamesgorrie, I've moved this into the triage column. What do you think the next step is for this piece of work? Do you think it should be prioritised as 'high impact'?

@rhiannareechaye
Contributor

Moving to the backlog. Please note this isn't in our list for planning for the near future; please shout if you disagree, though (and feel free to add it to the list to be discussed in planning: https://docs.google.com/document/d/1-ls95KamOB-lvwKzTUfqpd3gSwcvgszC7fsMIOhXacM/edit).

@rhiannareechaye rhiannareechaye moved this from Triage to Backlog in WebX Team Aug 29, 2023
@ioannakok ioannakok moved this from Backlog to Triage in WebX Team Sep 25, 2023
@rhiannareechaye
Contributor

We'll be prioritising PM2/Vulnerabilities at the moment, and working out a way to surface this on a backlog (we'll try to trial a new approach to health in the new quarter).

@rhiannareechaye rhiannareechaye moved this from Triage to Backlog in WebX Team Sep 26, 2023
@JamieB-gu JamieB-gu added the Epic label Sep 26, 2023
@cemms1 cemms1 added this to the Health milestone Oct 4, 2023
@cemms1 cemms1 removed the Health label Oct 4, 2023
@alinaboghiu alinaboghiu removed this from the Health milestone Oct 26, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Oct 26, 2023
@ioannakok ioannakok moved this from Planned to In progress in WebX Team Oct 30, 2023

@ioannakok ioannakok moved this from In progress to Planned in WebX Team Nov 7, 2023
@ioannakok ioannakok moved this from Planned to Backlog in WebX Team Nov 21, 2023
@alinaboghiu alinaboghiu moved this from Backlog to Planned in WebX Team Dec 13, 2023
@alinaboghiu alinaboghiu moved this from This Sprint to Backlog in WebX Team Jan 12, 2024
@JamieB-gu JamieB-gu moved this from Backlog to Next Sprint in WebX Team Jul 9, 2024
@JamieB-gu JamieB-gu moved this from Next Sprint to This Sprint in WebX Team Jul 9, 2024
@JamieB-gu JamieB-gu moved this from This Sprint to In Progress in WebX Team Jul 12, 2024
@arelra arelra closed this as completed Jul 22, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in WebX Team Jul 22, 2024