Decrease `minCapacity` of instances to 27 #9369
Conversation
Size Change: 0 B. Total Size: 697 kB.
The branch was force-pushed from 1367910 to 648d1cd.
Any thoughts from @guardian/devx-reliability on this?
🎛️
I agree that it makes sense to experiment with the minimum capacity here 👍 IIUC we are unsure exactly how low we can set the minimum without impacting users? However, we know that the cost of deploying this type of change is very low, so I would be tempted to drop the capacity in groups of 3 and monitor at each stage using the metrics that you've suggested (stopping somewhere before we get to 15, where there are known problems!). I think it'd also be worth documenting the cost saving per month that we achieve for each capacity reduction to help to prioritise how far we want to push this experimentation.
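As a rough, hypothetical illustration of how the per-step saving could be documented, here is a back-of-the-envelope calculation in TypeScript; the hourly rate below is a placeholder, not DCR's real instance pricing.

```ts
// Rough sketch of the saving from removing one "step" of 3 instances.
// The hourly price is a placeholder; substitute the actual on-demand
// (or savings-plan) rate for the instance type DCR runs on.
const hourlyPricePerInstance = 0.10; // USD per instance-hour (placeholder)
const hoursPerMonth = 730; // average hours in a month
const instancesRemovedPerStep = 3;

const monthlySavingPerStep =
  instancesRemovedPerStep * hourlyPricePerInstance * hoursPerMonth;

console.log(
  `Removing ${instancesRemovedPerStep} instances saves roughly $${monthlySavingPerStep.toFixed(2)} per month`
);
```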
@jacobwinch made a good point in the review about dropping the capacity in groups of 3 and monitoring at each stage. It's good to modify ASG capacity in groups of 3 so that AWS can evenly distribute across Availability Zones (https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-benefits.html#arch-AutoScalingMultiAZ). #9369 (comment) Co-authored-by: Jacob Winch <[email protected]>
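For illustration, here is a minimal sketch of what keeping `minCapacity` as a multiple of 3 looks like in plain `aws-cdk-lib` TypeScript. This is not the actual DCR stack; the VPC, instance type, AMI and `maxCapacity` here are placeholders.

```ts
import { App, Stack } from "aws-cdk-lib";
import { AutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";
import { InstanceType, MachineImage, Vpc } from "aws-cdk-lib/aws-ec2";

const app = new App();
const stack = new Stack(app, "RenderingStack");

// Placeholder VPC; the real stack resolves its network differently.
const vpc = new Vpc(stack, "Vpc", { maxAzs: 3 });

new AutoScalingGroup(stack, "Rendering", {
  vpc,
  instanceType: new InstanceType("t4g.medium"), // placeholder instance type
  machineImage: MachineImage.latestAmazonLinux2(), // placeholder AMI
  // Keeping minCapacity a multiple of 3 lets AWS spread instances evenly
  // across three Availability Zones, as described in the linked AWS docs.
  minCapacity: 27,
  maxCapacity: 60, // placeholder ceiling
});

app.synth();
```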
`minCapacity` was bumped from 15 to 30 in #8724. After the fronts migration was completed, it seemed that 15 instances weren't holding up under the new traffic we were sending. That was a hotfix, and there is upcoming work to split our stacks (#8351), but until we reach that point we would like to see whether we can handle the traffic with 25 instances. This change is part of the research we're doing to optimise how we scale: #9322. Co-authored-by: George B <[email protected]> Co-authored-by: Ravi <[email protected]>
The branch was force-pushed from 7c20146 to 5eb07b1.
The PR title was changed from "Decrease `minCapacity` of instances to 25" to "Decrease `minCapacity` of instances to 27".
Thanks for the review @jacobwinch! That all makes perfect sense 👍 I have set `minCapacity` to 27.
After this change we've noticed:
We will check again at the end of the month and compare with October to have a better view of the impact.
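For the end-of-month comparison, one option is to pull the monthly EC2 spend programmatically. Below is a sketch using the AWS SDK v3 Cost Explorer client; the dates are illustrative and the filter is an account-wide EC2 filter rather than anything DCR-specific.

```ts
import {
  CostExplorerClient,
  GetCostAndUsageCommand,
} from "@aws-sdk/client-cost-explorer";

// Cost Explorer is served from us-east-1.
const client = new CostExplorerClient({ region: "us-east-1" });

// Compare unblended EC2 cost across two months (illustrative dates).
const command = new GetCostAndUsageCommand({
  TimePeriod: { Start: "2023-10-01", End: "2023-12-01" },
  Granularity: "MONTHLY",
  Metrics: ["UnblendedCost"],
  Filter: {
    Dimensions: {
      Key: "SERVICE",
      Values: ["Amazon Elastic Compute Cloud - Compute"],
    },
  },
});

const result = await client.send(command);
for (const period of result.ResultsByTime ?? []) {
  console.log(period.TimePeriod?.Start, period.Total?.UnblendedCost?.Amount);
}
```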
What does this change?
Reduces DCR `minCapacity` to 27.
Why?
`minCapacity` was bumped from 15 to 30 in #8724. After the fronts migration was completed, it seemed that 15 instances weren't holding up under the new traffic. Bumping instances to 30 was a hotfix and there is upcoming work to split our stacks. Until we reach that point we would like to see whether we can handle the traffic with 27 instances. This change is a first step in the research we're doing into how to optimise DCR scaling (#9322).
See the WIP document on our research and findings so far.
Metrics to keep an eye on
After the fronts migration and with DCR having a
minCapacity
of 15 @guardian/devx-reliability team had to lower the SLO target forfacia
app in https://github.com/guardian/slo-alerts/pull/37 after ongoing issues with DCR latency. After increasingminCapacity
to 30 they were able to restore SLO to the initial target. We can monitor the impact of reducingminCapacity
to 25 by checking the following dashboards:
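Alongside those dashboards, a quick way to spot-check latency is to query the load balancer's `TargetResponseTime` metric directly. This is a sketch using the AWS SDK v3 CloudWatch client; the region and the `LoadBalancer` dimension value are placeholders, not DCR's real resources.

```ts
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const client = new CloudWatchClient({ region: "eu-west-1" }); // placeholder region

// p90 target response time for the rendering ALB over the last 24 hours.
// "app/rendering/0123456789abcdef" is a placeholder LoadBalancer dimension.
const command = new GetMetricStatisticsCommand({
  Namespace: "AWS/ApplicationELB",
  MetricName: "TargetResponseTime",
  Dimensions: [
    { Name: "LoadBalancer", Value: "app/rendering/0123456789abcdef" },
  ],
  StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  EndTime: new Date(),
  Period: 300, // 5-minute buckets
  ExtendedStatistics: ["p90"],
});

const { Datapoints } = await client.send(command);
for (const point of Datapoints ?? []) {
  console.log(point.Timestamp, point.ExtendedStatistics?.p90);
}
```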