Resiliency to a single degraded Availability Zone #939
Comments
Would like to add that we have seen customer cases where similar failures were caused by degraded disk performance. |
We were looking into this area a month or so ago and I desperately wanted to try and find a way of observing CF's already-existing healthchecks either through monitoring or logs instead of adding yet another set of canaries for this. I was hopeful in spotting the
So we did end up deploying another set of healthcheck canaries for this purpose. |
As @46bit says, the GOV.UK PaaS approach is to block all traffic to the AZ (we're in AWS, so it's done with a network ACL). In testing, this appeared to work sufficiently well, and we saw CloudFoundry correctly redistribute tenant applications running in the affected AZ onto cells in the unaffected AZs. We've also gone one step further and automated the process of removing the AZ in Bosh via our pipelines (specifically, we apply an ops file which removes the AZ from every instance group). The goal here is to be able to restore the level of capacity we had before the AZ outage by spreading it over the remaining AZs, so that we don't run into resource contention problems when 100% of the platform load is placed on 66% of the capacity. We've identified a couple of problems so far, but we think we're not really in a position to solve them:
I think if I could wave a magic wand today and get a solution instantly, I think I'd like it to lay with Bosh. It would be a very nice capability for it to
I say I think it should lay with Bosh, because I think it'd be nice to have these capabilities for non- |
@AP-Hunt we have seen a lot of cases where the EC2 API was overloaded or unreachable, and AWS does not guarantee any free resources during such a large-scale event. I would therefore conclude that you cannot rely 100% on respawning BOSH VMs. Instead you should overprovision before the incident happens and just evacuate the workload to the remaining AZs during the degradation. Once the AZ is healthy again according to the local health check, the REP process (or a process next to REP) can let the Diego cell receive workload again. GoRouters should be covered by the hyperscaler load balancer with health checks: if they are slow, remove the GoRouter from load balancing. |
Not to worry—they maintain roughly the right amount of spare capacity.
|
If something like this existed, it feels like it should live in the HealthMonitor. Agents already send alerts to the HealthMonitor so this feels like it would fit well alongside that existing functionality.
The HealthMonitor also already has a plugin interface. You can co-locate a job that includes a bosh-monitor binary and HealthMonitor will call that job with JSON stdin that contains "something". Never really looked at it, but it may be useful here (or maybe not).
One of the reasons something like this hasn't been built in the past is it's likely not to solve most of the problems, and is certainly not a silver bullet for partially degraded AZs.
The whole reason the "meltdown" trigger exists in the HealthMonitor is IAASs typically aren't happy about being asked to do things when they're already in a broken state. The idea of ignoring the IAAS and just draining the jobs does work around a good chunk of those problems. It would be pretty simple for the HealthMonitor to just bosh stop the instances, which should trigger all the normal bosh lifecycle events.
Actually detecting problems is sort of a nightmare though. We're looking for problems that the Agent can accurately detect from inside the VM, but that wouldn't prevent the HealthMonitor and the VM from communicating. I don't think there is a one size fits all solution for every use case, so we'd need some way to have a runtime config with a job that the agent knows how to call to ask it to check the health maybe? Or some other mechanism for configuring the agent so it knows who to ask for health info...
I'm a bit skeptical of the idea of automatically rebalancing the workloads onto the remaining AZs. The main reason to use AZs is for HA. But if you need the full capacity of all of your AZs to be able to maintain your workloads, you're not really HA. If you are using AZs for HA, your system should be able to work fine with one of the AZs totally dead. |
I think you're totally right about there not being a one size fits all
solution. That's why I'd vote for Bosh being able to raise an alarm or
change a metric value. That would allow operators to respond appropriately
for their situation (be that automatically, or with some manual
intervention, or a mix of the two).
I also wouldn't vote for automatic re-balancing, because that isn't right
for everyone either. It would work for GOV.UK PaaS because we run enough
capacity in 2 AZs to cover a missing AZ, but it's a better experience overall
(e.g. lower overall demand on each cell) if we're able to run 100% capacity
in those two AZs while the third recovers.
I personally wouldn't be fussed if the implementation of rebalancing
existed within Bosh, or if it was up to operators to change their manifests
to remove the AZ.
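The capacity point can be made concrete with a little arithmetic. This is only a rough sketch, assuming every AZ holds the same capacity and load spreads evenly over whatever AZs remain; the function names are made up for illustration and are not part of Bosh or CF.

```python
def utilisation_after_az_loss(utilisation: float, az_count: int) -> float:
    """Utilisation of the surviving AZs once one equally sized AZ is removed.

    `utilisation` is the current platform-wide utilisation as a fraction (0.6 = 60%).
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to lose one and keep running")
    # The same absolute load now lands on (az_count - 1) / az_count of the capacity.
    return utilisation * az_count / (az_count - 1)


def max_safe_utilisation(az_count: int) -> float:
    """Highest utilisation at which the remaining AZs can absorb one lost AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to lose one and keep running")
    return (az_count - 1) / az_count


if __name__ == "__main__":
    for current in (0.5, 0.6, 0.7):
        after = utilisation_after_az_loss(current, az_count=3)
        verdict = "fits" if after <= 1.0 else "does NOT fit"
        print(f"{current:.0%} across 3 AZs -> {after:.0%} across 2 AZs ({verdict})")
```

Under these assumptions a three-AZ platform running at 60% would land at about 90% on two AZs, which fits but leaves each cell much busier. That is the gap that adding capacity back in the remaining AZs would close.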
…On Mon, 13 Sep 2021, 17:22 Joseph Palermo, ***@***.***> wrote:
If something like this existed, it feels like it should live in the
HealthMonitor. Agents already send alerts to the HealthMonitor so this
feels like it would fit well alongside that existing functionality.
The HealthMonitor also already has a plugin interface. You can co-locate a
job that includes a bosh-monitor binary and HealthMonitor will call that
job with JSON stdin that contains "something". Never really looked at it,
but it may be useful here (or maybe not)
One of the reasons something like this hasn't been built in the past is
it's likely not to solve most of the problems, and is certainly not a
silver bullet for partially degraded AZs.
The whole reason the "meltdown" trigger exists in the HealthMonitor is
IAASs typically aren't happy about being asked to do things when they're
already in a broken state. The idea of ignoring the IAAS and just draining
the jobs does work around a good chunk of those problems. It would be
pretty simple for the HealthMonitor to just bosh stop the instances which
should trigger all the normal bosh lifecycle events.
Actually detecting problems is sort of a nightmare though. We're looking
for problems that the Agent can accurately detect from inside the VM, but
that wouldn't prevent the HealthMonitor and the VM from communicating. I
don't think there is a one size fits all solution for every use case, so
we'd need some way to have a runtime config with a job that the agent knows
how to call to ask it to check the health maybe? Or some other mechanism
for configuring the agent so it knows who to ask for health info...
I'm a bit skeptical of the idea of automatically rebalancing the workloads
onto the remaining AZs. The main reason to use AZs is for HA. But if you
need the full capacity of all of your AZs to be able to maintain your
workloads, you're not really HA. If you are using AZs for HA, your system
should be able to work fine with one of the AZs totally dead.
|
I know one thing that has been tossed around is the idea of a full http/readiness check, rather than only the local healthiness checks on the bosh VMs. If such a (large tbh) change was implemented, we would get a lot of interesting outcomes. One of those outcomes would be putting individual app instances behind some level of IaaS-level network healthiness check. This would of course not reschedule instances to other AZs, or solve the problem with the AZ itself, or reschedule diego cells, but it would, in the instance of an AZ network failure, probably start to remove app instances from routing tables on that instance across the board. (That said, I think you also might experience problems where your load balancer still receives traffic and then redirects it to other AZs that are perhaps still experiencing network difficulties.) Either way, I think this problem is actually quite complex and intersects with other fields. There's probably a great variety of techniques that can be applied to this, and I think we need some upper level (perhaps working group or higher level) set of understandings or plans in order to approach some of this "correctly" or "completely", or even "to the satisfaction of this particular problem". Specifically, we'd do well to rethink what "high availability" means, what steps are expected to be manual and what are expected to be automatic, and within what parameters those steps are to be automatic. |
Thanks for your feedback everyone. It sounds like BOSH could be evolved to natively support solving problems like this, or even have it added as a plugin. That would be quite neat, but SAP don't have a Highly-Available BOSH. Neither does GOV.UK PaaS. As I understand it, HA BOSH isn't widely used by anyone. That makes it quite a bad place to solve issues like this: there's a 1/N chance that BOSH itself is affected. At SAP we've been working on an agent and boshrelease named Runtime Evacuation. It'll be deployed on each Diego cell, monitor network performance, and drain Rep if the network appears to be badly compromised. The critical challenge is to avoid creating new issues (e.g. all the cells deciding to switch off at once), so to start with we're going to disable it taking action and monitor the data for a while. Hopefully this can be open sourced, I think we're looking into it.
A totally dead AZ is easy to deal with, but that's also very rare. Much more common is degraded AZs, with slower responses and higher error rates. Those are a bit of a nightmare. CF won't route traffic away from the AZ unless it's completely dead. Both SAP and GOV.UK PaaS have had situations where 1/3rd-ish of traffic is having major problems even though 2/3 AZs are perfectly healthy. |
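To illustrate the "disable it taking action and monitor the data" idea, here is a minimal sketch of an observe-only guard. It is not the Runtime Evacuation agent, whose design isn't described here. The class name, the consecutive-sample threshold and the log format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvacuationGuard:
    """Decides whether a cell should start draining, based on health samples.

    Hypothetical design: require several unhealthy samples in a row so that one
    bad probe does not trigger a drain, and default to dry-run so the decision
    is only recorded, never acted on.
    """
    required_consecutive: int = 3
    dry_run: bool = True
    log: List[str] = field(default_factory=list)
    _streak: int = field(default=0, init=False)

    def observe(self, healthy: bool) -> bool:
        """Record one sample; return True only when a real drain should start."""
        # Any healthy sample resets the streak of unhealthy ones.
        self._streak = 0 if healthy else self._streak + 1
        if self._streak < self.required_consecutive:
            return False
        if self.dry_run:
            # Observe-only mode: note what would have happened and do nothing.
            self.log.append(f"would drain after {self._streak} unhealthy samples")
            return False
        return True


if __name__ == "__main__":
    guard = EvacuationGuard(required_consecutive=2, dry_run=True)
    for sample in (False, False, True, False):
        print(sample, guard.observe(sample))
    print(guard.log)
```

Note that this only guards a single cell. It does nothing to stop every cell in the platform reaching the same conclusion at once, which would need some form of coordination or a cap on how many cells may drain.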
I've seen that 1/3rd degraded traffic before too and we've yet to find a spot that feels good to build a solution into. There's always a tricky mix of "It can solve this particular problem, but will actually make this other problem worse". So something that's able to solve even some of those degraded AZ problems, while not doing harm in other situations would be amazing. |
This issue is not a bug in cf-deployment, but it's to discuss solving a common incident for CF operators.
What is this issue about?
Several Cloud Foundry users have had outages when a single Availability Zone experiences a partial failure. Incidents like degraded networking are far more common in the Cloud than a complete outage.
Cloud Foundry is engineered to run in multiple AZs, but not to handle degraded single AZs. When a single AZ is only degraded, Cloud Foundry will keep directing new app instances and new web requests into that degraded AZ. These requests will be slow or fail. This makes Cloud Foundry partly down for its users, and right now there are few good options.
What can CF operators do right now?
Neither of the options we're aware of is very good.
Very slow: You can edit the CF manifest and do a new BOSH deploy that doesn't have VMs in the affected AZ. This is far too slow as the BOSH deploy could take an entire day for the largest CF platforms.
Slow/manual: You can choose to manually block all network traffic into the degraded Availability Zone, for instance using firewall rules. This is the approach being used by GOV.UK PaaS. This has the advantage of being very simple, but it's not seen as automated or fast enough for SAP's needs.
What do you propose?
At SAP, we think the best solution is for each VM to monitor its health. For instance an operator could configure a list of network checks. If too many of the checks fail, the VM would drain itself and kill the BOSH agent. This could also be part of Diego, and trigger a call to Rep's evacuate endpoint.
This solution can cope with more than just degraded AZs, as it would drain individual degraded servers (e.g. failing racks).
Badly chosen checks could make cells drain themselves wrongly and lead to CF downtime. This would probably be an optional feature and so the CF operator would be able to choose good network resources to check (e.g. a combination of the CF API, S3, etc.)
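As a minimal sketch of the "list of network checks" idea: the code below runs a set of operator-chosen checks and reports whether too many failed. Everything here is an assumption for illustration (the names, the failure-ratio threshold, the TCP probe). It is not an existing BOSH agent or Diego feature, and it stops at the decision rather than calling any drain or evacuate endpoint.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CheckResult:
    failed: int
    total: int
    should_drain: bool


def tcp_check(host: str, port: int, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a check that passes if a TCP connection opens within the timeout."""
    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            return False
    return check


def evaluate(checks: Dict[str, Callable[[], bool]], max_failure_ratio: float) -> CheckResult:
    """Run every check once and decide whether too many of them failed."""
    if not checks:
        # No configured checks means no evidence of a problem: never drain.
        return CheckResult(failed=0, total=0, should_drain=False)
    failed = sum(1 for check in checks.values() if not check())
    return CheckResult(failed, len(checks), failed / len(checks) > max_failure_ratio)


if __name__ == "__main__":
    # Stubbed checks keep the example offline; a real list might use
    # tcp_check("api.example.invalid", 443) against operator-chosen hosts.
    checks = {"cf-api": lambda: True, "blobstore": lambda: False, "dns": lambda: False}
    print(evaluate(checks, max_failure_ratio=0.5))
```

Draining only when the failure ratio exceeds a threshold, rather than on any single failure, is one way to limit the damage from a badly chosen check.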
Tag your pair, your PM, and/or team!
Working on this with @h0nlg at SAP. Briefly talked about this with @rkoster and @AP-Hunt.