Diego::Client targets active bbs instance with id from locket #3002
This holds up quite nicely with a HA setup, even when all VMs in a particular availability zone are isolated to simulate an outage. I've run the majority of the cf-acceptance-test specs, and all pass in that scenario. Running without the changes in the same scenario, the very first one fails with the following error:

{
  "created_at": "2022-10-11T09:47:51Z",
  "errors": [
    {
      "code": 170015,
      "detail": "Runner is unavailable: Process Guid: eb07d1a8-b37b-440a-b3fe-166d877c99b8-35a31446-15a1-4813-b489-89a208e5c179: execution expired",
      "title": "CF-RunnerUnavailable"
    }
  ],
  "guid": "a42634d2-2b91-4030-8e57-0d89e37070cb",
  "links": {
    "self": {
      "href": "https://redacted.sapcloud.io/v3/jobs/a42634d2-2b91-4030-8e57-0d89e37070cb"
    },
    "space": {
      "href": "https://redacted.sapcloud.io/v3/spaces/013822bf-4c0d-40e6-8fa7-1622d46ae59a"
    }
  },
  "operation": "space.apply_manifest",
  "state": "FAILED",
  "updated_at": "2022-10-11T09:48:22Z",
  "warnings": []
}
I like the idea of using locket to determine the active bbs instance!
Locket seems like an implementation detail of Diego; I'd hesitate to use it for service discovery from another component without thinking deeply about it. Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?
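For illustration only (this is not code from the PR), a minimal sketch of the approach suggested above, with hypothetical names: resolve the alias, try each IP in turn, and keep the last one that worked in memory until it fails again.

```ruby
require 'resolv'

# Hypothetical sketch of the suggestion above: iterate over every IP the alias
# resolves to rather than relying on the first (randomly ordered) record, and
# cache the IP that last worked. BBS_HOST and the class name are illustrative.
class BbsEndpointCache
  BBS_HOST = 'bbs.service.cf.internal'.freeze

  def initialize
    @cached_ip = nil
  end

  # Yields candidate IPs to the given block (cached one first); remembers the
  # candidate the block succeeds with, and forgets it again on failure.
  def with_reachable_ip
    candidates = ([@cached_ip] + Resolv.getaddresses(BBS_HOST)).compact.uniq
    last_error = nil
    candidates.each do |ip|
      begin
        result = yield ip
        @cached_ip = ip
        return result
      rescue StandardError => e
        @cached_ip = nil if ip == @cached_ip
        last_error = e
      end
    end
    raise(last_error || Resolv::ResolvError.new("no addresses found for #{BBS_HOST}"))
  end
end
```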
Renamed this to Locket::Client and changed its initialization args so it can be used where the 'owner' of a key isn't already known. This enables Diego::Client to use it to retrieve the bosh instance id of the bbs server that is currently active.
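As a rough sketch of the call involved (not the PR's actual code; the class and method names assume Ruby stubs generated from Locket's protofile, and that the active bbs instance holds the lock stored under the key 'bbs'):

```ruby
require 'grpc'
# Also requires the classes generated from locket.proto (assumed here to be
# namespaced under Models, e.g. Models::Locket::Stub and Models::FetchRequest).

# Hedged sketch: ask locket for the resource stored under the 'bbs' key and
# read its owner, which identifies the currently-active bbs instance.
def active_bbs_instance_id(address:, ca_cert:, client_key:, client_cert:)
  credentials = GRPC::Core::ChannelCredentials.new(ca_cert, client_key, client_cert)
  stub = Models::Locket::Stub.new(address, credentials, timeout: 5)
  response = stub.fetch(Models::FetchRequest.new(key: 'bbs'))
  response.resource.owner
end
```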
Get the bosh instance id for the currently-active bbs process from locket, so that Diego::Client can use an instance-specific bosh-dns URL instead of one that resolves to all IPs in the instance group (i.e. fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal, not bbs.service.cf.internal). This mitigates the risk of Runner is unavailable errors during a cf push when the client is unable to reach a bbs instance.

bbs operates as a cluster. At any one time one instance is 'active' and able to serve responses to API calls, while the others are inactive. The active instance holds the bbs lock from locket.

At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).

When all bbs instances are reachable, you either hit the active instance first, or you hit an inactive one (connection refused) and the http client tries the next IP in the list. But if the first IP in the list is unreachable, the request hangs until it times out (by default after 10s) and then fails. The Diego::Client makes three attempts to reach the active bbs instance before giving up and raising an error.

Say you've got two availability zones with one diego-api instance in each, and a network issue makes one unreachable. In such a case, every API call currently has a 1/8 chance (i.e. (1/2)^3) that the first IP in the list is the unreachable one three times in a row, which will cause cf push and other commands to fail.
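To make the intended behaviour concrete, here is an illustrative sketch (method names and the fallback are mine, not necessarily the PR's): prefix the configured alias with the instance id fetched from locket, and fall back to the plain alias if that lookup fails.

```ruby
require 'uri'

# Illustrative only. Assumes bbs_alias_url is the configured q-s4 alias
# (e.g. https://bbs.service.cf.internal:8889) and active_bbs_instance_id is a
# hypothetical helper that returns the owner of the 'bbs' lock, or nil.
def instance_specific_bbs_url(bbs_alias_url)
  uri = URI(bbs_alias_url)
  instance_id = active_bbs_instance_id
  # e.g. https://fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal:8889
  uri.host = "#{instance_id}.#{uri.host}" if instance_id
  uri.to_s
rescue StandardError
  bbs_alias_url # if locket can't be reached, fall back to the q-s4 alias
end
```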
Definitely a valid concern. At the same time it's worth weighing that against the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket: the routing-api, tps, policy-server-asg-syncer and cc_deployment_updater. The latter is part of cloud_controller_ng, and has had a dependency on locket since CAPI 1.63.0, in 2018. The risk is lowered a bit more by the fact that we're doing this with a quite modest tweak to a pre-existing wrapper around a gRPC client whose Ruby code has been generated from a protofile, which provides some insurance against things changing. Given the above, and the relatively superficial nature of the dependency I've added, I feel on balance it's unlikely that future changes in the way Diego uses Locket will be drastic enough to risk serious disruption for us (or that the CC's usage would deter such refactoring).

Currently the CC uses the

It'd be good to hear from others, as the fact that I've just finished working on a different solution biases my opinion somewhat!
@sethboyles If you get time, it'd be great to get your thoughts on this!
Also threw together a couple of diagrams. The first one shows what can happen today if there's an outage and the first IP in the randomly-ordered list is the unreachable one. The second one shows how this would work with the change I'm proposing.
Looks good to me. I won't approve until we've sorted out the concerns raised by @mkocher.
Both TPS and the deployment-updater are a part of CAPI, but as far as I can tell both of them use Locket to register their own locks, not to query information about Diego. I don't have much of an opinion yet on whether using Locket to query information about Diego is a pattern we should avoid, but I know little about Locket and how Diego uses it. Perhaps we should reach out to the Diego team for their thoughts?
Yep, I'll ask. @ameowlia, @geofffranks, @jrussett - I wonder if any of you, or your other Diego colleagues, have an opinion on the proposed use of Locket here?
This seems like a great proposal to me. The data is authoritative about who the active BBS VM is. There's precedent for this in the
Wait, hold on - it feels very wrong to have a service running on the platform where a client has to inspect which instances of the service are healthy and then target them directly.
My first thought was that bosh-dns healthiness could be used to ensure the "unhealthy" VMs aren't actually returned by the DNS lookup.

Using process-level healthiness does "solve" the problem with s3 queries mentioned above, because you could just switch to s0, and in the case where another job on the active bbs VM is failing, they would all be in an "unhealthy" state, so you'd basically fall back to the behavior we have now. But again, people do co-locate jobs differently than is done in cf-deployment, so until we add job-level healthiness to bosh-dns-aliases, this seems like probably a bad idea.

So the other option would be to put this healthiness logic in the bbs itself, and then when a non-active VM got a request, it could proxy it to a healthy one, or maybe it could just return a redirect.
The origin of the problem is that the Ruby HTTP client used for talking to BBS handles unreachable (i.e. isolated from a network point of view) endpoints badly. This is not a general problem; when you use cURL to retrieve data from a host that resolves to multiple IPs, it simply tries the next address if one of them is unreachable.

I don't think that Diego/BBS is doing something wrong; rejecting connections on an inactive VM seems like a reasonable approach and works well - the issue we are seeing is only related to unreachable VMs.

In general I see three different possibilities to solve this:
Thanks for those details @philippthun - that does make it seem less like a Diego problem than I was suggesting, and I'm no longer thinking the problem should be solved elsewhere. It's unfortunate that the timeout isn't configurable in the http client; that would be an ideal solution.
I think it's worth evaluating other HTTP clients to see if there is support for this before adding a workaround on our side.
httpclient hasn't had a release in 6 years: https://rubygems.org/gems/httpclient
Ok, given the mixed reaction to this I'm going to close it and we'll take a look at alternative HTTP clients.
Picking this story up again. Just noting that as far as I can tell there doesn't seem to be any (maintained) Ruby HTTP client that, for a domain that resolves to multiple IPs, will try the next IP in the list if the first one it tries times out. So I think we'll just need to resolve the DNS ourselves as @mkocher suggested.
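For what it's worth, a minimal sketch of the "resolve the DNS ourselves" idea (host, port and timeout values are illustrative): resolve all addresses behind the alias up front and probe each with its own short connect timeout, so a single unreachable instance can't stall the whole request.

```ruby
require 'resolv'
require 'socket'

# Illustrative sketch: return the first IP behind the alias that accepts a TCP
# connection within a short per-IP timeout; returns nil if none are reachable.
def first_reachable_ip(host: 'bbs.service.cf.internal', port: 8889, connect_timeout: 1)
  Resolv.getaddresses(host).each do |ip|
    Socket.tcp(ip, port, connect_timeout: connect_timeout) { |_sock| return ip }
  rescue StandardError
    next # refused or timed out - try the next IP in the list
  end
  nil
end
```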
Get the bosh instance id for the currently-active bbs process from locket, so that Diego::Client can use an instance-specific bosh-dns URL instead of one that resolves to all IPs in the instance group (i.e. fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal, not bbs.service.cf.internal). This mitigates the risk of Runner is unavailable errors during a cf push when the client is unable to reach a bbs instance.

bbs operates as a cluster. At any one time one instance is 'active' and able to serve responses to API calls, while the others are inactive. The active instance holds the bbs lock from locket.

At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).

When all bbs instances are reachable, you either hit the active instance first, or you hit an inactive one (connection refused) and the http client (implicitly) tries the next IP in the list. But if the first IP in the list is unreachable, the request hangs until it times out (by default after 10s) and then fails. The Diego::Client makes three attempts to reach the active bbs instance before giving up and raising an error, with our httpclient re-resolving DNS each time.

Say you've got two diego-api instances, and a network issue makes one of them unreachable. In such a case, every API call currently has a 1/8 chance that the first IP in the list is the unreachable one three times in a row, which will cause cf push and some other commands to fail.

Note:

We considered switching to a q-s3 bosh-dns query (healthy instances only) but decided against it. If some process other than bbs were failing on the active bbs instance, clients would be stuck unable to reach the Diego API (DNS would resolve only to inactive instances, while nothing would trigger a new leadership election).

locket is colocated with bbs on the diego-api instance. Retries are handled automatically by gRPC (and with a shorter 5-second timeout).

Links to any other associated PRs
cloudfoundry/capi-release#275 (should be merged before this PR)
I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests*
* Run against the following cf-acceptance-test suites: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3