
Diego::Client targets active bbs instance with id from locket #3002

Closed

Conversation

@will-gant (Contributor) commented Oct 3, 2022

  • A short explanation of the proposed change:

Get the bosh instance id for the currently-active bbs process from locket, so that Diego::Client can use an instance-specific bosh-dns URL instead of one that resolves to all IPs in the instance group (i.e. fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal, not bbs.service.cf.internal). A rough sketch of such a locket query is shown just after the checklist below.

  • An explanation of the use cases your change solves

Mitigates the risk of "Runner is unavailable" errors during a cf push when the client is unable to reach a bbs instance.

bbs operates as a cluster. At any one time one instance is 'active', and able to serve responses to API calls, while others are inactive. An active instance holds the bbs lock from locket.

At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).

When all bbs instances are reachable, you either hit the active instance first, or you hit an inactive one (connection refused) and the http client (implicitly) tries the next IP in the list. But if the first IP in the list is unreachable, the request hangs until it times out (by default after 10s) and then fails. The Diego::Client makes three attempts to reach the active bbs instance before giving up and raising an error, with our httpclient re-resolving DNS each time.

Say you've got two diego-api instances, and a network issue makes one of them unreachable. In that case, every API call currently has a 1/8 chance that the first IP in the list is the unreachable one on all three attempts, which causes cf push and some other commands to fail.

Note:

  • We considered switching to a q-s3 bosh-dns query (healthy instances only) but decided against this. If some process other than bbs were failing on an active bbs instance, clients would be stuck unable to reach the Diego API (DNS would resolve only to inactive instances, while nothing would trigger a new leadership election).

  • locket is colocated with bbs on the diego-api instance. Retries are handled automatically by gRPC (and with a shorter 5-second timeout).

  • Links to any other associated PRs

cloudfoundry/capi-release#275 (should be merged before this PR)

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests*
    * Run against the following cf-acceptance-test suites: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3
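
For reviewers unfamiliar with locket, the following is a rough, illustrative sketch (not the code in this PR) of what asking locket for the active bbs instance could look like. The Models::Locket::Stub and Models::FetchRequest names follow the generated locket protobuf stubs that cloud_controller_ng already ships for the cc_deployment_updater; the 'bbs' lock key, the exact class names, and the assumption that the lock's owner field carries the bosh instance id should all be treated as assumptions here, not as a description of the final implementation.

require 'grpc'
# Assumes the generated locket protobuf stubs (Models::Locket, Models::FetchRequest) are loaded.

class ActiveBbsInstanceLookup
  BBS_LOCK_KEY = 'bbs'.freeze # assumed key under which the active bbs holds its lock

  def initialize(host:, port:, client_ca_path:, client_key_path:, client_cert_path:)
    credentials = GRPC::Core::ChannelCredentials.new(
      File.read(client_ca_path),
      File.read(client_key_path),
      File.read(client_cert_path)
    )
    # locket is colocated with bbs on the diego-api VMs; gRPC handles retries with a 5-second deadline
    @service = Models::Locket::Stub.new("#{host}:#{port}", credentials, timeout: 5)
  end

  # Returns the bosh instance id of the active bbs, which can then be used to build an
  # instance-specific bosh-dns name such as "#{instance_id}.bbs.service.cf.internal".
  def active_bbs_instance_id
    response = @service.fetch(Models::FetchRequest.new(key: BBS_LOCK_KEY))
    response.resource.owner
  end
end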

@will-gant changed the title from "Cc bbs communication" to "Diego::Client gets id of active bbs instance from locket" on Oct 10, 2022
@will-gant changed the title from "Diego::Client gets id of active bbs instance from locket" to "Diego::Client targets active bbs instance with id from locket" on Oct 10, 2022
@will-gant (Contributor, Author) commented Oct 11, 2022

This holds up quite nicely with an HA setup, even when all VMs in a particular availability zone are isolated to simulate an outage. I've run the majority of cf-acceptance-test specs, and all pass in that scenario. Running without the changes in the same scenario, the very first cf push fails with this:

{
  "created_at": "2022-10-11T09:47:51Z",
  "errors": [
    {
      "code": 170015,
      "detail": "Runner is unavailable: Process Guid: eb07d1a8-b37b-440a-b3fe-166d877c99b8-35a31446-15a1-4813-b489-89a208e5c179: execution expired",
      "title": "CF-RunnerUnavailable"
    }
  ],
  "guid": "a42634d2-2b91-4030-8e57-0d89e37070cb",
  "links": {
    "self": {
      "href": "https://redacted.sapcloud.io/v3/jobs/a42634d2-2b91-4030-8e57-0d89e37070cb"
    },
    "space": {
      "href": "https://redacted.sapcloud.io/v3/spaces/013822bf-4c0d-40e6-8fa7-1622d46ae59a"
    }
  },
  "operation": "space.apply_manifest",
  "state": "FAILED",
  "updated_at": "2022-10-11T09:48:22Z",
  "warnings": []
}

@will-gant marked this pull request as ready for review on October 11, 2022, 15:03
@philippthun self-assigned this on Oct 12, 2022
@philippthun (Member) left a comment


I like the idea of using locket to determine the active bbs instance!

@will-gant force-pushed the cc-bbs-communication branch from a781f23 to 69587e9 on October 17, 2022, 08:08
@mkocher (Member) commented Oct 19, 2022

Locket seems like an implementation detail of Diego, I'd hesitate to use it for service discovery from another component without thinking deeply about it.

Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?

@will-gant force-pushed the cc-bbs-communication branch from 8a11503 to 3c7806e on October 20, 2022, 14:03
@linux-foundation-easycla (bot) commented Oct 20, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

will-gant added a commit to sap-contributions/capi-release that referenced this pull request Oct 20, 2022
Renamed this to Locket::Client and changed its initialization args so it can be used where the 'owner' of a key isn't already known. This enables Diego::Client to use it to retrieve the bosh instance id of the bbs server that is currently active.
@will-gant force-pushed the cc-bbs-communication branch from 3c7806e to 7a5681a on October 20, 2022, 14:09
@will-gant requested a review from philippthun on October 20, 2022, 14:31
@will-gant (Contributor, Author) commented Oct 20, 2022

Locket seems like an implementation detail of Diego, I'd hesitate to use it for service discovery from another component without thinking deeply about it.

Definitely a valid concern. At the same time, it's worth weighing that against the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket: the routing-api, tps, policy-server-asg-syncer and cc_deployment_updater. The latter is part of cloud_controller_ng and has had a dependency on locket since CAPI 1.63.0, in 2018.

The risk is lowered a bit more by the fact that we're doing this with a quite modest tweak to a pre-existing wrapper around a gRPC client whose Ruby code has been generated from a proto file. That provides some insurance against things changing.

Given the above, and the relatively superficial nature of the dependency I've added, I feel on balance it's unlikely that future changes in the way Diego uses Locket are going to be drastic enough to risk serious disruption for us (or that the CC's usage would deter such refactoring).

Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?

Currently the CC uses the httpclient gem, which handles DNS resolution itself. The alternative would, I guess, be for Diego::Client to use another library to do an 'explicit' lookup with bosh-dns, loop over the IP list returned, and try a httpclient.post on each IP, rescuing on the failures, then cache and reuse the IP for the active server. This would work, though I think having the client do DNS lookups like this and continuing to use trial-and-error to discover the server (albeit with the caching) isn't ideal.
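
For concreteness, here is a rough sketch (purely illustrative, not code from this PR or the codebase) of that alternative: resolve the bosh-dns name explicitly with Resolv from the Ruby standard library, try each IP with a short connect timeout, and cache whichever one answers until it stops working. The host and port mirror the current config.diego.bbs.url default; the 2-second timeout and the class name are made up, and TLS/hostname verification against a bare IP is glossed over entirely.

require 'httpclient'
require 'resolv'

class BbsEndpointCache
  BBS_HOST = 'bbs.service.cf.internal'.freeze
  BBS_PORT = 8889

  def initialize
    @client = HTTPClient.new
    @client.connect_timeout = 2 # fail fast on unreachable IPs (assumed value)
    @cached_ip = nil
  end

  def post(path, body)
    candidate_ips.each do |ip|
      begin
        response = @client.post("https://#{ip}:#{BBS_PORT}#{path}", body: body)
        @cached_ip = ip # remember the instance that answered, until it stops answering
        return response
      rescue HTTPClient::ConnectTimeoutError, Errno::ECONNREFUSED
        @cached_ip = nil # inactive or unreachable instance; try the next IP in the list
      end
    end
    raise 'no reachable bbs instance'
  end

  private

  # Prefer the previously successful IP, then fall back to a fresh DNS lookup.
  def candidate_ips
    ips = Resolv.getaddresses(BBS_HOST)
    @cached_ip ? ([@cached_ip] + ips).uniq : ips
  end
end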

It'd be good to hear from others, as the fact that I've just finished working on a different solution biases my opinion somewhat!

@mkocher

@will-gant (Contributor, Author) commented:
@sethboyles If you get time, it'd be great to get your thoughts on this!

@will-gant (Contributor, Author) commented Oct 21, 2022

Also threw together a couple of diagrams. The first one shows what can happen today if there's an outage and the first IP in the randomly-ordered list that bbs.service.cf.internal resolves to is, by chance, an unreachable diego-api instance three times in a row:

[Diagram: diego-bbs-comms-1]

The second one shows how this would work with the change I'm proposing:

[Diagram: diego-bbs-comms-2]

@philippthun (Member) left a comment


Looks good to me. I won't approve until we've sorted out the concerns raised by @mkocher.

@will-gant force-pushed the cc-bbs-communication branch from b233f24 to 9c72962 on October 21, 2022, 16:24
@sethboyles (Member) commented Oct 21, 2022

At the same time it's worth weighing against that the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket

Both TPS and the deployment-updater are a part of CAPI, but as far as I can tell both of them use Locket to register their own locks, not query information about Diego.

I don't have much of an opinion yet on whether or not using Locket to query information about Diego is a pattern we should avoid or not, but I know little about Locket and how Diego uses it. Perhaps we should reach out to the Diego team for their thoughts?

@will-gant (Contributor, Author) commented Oct 21, 2022

Yep, I'll ask. @ameowlia, @geofffranks, @jrussett - I wonder if any of you, or your other Diego colleagues, have an opinion on the proposed use of Locket here?

@geofffranks (Contributor) commented:
This seems like a great proposal to me. The data is authoritative about which VM is the active BBS. There's precedent for this in the cfdot locks command (it pulls data from locket to display the active BBS and Auctioneer VMs).

@jpalermo (Member) commented:
Wait, hold on. It feels very wrong to have a service running on the platform where a client has to inspect which instances of the service are healthy and then target them directly.

bbs.service.cf.internal should not return instances that do not actually work, which is what it sounds like currently happens.

My first thought was that bosh-dns healthiness could be used to ensure the "unhealthy" VMs aren't actually returned by bbs.service.cf.internal, and that does work great, right up until you co-locate the bbs job on a VM with other jobs; then the whole VM is unhealthy for bosh-dns query purposes, which is probably bad.

Using process-level healthiness does "solve" the problem with q-s3 queries mentioned above, because you could just switch to q-s0: in the case where another job on the active bbs VM is failing, all instances would be in an "unhealthy" state, and you'd basically fall back to the behavior we have now. But again, people do co-locate jobs differently than cf-deployment does, so until we add job-level healthiness to bosh-dns-aliases, this seems like a bad idea.

So the other option would be to put this healthiness logic in the bbs itself: when a non-active VM received a request, it could proxy it to a healthy one, or maybe it could just return a 302 and the clients would follow that to the proper VM. That would mean clients don't need to do any special logic and can just query bbs.service.cf.internal.

@philippthun (Member) commented:
The origin of the problem is that the Ruby HTTP client used for talking to BBS handles unreachable (i.e. isolated from a network point of view) endpoints poorly.

This is not a general problem: when you use cURL to retrieve data from bbs.service.cf.internal and there is an unreachable VM, the request to that VM times out very quickly and the next IP address is used. So in general this is something that a client could handle.

I don't think that Diego/BBS is doing something wrong; rejecting connections on an inactive VM seems like a reasonable approach and works well - the issue we are seeing is only related to unreachable VMs.

In general I see three different possibilities to solve this:

  1. Fix/enhance the currently used Ruby HTTP client.
  2. Identify and use a different Ruby HTTP client that behaves more like cURL.
  3. Use an alternative approach to identify the currently active BBS instance to circumvent issues with unreachable VMs.

  1. We tested a monkey patch (see [1]) of the TCPSocket class, using the connect_timeout parameter that was introduced with Ruby 3.0. Setting this to a lower value than HTTPClient.connect_timeout (default: 10 seconds) results in behavior similar to a refused connection, i.e. the next IP address is tried implicitly. We didn't dare to apply this monkey patch, as it would affect all TCP sockets created in Ruby, and we are also not sure whether a contribution to the httpclient gem would be feasible.

  2. We have not gone down this road so far.

  3. That's the purpose of this PR. It can be seen as a workaround for the deficient Ruby HTTP client, but it prevents the issues we are seeing with unreachable VMs.

[1]

module DefaultConnectTimeout
  DEFAULT_CONNECT_TIMEOUT = 2 # seconds

  # Pass a default connect_timeout (available since Ruby 3.0) unless the caller
  # provides one, so connects to unreachable hosts fail fast.
  def initialize(remote_host, remote_port, local_host = nil, local_port = nil, connect_timeout: nil)
    super(remote_host, remote_port, local_host, local_port, connect_timeout: connect_timeout || DEFAULT_CONNECT_TIMEOUT)
  end
end

# Prepend so the wrapper above runs in front of TCPSocket's own initialize.
class TCPSocket
  prepend DefaultConnectTimeout
end
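
To make the effect of [1] concrete, here is a tiny illustration (assuming the patch above has been loaded): a plain socket connect to an unreachable address now gives up after DEFAULT_CONNECT_TIMEOUT seconds with Errno::ETIMEDOUT instead of hanging for the much longer default; in httpclient this is what makes an unreachable IP fail fast enough that the next resolved IP gets tried, as described above. The 203.0.113.1 address is just a documentation-range stand-in for an unreachable diego-api VM.

require 'socket'

begin
  # With DefaultConnectTimeout prepended to TCPSocket, this connect attempt is
  # bounded by DEFAULT_CONNECT_TIMEOUT (2 seconds).
  TCPSocket.new('203.0.113.1', 8889)
rescue Errno::ETIMEDOUT, Errno::EHOSTUNREACH => e
  puts "connect failed quickly: #{e.class}"
end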

@jpalermo (Member) commented:
Thanks for those details, @philippthun. That does make it seem less like a Diego problem than I was saying, and I'm no longer thinking the problem should be solved elsewhere.

It's unfortunate that the timeout isn't configurable in the http client; that would be an ideal solution.

@sethboyles (Member) commented:
I think it's worth evaluating other HTTP clients to see if there is support for this before adding a workaround on our side.

@sethboyles (Member) commented:
httpclient hasn't had a release in 6 years: https://rubygems.org/gems/httpclient

@will-gant (Contributor, Author) commented:
OK, given the mixed reaction to this I'm going to close it, and we'll take a look at alternative HTTP clients.

@will-gant (Contributor, Author) commented:
Picking this story up again. Just noting that, as far as I can tell, there doesn't seem to be any maintained Ruby HTTP client that, for a domain that resolves to multiple IPs, will try the next IP in the list if the first one it tries times out.

So I think we'll just need to resolve the DNS ourselves as @mkocher suggested.
