
Diego::Client targets active bbs instance with id from locket #3002

Closed

Conversation

@will-gant (Contributor) commented Oct 3, 2022

  • A short explanation of the proposed change:

Get the bosh instance id for the currently-active bbs process from locket, so that Diego::Client can use an instance-specific bosh-dns URL instead of one that resolves to all IPs in the instance group (i.e. fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal, not bbs.service.cf.internal). A rough sketch of such a locket query is shown just after the checklist below.

  • An explanation of the use cases your change solves

Mitigates the risk of "Runner is unavailable" errors during a cf push when the client is unable to reach a bbs instance.

bbs operates as a cluster. At any one time one instance is 'active', and able to serve responses to API calls, while others are inactive. An active instance holds the bbs lock from locket.

At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).

When all bbs instances are reachable, you either hit the active instance first, or you hit an inactive one (connection refused) and the http client (implicitly) tries the next IP in the list. But if the first IP in the list is unreachable, the request hangs until it times out (by default after 10s) and then fails. The Diego::Client makes three attempts to reach the active bbs instance before giving up and raising an error, with our httpclient re-resolving DNS each time.

Say you've got two diego-api instances, and a network issue makes one of them unreachable. In that case, every API call currently has a 1/8 chance that the first IP in the list is the unreachable one on all three attempts, which causes cf push and some other commands to fail.

Note:

  • We considered switching to a q-s3 bosh-dns query (healthy instances only) but decided against this. If some process other than bbs were failing on an active bbs instance, clients would be stuck unable to reach the Diego API (DNS would resolve only to inactive instances, while nothing would trigger a new leadership election).

  • locket is colocated with bbs on the diego-api instance. Retries are handled automatically by gRPC (and with a shorter 5-second timeout).

  • Links to any other associated PRs

cloudfoundry/capi-release#275 (should be merged before this PR)

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests*
    * Run against the following cf-acceptance-test suites: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3
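
For reviewers unfamiliar with locket, the following is a rough, illustrative sketch (not the code in this PR) of what asking locket for the active bbs instance could look like. The Models::Locket::Stub and Models::FetchRequest names follow the generated locket protobuf stubs that cloud_controller_ng already ships for the cc_deployment_updater; the 'bbs' lock key, the exact class names, and the assumption that the lock's owner field carries the bosh instance id should all be treated as assumptions here, not as a description of the final implementation.

require 'grpc'
# Assumes the generated locket protobuf stubs (Models::Locket, Models::FetchRequest) are loaded.

class ActiveBbsInstanceLookup
  BBS_LOCK_KEY = 'bbs'.freeze # assumed key under which the active bbs holds its lock

  def initialize(host:, port:, client_ca_path:, client_key_path:, client_cert_path:)
    credentials = GRPC::Core::ChannelCredentials.new(
      File.read(client_ca_path),
      File.read(client_key_path),
      File.read(client_cert_path)
    )
    # locket is colocated with bbs on the diego-api VMs; gRPC handles retries with a 5-second deadline
    @service = Models::Locket::Stub.new("#{host}:#{port}", credentials, timeout: 5)
  end

  # Returns the bosh instance id of the active bbs, which can then be used to build an
  # instance-specific bosh-dns name such as "#{instance_id}.bbs.service.cf.internal".
  def active_bbs_instance_id
    response = @service.fetch(Models::FetchRequest.new(key: BBS_LOCK_KEY))
    response.resource.owner
  end
end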

@will-gant changed the title from "Cc bbs communication" to "Diego::Client gets id of active bbs instance from locket" on Oct 10, 2022
@will-gant changed the title from "Diego::Client gets id of active bbs instance from locket" to "Diego::Client targets active bbs instance with id from locket" on Oct 10, 2022
@will-gant (Contributor, Author) commented Oct 11, 2022

This holds up quite nicely with an HA setup, even when all VMs in a particular availability zone are isolated to simulate an outage. I've run the majority of cf-acceptance-test specs, and all pass in that scenario. Running without the changes in the same scenario, the very first cf push fails with this:

{
  "created_at": "2022-10-11T09:47:51Z",
  "errors": [
    {
      "code": 170015,
      "detail": "Runner is unavailable: Process Guid: eb07d1a8-b37b-440a-b3fe-166d877c99b8-35a31446-15a1-4813-b489-89a208e5c179: execution expired",
      "title": "CF-RunnerUnavailable"
    }
  ],
  "guid": "a42634d2-2b91-4030-8e57-0d89e37070cb",
  "links": {
    "self": {
      "href": "https://redacted.sapcloud.io/v3/jobs/a42634d2-2b91-4030-8e57-0d89e37070cb"
    },
    "space": {
      "href": "https://redacted.sapcloud.io/v3/spaces/013822bf-4c0d-40e6-8fa7-1622d46ae59a"
    }
  },
  "operation": "space.apply_manifest",
  "state": "FAILED",
  "updated_at": "2022-10-11T09:48:22Z",
  "warnings": []
}

@will-gant marked this pull request as ready for review on October 11, 2022, 15:03
@philippthun self-assigned this on Oct 12, 2022
@philippthun (Member) left a comment


I like the idea of using locket to determine the active bbs instance!

@will-gant force-pushed the cc-bbs-communication branch from a781f23 to 69587e9 on October 17, 2022, 08:08
@mkocher (Member) commented Oct 19, 2022

Locket seems like an implementation detail of Diego, I'd hesitate to use it for service discovery from another component without thinking deeply about it.

Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?

@will-gant force-pushed the cc-bbs-communication branch from 8a11503 to 3c7806e on October 20, 2022, 14:03
@linux-foundation-easycla (bot) commented Oct 20, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

will-gant added a commit to sap-contributions/capi-release that referenced this pull request Oct 20, 2022
Renamed this to Locket::Client and changed its initialization args so it can be used where the 'owner' of a key isn't already known. This enables Diego::Client to use it to retrieve the bosh instance id of the bbs server that is currently active.
@will-gant force-pushed the cc-bbs-communication branch from 3c7806e to 7a5681a on October 20, 2022, 14:09
@will-gant requested a review from philippthun on October 20, 2022, 14:31
@will-gant (Contributor, Author) commented Oct 20, 2022

Locket seems like an implementation detail of Diego, I'd hesitate to use it for service discovery from another component without thinking deeply about it.

Definitely a valid concern. At the same time, it's worth weighing that against the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket: the routing-api, tps, policy-server-asg-syncer and cc_deployment_updater. The latter is part of cloud_controller_ng and has had a dependency on locket since CAPI 1.63.0, in 2018.

The risk is lowered a bit more by the fact that we're doing this with a quite modest tweak to a pre-existing wrapper around a gRPC client whose Ruby code has been generated from a proto file. That provides some insurance against things changing.

Given the above, and the relatively superficial nature of the dependency I've added, I feel on balance it's unlikely that future changes in the way Diego uses Locket are going to be drastic enough to risk serious disruption for us (or that the CC's usage would deter such refactoring).

Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?

Currently the CC uses the httpclient gem, which handles DNS resolution itself. The alternative would, I guess, be for Diego::Client to use another library to do an 'explicit' lookup with bosh-dns, loop over the IP list returned, and try a httpclient.post on each IP, rescuing on the failures, then cache and reuse the IP for the active server. This would work, though I think having the client do DNS lookups like this and continuing to use trial-and-error to discover the server (albeit with the caching) isn't ideal.
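
For concreteness, here is a rough sketch (purely illustrative, not code from this PR or the codebase) of that alternative: resolve the bosh-dns name explicitly with Resolv from the Ruby standard library, try each IP with a short connect timeout, and cache whichever one answers until it stops working. The host and port mirror the current config.diego.bbs.url default; the 2-second timeout and the class name are made up, and TLS/hostname verification against a bare IP is glossed over entirely.

require 'httpclient'
require 'resolv'

class BbsEndpointCache
  BBS_HOST = 'bbs.service.cf.internal'.freeze
  BBS_PORT = 8889

  def initialize
    @client = HTTPClient.new
    @client.connect_timeout = 2 # fail fast on unreachable IPs (assumed value)
    @cached_ip = nil
  end

  def post(path, body)
    candidate_ips.each do |ip|
      begin
        response = @client.post("https://#{ip}:#{BBS_PORT}#{path}", body: body)
        @cached_ip = ip # remember the instance that answered, until it stops answering
        return response
      rescue HTTPClient::ConnectTimeoutError, Errno::ECONNREFUSED
        @cached_ip = nil # inactive or unreachable instance; try the next IP in the list
      end
    end
    raise 'no reachable bbs instance'
  end

  private

  # Prefer the previously successful IP, then fall back to a fresh DNS lookup.
  def candidate_ips
    ips = Resolv.getaddresses(BBS_HOST)
    @cached_ip ? ([@cached_ip] + ips).uniq : ips
  end
end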

It'd be good to hear from others, as the fact that I've just finished working on a different solution biases my opinion somewhat!

@mkocher

@will-gant (Contributor, Author) commented:
@sethboyles If you get time, it'd be great to get your thoughts on this!

@will-gant (Contributor, Author) commented Oct 21, 2022

Also threw together a couple of diagrams. The first one shows what can happen today if there's an outage and the first IP in the randomly-ordered list that bbs.service.cf.internal resolves to is, by chance, an unreachable diego-api instance three times in a row:

[Diagram: diego-bbs-comms-1]

The second one shows how this would work with the change I'm proposing:

[Diagram: diego-bbs-comms-2]

@philippthun (Member) left a comment


Looks good to me. I won't approve until we've sorted out the concerns raised by @mkocher.

@will-gant force-pushed the cc-bbs-communication branch from b233f24 to 9c72962 on October 21, 2022, 16:24
@sethboyles (Member) commented Oct 21, 2022

At the same time it's worth weighing against that the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket

Both TPS and the deployment-updater are a part of CAPI, but as far as I can tell both of them use Locket to register their own locks, not query information about Diego.

I don't have much of an opinion yet on whether or not using Locket to query information about Diego is a pattern we should avoid or not, but I know little about Locket and how Diego uses it. Perhaps we should reach out to the Diego team for their thoughts?

@will-gant (Contributor, Author) commented Oct 21, 2022

Yep, I'll ask. @ameowlia, @geofffranks, @jrussett - I wonder if any of you, or your other Diego colleagues, have an opinion on the proposed use of Locket here?

@geofffranks (Contributor) commented:
This seems like a great proposal to me. The data is authoritative about which VM is the active BBS. There's precedent for this in the cfdot locks command (it pulls data from locket to display the active BBS and Auctioneer VMs).

@jpalermo (Member) commented:
Wait, hold on. It feels very wrong to have a service running on the platform where a client has to inspect which instances of the service are healthy and then target them directly.

bbs.service.cf.internal should not return instances that do not actually work, which is what it sounds like currently happens.

My first thought was that bosh-dns healthiness could be used to ensure the "unhealthy" VMs aren't actually returned by bbs.service.cf.internal, and that does work great, right up until you co-locate the bbs job on a VM with other jobs; then the whole VM is unhealthy for bosh-dns query purposes, which is probably bad.

Using process-level healthiness does "solve" the problem with q-s3 queries mentioned above, because you could just switch to q-s0: in the case where another job on the active bbs VM is failing, all instances would be in an "unhealthy" state, and you'd basically fall back to the behavior we have now. But again, people do co-locate jobs differently than cf-deployment does, so until we add job-level healthiness to bosh-dns-aliases, this seems like a bad idea.

So the other option would be to put this healthiness logic in the bbs itself: when a non-active VM received a request, it could proxy it to a healthy one, or maybe it could just return a 302 and the clients would follow that to the proper VM. That would mean clients don't need to do any special logic and can just query bbs.service.cf.internal.

@philippthun (Member) commented:
The origin of the problem is that the Ruby HTTP client used for talking to BBS handles unreachable (i.e. isolated from a network point of view) endpoints poorly.

This is not a general problem: when you use cURL to retrieve data from bbs.service.cf.internal and there is an unreachable VM, the request to that VM times out very quickly and the next IP address is used. So in general this is something that a client could handle.

I don't think that Diego/BBS is doing something wrong; rejecting connections on an inactive VM seems like a reasonable approach and works well - the issue we are seeing is only related to unreachable VMs.

In general I see three different possibilities to solve this:

  1. Fix/enhance the currently used Ruby HTTP client.
  2. Identify and use a different Ruby HTTP client that behaves more like cURL.
  3. Use an alternative approach to identify the currently active BBS instance to circumvent issues with unreachable VMs.

  1. We tested a monkey patch (see [1]) of the TCPSocket class, using the connect_timeout parameter that was introduced with Ruby 3.0. Setting this to a lower value than HTTPClient.connect_timeout (default: 10 seconds) results in behavior similar to a refused connection, i.e. the next IP address is tried implicitly. We didn't dare to apply this monkey patch, as it would affect all TCP sockets created in Ruby, and we are also not sure whether a contribution to the httpclient gem would be feasible.

  2. We have not gone down this road so far.

  3. That's the purpose of this PR. It can be seen as a workaround for the deficient Ruby HTTP client, but it prevents the issues we are seeing with unreachable VMs.

[1]

module DefaultConnectTimeout
  DEFAULT_CONNECT_TIMEOUT = 2 # seconds

  # Pass a default connect_timeout (available since Ruby 3.0) unless the caller
  # provides one, so connects to unreachable hosts fail fast.
  def initialize(remote_host, remote_port, local_host = nil, local_port = nil, connect_timeout: nil)
    super(remote_host, remote_port, local_host, local_port, connect_timeout: connect_timeout || DEFAULT_CONNECT_TIMEOUT)
  end
end

# Prepend so the wrapper above runs in front of TCPSocket's own initialize.
class TCPSocket
  prepend DefaultConnectTimeout
end
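
To make the effect of [1] concrete, here is a tiny illustration (assuming the patch above has been loaded): a plain socket connect to an unreachable address now gives up after DEFAULT_CONNECT_TIMEOUT seconds with Errno::ETIMEDOUT instead of hanging for the much longer default; in httpclient this is what makes an unreachable IP fail fast enough that the next resolved IP gets tried, as described above. The 203.0.113.1 address is just a documentation-range stand-in for an unreachable diego-api VM.

require 'socket'

begin
  # With DefaultConnectTimeout prepended to TCPSocket, this connect attempt is
  # bounded by DEFAULT_CONNECT_TIMEOUT (2 seconds).
  TCPSocket.new('203.0.113.1', 8889)
rescue Errno::ETIMEDOUT, Errno::EHOSTUNREACH => e
  puts "connect failed quickly: #{e.class}"
end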

@jpalermo (Member) commented:
Thanks for those details, @philippthun. That does make it seem less like a Diego problem than I was saying, and I'm no longer thinking the problem should be solved elsewhere.

It's unfortunate that the timeout isn't configurable in the http client; that would be an ideal solution.

@sethboyles (Member) commented:
I think it's worth evaluating other HTTP clients to see if there is support for this before adding a workaround on our side.

@sethboyles (Member) commented:
httpclient hasn't had a release in 6 years: https://rubygems.org/gems/httpclient

@will-gant (Contributor, Author) commented:
OK, given the mixed reaction to this I'm going to close it, and we'll take a look at alternative HTTP clients.

@will-gant (Contributor, Author) commented:
Picking this story up again. Just noting that, as far as I can tell, there doesn't seem to be any maintained Ruby HTTP client that, for a domain that resolves to multiple IPs, will try the next IP in the list if the first one it tries times out.

So I think we'll just need to resolve the DNS ourselves as @mkocher suggested.
