Support behaviour changes based on URIs with a 'mesh-' prefix #1048

Merged
merged 8 commits into from
Nov 10, 2020

@iamdanfox iamdanfox commented Nov 9, 2020

Before this PR

@whickman and the Network Infra team are working on adopting a service mesh in k8s. Once that happens, we don't want two layers of retrying (once in the mesh and once in dialogue), because the retries would multiply and failures would take correspondingly longer to surface.
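
To make the "multiplicative" concern concrete, here is a rough back-of-the-envelope sketch. The retry counts are made-up numbers for illustration, not dialogue's or any mesh's real defaults, and `StackedRetries` is a hypothetical name.

```java
// Hypothetical illustration: worst-case attempt count with two stacked retry layers.
final class StackedRetries {
    private StackedRetries() {}

    /**
     * Worst-case number of upstream attempts when every client-level attempt
     * is itself retried by the mesh proxy.
     */
    static int worstCaseAttempts(int clientRetries, int meshRetries) {
        return (1 + clientRetries) * (1 + meshRetries);
    }

    public static void main(String[] args) {
        // Assumed numbers: 4 client retries stacked on 2 mesh retries.
        System.out.println(worstCaseAttempts(4, 2)); // (1 + 4) * (1 + 2) = 15
        // With client retries turned off, only the mesh's own attempts remain.
        System.out.println(worstCaseAttempts(0, 2)); // (1 + 0) * (1 + 2) = 3
    }
}
```

If every attempt has to time out before the next one starts, the time to surface a failure grows with the same product.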

After this PR

==COMMIT_MSG==
Prefixing uris with mesh- now turns off all retries.
==COMMIT_MSG==
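
As a rough sketch of the behaviour described in the commit message (not the actual dialogue implementation): the `MESH_PREFIX` value is assumed from the PR title, `stripMeshPrefix` mirrors the helper in this PR's diff, and `isMeshUri` / `effectiveMaxRetries` are hypothetical names introduced only for illustration.

```java
import java.util.List;

// Hypothetical sketch, not dialogue's actual implementation.
final class MeshPrefixSketch {
    // Assumed value, taken from the 'mesh-' prefix named in the PR title.
    static final String MESH_PREFIX = "mesh-";

    private MeshPrefixSketch() {}

    static boolean isMeshUri(String uri) {
        return uri.startsWith(MESH_PREFIX);
    }

    // Same shape as the stripMeshPrefix helper in this PR's diff.
    static String stripMeshPrefix(String input) {
        return input.startsWith(MESH_PREFIX) ? input.substring(MESH_PREFIX.length()) : input;
    }

    // Hypothetical helper: retries drop to zero as soon as any URI is mesh-prefixed.
    static int effectiveMaxRetries(List<String> uris, int configuredMaxRetries) {
        return uris.stream().anyMatch(MeshPrefixSketch::isMeshUri) ? 0 : configuredMaxRetries;
    }

    public static void main(String[] args) {
        List<String> uris = List.of("mesh-https://my-service:8443/api");
        System.out.println(stripMeshPrefix(uris.get(0)));
        System.out.println(effectiveMaxRetries(uris, 4));
    }
}
```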

Some open questions

  • is turning off all retries the right thing to do here? what about SocketExceptions?
  • do we also want to turn off concurrency limiters?
  • shall we add a tag/something to all our metrics/logs/tracing spans so that we can clearly differentiate support tickets that are mesh-related from non-mesh related ones?

Alternative approaches were considered:

  • adding some extra first-class yaml config to ServiceConfiguration. This was rejected because it will be extremely laborious to wire into all our discovery
  • returning a magic header. This would add a lot of complexity to dialogue because clients become a lot more internally mutable, and would potentially flip between mesh and non-mesh based on this header.

Possible downsides?

  • there isn't an integration test in this repository that actually exercises the codepath of sending requests over a service-mesh like envoy/istio...

changelog-app bot commented Nov 9, 2020

Generate changelog in changelog/@unreleased

Description

Prefixing uris with mesh- now turns off retries, as well as host and endpoint-level concurrency limits, as these should be handled by the service mesh proxy.

@policy-bot policy-bot bot requested a review from fawind November 9, 2020 18:35
@iamdanfox iamdanfox requested review from ferozco, whickman and carterkozak and removed request for fawind November 9, 2020 18:35
@@ -81,4 +137,8 @@ default void check() {
                    SafeArg.of("numUris", rawConfig().uris().size()));
        }
    }

    static String stripMeshPrefix(String input) {
        return input.startsWith(MESH_PREFIX) ? input.substring(MESH_PREFIX.length()) : input;
Contributor

the apache client allows custom protocol handlers, so we can avoid any url rewriting, and unsafe messages will include the mesh prefix for easier debugging. wdyt?

Contributor Author

@iamdanfox iamdanfox Nov 9, 2020

So I guess this decision kinda impacts whether this mesh- flag should be allowed to affect stuff inside whatever 'ChannelFactory' we're using (i.e. apache-land), or whether it's purely for enabling/disabling things in dialogue-core.

Personally, I'd prefer to keep this dialogue-internal, because if we're asking more people to make decisions based on this then I think we should find a way of exposing it in a nice typed way.

                .taggedMetricRegistry(
                        VersionedTaggedMetricRegistry.create(rawConfig().taggedMetricRegistry()))
                .build();
    }

    @Value.Derived
    default MeshMode mesh() {
Contributor

can we keep all of the meshmode stuff internal to DialogueChannel? i.e. perform the computation internally instead of having a derived field on the config object?

Contributor Author

@iamdanfox iamdanfox Nov 9, 2020

Sure we could, but then there's a possibility of constructing a Config object which is in a nonsensical state, i.e. one that has mesh- prefixed uris but doesn't have MeshMode.USE_EXTERNAL_MESH set. The reason I think it's appropriate & desirable here is to emphasise to people that this isn't an extra degree of freedom in how you build a Config object; it's purely derived from the URIs that you already put in. This is an example of the "make illegal states unrepresentable" pattern.

Otherwise, to keep this safety, we'd just have to put in a Value.Check that computes the mesh mode.

A third option is to not even have the mesh() param on this object, which would then require plumbing it in everywhere that we currently pass Config cf.
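
To illustrate the "derived, not configurable" idea without pulling in Immutables, here is a minimal plain-Java sketch. `DerivedMeshConfig` is a hypothetical stand-in for Config, `NO_MESH` is an assumed name for the non-mesh enum value (only `USE_EXTERNAL_MESH` appears in this thread), and rejecting a mix of mesh- and non-mesh uris is an assumption rather than something this PR states.

```java
import java.util.List;

// USE_EXTERNAL_MESH comes from the discussion above; NO_MESH is an assumed name.
enum MeshMode {
    USE_EXTERNAL_MESH,
    NO_MESH
}

// Hypothetical stand-in for Config: callers supply only uris, never a mesh mode.
final class DerivedMeshConfig {
    private static final String MESH_PREFIX = "mesh-"; // assumed from the PR title

    private final List<String> uris;
    private final MeshMode mesh;

    DerivedMeshConfig(List<String> uris) {
        this.uris = List.copyOf(uris);
        this.mesh = deriveMeshMode(this.uris);
    }

    List<String> uris() {
        return uris;
    }

    // Derived purely from the uris, so it can never disagree with them.
    MeshMode mesh() {
        return mesh;
    }

    private static MeshMode deriveMeshMode(List<String> uris) {
        long meshCount = uris.stream().filter(uri -> uri.startsWith(MESH_PREFIX)).count();
        if (meshCount == 0) {
            return MeshMode.NO_MESH;
        }
        if (meshCount != uris.size()) {
            // Assumption: a mix of mesh- and non-mesh uris is treated as a configuration error.
            throw new IllegalArgumentException("Cannot mix mesh- and non-mesh uris");
        }
        return MeshMode.USE_EXTERNAL_MESH;
    }
}
```

Because there is no setter or constructor argument for the mesh mode, a config with mesh- uris but a non-mesh mode simply cannot be built.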

Contributor

At least within the scope of this PR it seems like there are only 2 places that actually care about the mesh mode stuff, and in both cases they didn't previously require the entire Config object to be initialized. So it doesn't seem like we're saving much by pushing the mesh-mode stuff into the config object, and it just exposes additional API on the Config.

Contributor Author

@iamdanfox iamdanfox Nov 9, 2020

Config is package private, so the only people who will be affected by this decision will be us / future maintainers of dialogue.

Back in April I found dialogue channel had grown to become really hard to read, so I condensed all of the 'inputs' into this one Config object. That enables the pattern where Channels are created with this convention:

HostMetricsChannel.create(cf, channel, uri);
QueuedChannel.create(cf, endpoint, limited);
DeprecationWarningChannel.create(cf, channel, endpoint);
TimingEndpointChannel.create(cf, channel, endpoint);

In each of these cases, Config cf is only plumbed to the static factory method, so the actual Channel's constructor has only the bits it really needs (e.g. a clock or a channel name string), which makes it convenient to supply exactly what you need for unit-testing.

It might seem minor, but I do feel kinda strongly about this... I think this convention helps us minimize plumbing, keeping the top level DialogueChannel as concise as possible so that you can follow which channels are wired into which other channels without getting distracted.
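
A minimal sketch of that convention, under stated assumptions: `ConfigStandIn`, `SimpleChannel` and `TimingChannelSketch` are hypothetical simplified types, not dialogue's real Config, Channel or TimingEndpointChannel.

```java
import java.time.Clock;

// Hypothetical stand-in for the package-private Config object.
final class ConfigStandIn {
    private final Clock clock;
    private final String channelName;

    ConfigStandIn(Clock clock, String channelName) {
        this.clock = clock;
        this.channelName = channelName;
    }

    Clock clock() {
        return clock;
    }

    String channelName() {
        return channelName;
    }
}

// Hypothetical, heavily simplified channel abstraction.
interface SimpleChannel {
    String execute(String request);
}

final class TimingChannelSketch implements SimpleChannel {
    private final SimpleChannel delegate;
    private final Clock clock;
    private long lastDurationMillis;

    // The constructor takes only what this channel needs, which keeps unit tests small.
    TimingChannelSketch(SimpleChannel delegate, Clock clock) {
        this.delegate = delegate;
        this.clock = clock;
    }

    // Only the static factory sees the whole config object.
    static TimingChannelSketch create(ConfigStandIn cf, SimpleChannel delegate) {
        return new TimingChannelSketch(delegate, cf.clock());
    }

    @Override
    public String execute(String request) {
        long start = clock.millis();
        try {
            return delegate.execute(request);
        } finally {
            lastDurationMillis = clock.millis() - start;
        }
    }

    long lastDurationMillis() {
        return lastDurationMillis;
    }
}
```

A unit test can then call the constructor directly with a fixed Clock and a lambda delegate, without building a full config.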

@bulldozer-bot bulldozer-bot bot merged commit 9f407bd into develop Nov 10, 2020
@bulldozer-bot bulldozer-bot bot deleted the dfox/mesh-prefix branch November 10, 2020 19:14
@iamdanfox
Contributor Author

For future readers: @whickman shared the relevant Quip doc where all these options were evaluated: https://palantir.quip.com/WrjDAIQZNRbT
