
draft of multicluster phase 2 design #31

Open

adleong wants to merge 3 commits into master

Conversation

@adleong (Member) commented Jun 19, 2020

Signed-off-by: Alex Leong [email protected]

@alpeb (Member) commented Jun 19, 2020

Have you considered using a CRD to handle each "link" against a target cluster? That would give us easy config validation, and target cluster properties would become queryable through kubectl.
We could also go as far as handling link creation/teardown and the overall lifecycle through webhooks...
Thoughts?

@adleong (Member, Author) commented Jun 19, 2020

@alpeb Thanks for bringing this up, it's a good idea to consider.

Can you elaborate a bit more on how handling the lifecycle through webhooks would look?

After my experience with ServiceProfiles, I have to say I'm not a huge fan of CRDs. client-go handles them using generated code which I've found to be very brittle and extremely difficult to version. Any change to the CRD makes it extremely difficult, or in some cases impossible, to upgrade gracefully.

In this case specifically, this config is consumed only by the service-mirror-controller and is not involved in any coordination between components. This means that we can have the service-mirror-controller be responsible for parsing and validation. Since we're in total control of the parsing and validation logic, we can make it flexible enough to gracefully handle upgrades and changes to the schema.
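
For illustration, a minimal sketch (not the actual controller code) of the controller owning its own parsing and validation of a ConfigMap; the ConfigMap key and field names here are assumptions:

```go
package mirror

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/yaml"
)

// linkConfig is a hypothetical shape for the per-target-cluster config.
type linkConfig struct {
	TargetClusterName string `json:"targetClusterName"`
	GatewayAddress    string `json:"gatewayAddress"`
}

// loadLinkConfig reads the ConfigMap and applies whatever validation and
// defaulting the controller chooses, independent of any CRD schema.
func loadLinkConfig(ctx context.Context, client kubernetes.Interface, ns, name string) (*linkConfig, error) {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var cfg linkConfig
	// The controller owns parsing, so unknown or missing fields can be
	// handled gracefully across versions.
	if err := yaml.Unmarshal([]byte(cm.Data["config"]), &cfg); err != nil {
		return nil, err
	}
	if cfg.TargetClusterName == "" {
		return nil, fmt.Errorf("targetClusterName is required")
	}
	return &cfg, nil
}
```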

@alpeb (Member) commented Jun 22, 2020

I was thinking more of simply having a CRD without the API aggregation, so there's no need to get involved with the client-go autogen code. The controller would still need to be something like our current service-mirror controller, which would subscribe as a webhook listener to all the CRUD events happening to the CRD instances. Upon the CRD creation event, the endpoints and probing would need to be set up just like in Proposal 1 in this RFC. Upon the CRD deletion event, the endpoints would be torn down (btw, what is the proposed mechanism for teardown in Proposal 1?). As in the current proposal, using a CRD saves us from having to deal with watchers. But we could also choose not to use webhooks and keep using watchers. Also note this implies sticking with the current single-controller approach.

The main advantage, in my view, of a CRD vs. a ConfigMap is that changes made to it are validated immediately: if the user makes a mistake, the resource is rejected upon applying it, whereas errors in the ConfigMap have to be surfaced elsewhere.

As for versioning, I'm out of the loop regarding our current difficulties, and whether that's relevant only to API aggregation. We can have a very minimal OpenAPI schema for the CRD, and the service mirror controller (or a separate controller) can act as a validating webhook that we can evolve as needed.

And then kubectl can query these resources, and `kubectl get` can have columns showing things like the target cluster name, the service selector, etc.
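
For illustration, a minimal sketch (not an actual implementation) of the kind of validating webhook described above: the controller receives raw AdmissionReview bytes and rejects a Link missing required fields. The `targetClusterName` field and the `/validate` path are assumptions.

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validateLink allows or rejects a Link resource at admission time.
func validateLink(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "could not decode AdmissionReview", http.StatusBadRequest)
		return
	}

	// Unmarshal the raw object into an ad-hoc struct; no generated client code needed.
	var link struct {
		Spec struct {
			TargetClusterName string `json:"targetClusterName"`
		} `json:"spec"`
	}
	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if err := json.Unmarshal(review.Request.Object.Raw, &link); err != nil || link.Spec.TargetClusterName == "" {
		resp.Allowed = false
		resp.Result = &metav1.Status{Message: "spec.targetClusterName is required"}
	}

	review.Response = resp
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", validateLink)
	// A real admission webhook must serve TLS; plain HTTP here only keeps the sketch short.
	http.ListenAndServe(":8443", nil)
}
```

A ValidatingWebhookConfiguration pointing at this endpoint would make a bad Link fail at `kubectl apply` time, which is the instant-feedback property being argued for here.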

@grampelberg (Contributor) commented

I like using CRDs when we start having more than one piece of data that requires configuration. As we'll need a couple here, this feels like a good improvement.

@adleong (Member, Author) commented Jun 22, 2020

Having instant validation via a validating webhook and having more useful data in kubectl get are definitely compelling reasons to use a CRD for this.

It looks like the service mirror controller could probably use the dynamic client package to get the custom resource as an Unstructured. Is that what you had in mind, @alpeb?

I would still prefer to have a separate service mirror controller to handle each target cluster. But it could certainly load its configuration from a custom resource instead of a ConfigMap.
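
For reference, a rough sketch of what reading the Link with the dynamic client as an Unstructured could look like; the group/version/resource and field path are assumptions, not final names:

```go
package mirror

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// linkGVR identifies the (hypothetical) Link custom resource.
var linkGVR = schema.GroupVersionResource{
	Group:    "multicluster.linkerd.io",
	Version:  "v1alpha1",
	Resource: "links",
}

// getTargetClusterName fetches the Link and pulls a field out of the untyped map,
// so no generated client code is required.
func getTargetClusterName(ctx context.Context, client dynamic.Interface, ns, name string) (string, error) {
	link, err := client.Resource(linkGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	cluster, _, err := unstructured.NestedString(link.Object, "spec", "targetClusterName")
	return cluster, err
}
```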

To answer your question about link teardown: teardown involves deleting all of the resources installed by `link` (the service mirror controller, config, credentials secret, and RBAC) as well as the mirror resources (mirror services, mirror endpoints, the gateway mirror, and gateway mirror endpoints). If we use a label on all of these resources, it should be fairly straightforward:

kubectl --context=source delete -A (huge list of resource types) -l multicluster.linkerd.io/remote-cluster=target

But this is a tricky, cumbersome command that includes a huge list of resource types. To make this easier, we could bundle this functionality into an `unlink` command which outputs the manifests to delete, just like `linkerd uninstall` does.

linkerd --context=source multicluster unlink --cluster-name=target | kubectl --context=source delete -f -

@alpeb (Member) commented Jun 23, 2020

> It looks like the service mirror controller could probably use the dynamic client package to get the custom resource as an Unstructured. Is that what you had in mind, @alpeb?

Requests to webhooks are raw bytes, so I was thinking of simply unmarshalling that into some ad-hoc struct, but Unstructured might be cleaner indeed.


```
linkerd --context=source multicluster unlink --cluster-name=foobar | kubectl delete -f -
```

Member:

What's the behavior if no name is specified?

Are all clusters unlinked? That would be bad 'default' behavior...

Should this require a list of cluster names, e.g. `unlink foobar`?

Member Author:

Just as in the link command, the --cluster-name flag is mandatory.

We also describe an explicit --all-clusters flag below.

Comment on lines +187 to +190
## Upgrading

To upgrade the service mirror controller, simply run the link step again with a
newer version of the CLI:
Member:

What are the dependencies between clusters? I.e., if I've got a gateway (target) cluster at version N and a mirror (source) cluster at version M, what are the things that must be common to both versions? Could these change? How will they change?

I also think this is oversimplified. How do you upgrade the CRD, for instance? Can't we "break" existing mirror configurations with changes to this CRD?

Member Author:

It's hard to speculate about dependencies between clusters without knowing in advance WHAT will change. For example, the `link` command extracts certain information from the target cluster, such as the gateway address and the service account token. If the location or format of this information has changed, this will not work.

I suspect that you would have to upgrade the target cluster first in this case by running `linkerd mc install` before running `link` again. But I'm not sure this speculation is very useful without knowing in advance what is changing.

We avoid using generated code to interact with the CRD specifically to avoid issues during upgrades. The service mirror controller will read the Link resource using a dynamic client that gets the contents as an untyped map. This allows us to build in our own backwards compatibility and ensure that, if the format of the CRD changes, the service mirror controller is able to handle both the old and new formats.
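
As an illustration of that backwards-compatibility approach, a hedged sketch that accepts both a hypothetical old and new field layout from the untyped map; the field names are made up for the example:

```go
package mirror

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// gatewayAddress tolerates a schema change by checking the new layout first
// and falling back to the old one.
func gatewayAddress(link *unstructured.Unstructured) (string, error) {
	// New (hypothetical) format: spec.gateway.address
	if addr, found, err := unstructured.NestedString(link.Object, "spec", "gateway", "address"); err == nil && found {
		return addr, nil
	}
	// Old (hypothetical) format: spec.gatewayAddress
	if addr, found, err := unstructured.NestedString(link.Object, "spec", "gatewayAddress"); err == nil && found {
		return addr, nil
	}
	return "", fmt.Errorf("link %s has no gateway address in any known format", link.GetName())
}
```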

```
linkerd --context=target multicluster link --cluster-name=foobar | kubectl --context=source apply -f -
```

Alternatively, we can add an `update` subcommand to `multicluster` which finds
Member:

upgrade?

Member Author:

yes, thank you

Comment on lines +181 to +184
```
kubectl --context=source get links
NAME     NUM SVC   ALIVE   LATENCY_P50   LATENCY_P95   LATENCY_P99
foobar   3         True    275ms         385ms         397ms
```
Member:

This might delve a little into the implementation design phase, but I just wanted to dig a little bit and throw out some high-level ideas on how this could be addressed...

As I understand it, extra printer columns are defined so that they show the values of some properties of the CRD instance, retrieved through a jsonpath expression. Stuff like the target cluster name and the service selector are readily available. But metrics like latencies would need to be actively surfaced into the CRD instance, as status properties I guess (is this what the status subresource is for?). The probe worker itself could update that status as it updates the latency Prometheus metrics.
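
A speculative sketch of that idea: the probe worker writes its latest latency measurement into the Link's status subresource (which would need to be enabled on the CRD) so an extra printer column can display it. The GVR and the status field name are assumptions.

```go
package mirror

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var linkGVR = schema.GroupVersionResource{
	Group:    "multicluster.linkerd.io",
	Version:  "v1alpha1",
	Resource: "links",
}

// recordProbeLatency surfaces the probe worker's p50 latency into .status,
// where a jsonpath-based extra printer column could pick it up.
func recordProbeLatency(ctx context.Context, client dynamic.Interface, ns, name, p50 string) error {
	link, err := client.Resource(linkGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := unstructured.SetNestedField(link.Object, p50, "status", "latencyP50"); err != nil {
		return err
	}
	// UpdateStatus only touches the status subresource, leaving spec alone.
	_, err = client.Resource(linkGVR).Namespace(ns).UpdateStatus(ctx, link, metav1.UpdateOptions{})
	return err
}
```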

adleong added a commit to linkerd/linkerd2 that referenced this pull request Jul 23, 2020
This PR moves the service mirror controller from `linkerd mc install` to `linkerd mc link`, as described in linkerd/rfc#31. For fuller context, please see that RFC.

Basic multicluster functionality works here, including:
* `linkerd mc install` installs the Link CRD but not any service mirror controllers
* `linkerd mc link` creates a Link resource and installs a service mirror controller which uses that Link
* The service mirror controller creates and manages mirror services, a gateway mirror, and their endpoints.
* The `linkerd mc gateways` command lists all linked target clusters, their liveness, and probe latencies.
* The `linkerd check` multicluster checks have been updated for the new architecture. Several checks have been rendered obsolete by the new architecture and have been removed.

The following are known issues requiring further work:
* The service mirror controller uses the existing `mirror.linkerd.io/gateway-name` and `mirror.linkerd.io/gateway-ns` annotations to select which services to mirror. It does not yet support configuring a label selector.
* An `unlink` command is needed for removing multicluster links: see #4707
* An `mc uninstall` command is needed for uninstalling the multicluster addon: see #4708

Signed-off-by: Alex Leong <[email protected]>
han-so1omon pushed the same commit to han-so1omon/linkerd2, referencing this pull request, twice on Jul 28, 2020
@olix0r closed this Sep 10, 2020

@olix0r (Member) commented Sep 10, 2020

This was accidentally closed when we changed the default branch to main. The base should probably be updated...

@olix0r reopened this Sep 10, 2020