
Multi-Cluster Services API #1646

Merged · 10 commits · May 20, 2020

Conversation

JeremyOT
Member

Create a provisional KEP for MC services based on SIG conversations and
this original doc: http://bit.ly/k8s-mc-svc-api-proposal

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 30, 2020
@k8s-ci-robot
Contributor

Welcome @JeremyOT!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @JeremyOT. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 30, 2020
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 30, 2020
@JeremyOT
Member Author

/sig multicluster
/cc @pmorie @thockin

Looks like sig-multicluster doesn't have a KEPs folder yet, so maybe that needs to get set up with owners first

@k8s-ci-robot k8s-ci-robot added the sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. label Mar 30, 2020
#### Terminology

- **supercluster** - a placeholder name for a group of clusters with a high degree of mutual trust and shared ownership that share services amongst themselves.
- **mcsd-controller** - a controller that syncs services across clusters. There may be multiple implementations, this doc describes expected common behavior.
Contributor

Would you mind elaborating on the meaning of mcsd here?

Member Author

updated this explanation

@JeremyOT JeremyOT requested a review from wojtek-t March 30, 2020 16:54
@smawson left a comment

Some peanut gallery questions and comments, overall this looks very nice!


The Multi-Cluster Services API aims to fix these problems.

### Goals

Is there an explicit goal to extend the topology model to support cross-cluster topology concerns? For example region/zone affinity?

Member Author

No explicit extension there - the hope is that whatever K8s does for service topology within a cluster can extend across clusters. For example, zonal affinity is already something that matters within clusters. The alpha Service Topology API would already seem to just work with multiple clusters, and recent sig-network discussion about simplifying topology to do the right thing with less configuration would apply here as well
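For context, a minimal sketch of the alpha Service Topology API mentioned above (the `topologyKeys` field, gated by the `ServiceTopology` feature gate at the time); the idea is that the same preference list could in principle be honored across clusters without any MCS-specific extension:

```yaml
# Sketch only: the alpha topologyKeys field as it existed around the time of
# this discussion; not something introduced by this KEP.
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  namespace: my-ns
spec:
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
  topologyKeys:
  - "topology.kubernetes.io/zone"   # prefer endpoints in the caller's zone
  - "*"                             # otherwise fall back to any endpoint
```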

-->
- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.

I think we can't leave this out, we at least have to define some semantics here. Otherwise it is potentially possible to bypass network policy controls by using multi-cluster services.

I think at a minimum there needs to be some discussion on the semantics of existing network policy and multi-cluster services. At least ip-based policies should continue to work right? But namespace-based selectors probably won't...

Member Author

That's a fair point. As you say, ip-based policies should continue to work but selectors won't be able to cross the cluster boundary. I think multi-cluster network policy is critical and needs to be figured out, but I was hoping we can start to define how to extend Service first. Users will definitely want policy, but there are some mechanics that we're missing - e.g. how can we support selectors against backends in another cluster?

As for any possibility of circumventing policy in the meantime, I think that would be a risk the service owner needs to be aware of. To export an existing service you'd need to have access to the service namespace which I hope means you have permission to weigh the risk and make the decision.

Let me think about this and try to fill out some thoughts on what this means for existing policy
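To make the distinction concrete, a hedged sketch (the CIDR and labels are hypothetical): an `ipBlock` rule covering a peer cluster's pod CIDR keeps matching because it only looks at source IPs, while a `namespaceSelector` rule is evaluated against namespaces in the local cluster only and so never admits traffic from a same-named namespace elsewhere:

```yaml
# Sketch: ip-based rules can keep working across clusters; namespace/pod
# selectors cannot see backends in other clusters.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-peer-cluster-pods
  namespace: my-ns
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - ipBlock:
        cidr: 10.128.0.0/14      # hypothetical pod CIDR of a peer cluster
    - namespaceSelector:         # only matches namespaces in the local cluster
        matchLabels:
          team: frontend
```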

Member

Cilium ClusterMesh does global services and "supercluster" network policy.
They replicate state to cluster-local etcd instances.

It may be worth reviewing and/or citing their implementation for inspiration.


Some semantics exploration was done months ago across sig-multicluster and sig-network: https://docs.google.com/document/d/1_QzuasJPiQ-4t8tUGODoRS2E-Q3KLJajqMyCvLPH2tY/edit

Also, one implementation we started in Submariner: https://github.com/submariner-io/coastguard - still not published, it's just a POC implementing what the document describes, and heavily based on kubefed...

Contributor
@khenidak May 15, 2020

I think the service topology work also needs to be called out here.

Another CRD called `ImportedService` will be introduced to store connection
information about the `Services` in each cluster, e.g. topology. This is
analogous to the traditional `Service` type in Kubernetes. Each cluster will
have an `ImportedService` for each `Service` that has been exported within the

Maybe clarify that this is per unique service name (namespace+name), and not per service resource. So if you have the "same" service in 5 different clusters each cluster will only have a single ImportedService representing it.

Member Author

Good catch - clarified
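To illustrate the uniqueness rule, a purely hypothetical `ImportedService` (the resource is later renamed `ServiceImport` in this PR; the API group, version, and fields here are assumptions, not the KEP's final schema): one object per namespace+name, no matter how many clusters export the underlying `Service`:

```yaml
# Sketch: a single ImportedService named like the exported Services, backed by
# EndpointSlices from every exporting cluster.
apiVersion: multicluster.k8s.io/v1alpha1   # assumed API group/version
kind: ImportedService
metadata:
  name: my-svc        # same name as the exported Services in every cluster
  namespace: my-ns    # same namespace, per namespace sameness
spec:
  ports:
  - protocol: TCP
    port: 80
  ip: 42.42.42.42     # supercluster VIP, local to the importing cluster
```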


If multiple clusters export a `Service` with the same namespaced name, they will
be recognized as a single combined service. The resulting `ImportedService` will
reference endpoints from both clusters. Properties of the `ImportedService`

endpoint reference goes the other way right? EndpointSlice refers to the ImportedService. There will be an EndpointSlice for each cluster that has exported the service, all pointing to the same ImportedService.

Member Author

That's right, reference isn't the right word here, I'm not trying to dig into mechanics yet. I've loosened the language a bit. The method of association between EndpointSlices and ImportedService is left for the design details


```
<<[UNRESOLVED]>>
Due to additional constraints that apply to stateful services (e.g. each cluster

A potential solution here is to allow statefully-named supercluster services. Originally I was thinking cluster name but if you want to support migration it should be something else.

This is similar to a feature to allow arbitrary supercluster names which we could add later (as a field on ExportedService for example). Then the controller would just aggregate based on supercluster name, with the default name being name.namespace.svc.supercluster.local if unspecified.

Member Author

What would a statefully-named supercluster service look like? When you say it would be a field on the export, are you thinking it would be an alias for the service name - as in exportName: "my-alias" is exported as my-alias.namespace.svc.supercluster.local? That seems like it could solve e.g. export DB-A from cluster A and DB-B from cluster B.

What if I want to run something like Cassandra across multiple clusters? Then I may want a single my-cassandra.ns.svc.supercluster.local in front of the aggregate service. In that instance, it could be solved by requiring that users just use different names (cluster name prefix?) for the stateful set in each cluster, but it would be much better if we could do something that didn't require different configuration in each cluster.

Member

For these stateful services, I think we need to get some real use-cases before we can pin it down.

https://github.com/kubernetes/dns/blob/master/docs/specification.md says

"""
A unique, system-assigned identifier for the endpoint. The exact format and source of this identifier is not prescribed by this specification. However, it must be possible to use this to identify a specific endpoint in the context of a Service. This is used in the event no explicit endpoint hostname is defined.
"""

Assuming we stay similar, this means that DNS must provide a unique name to every endpoint of every supercluster service. It does not (yet) prescribe what that name should be. Given that clusters are uncoordinated today in this regard, we should not assume they will become coordinated. The "obvious" solution is to use cluster-name as part of the name.

E.g. Service 's' has 2 endpoints in cluster "cla" and 2 endpoints in cluster "clb". The supercluster DNS would have 1 VIP record s.<ns>.<suffix> and 4 individual endpoint records <uniq1>.cla.s.<ns>.<suffix>, <uniq2>.cla.s.<ns>.<suffix>, <uniq3>.clb.s.<ns>.<suffix>, <uniq4>.clb.s.<ns>.<suffix>. This would allow us to respect a Pod's requested "hostname" (but only within the context of its cluster name).

Statefulset services COULD follow suit, but now I question whether the "within the context of its cluster" makes any sense. If so, good, problem solved (?). If not, then the whole model needs to be rethought.

Absent real requirements this is hard, so let's keep this as an open item - what are the expectations on supercluster DNS?

Member Author

Supercluster DNS needs some definition for sure.

To further this discussion, I've been leaning toward this obvious cluster-scoped model. It seems like <hostname>.<cluster_id>.<svc>.<ns>.* would provide necessary flexibility. Each pod gets its own unique name supercluster-wide but we still respect the hostname. It does of course mean that you can't connect to a specific pod with only its hostname and service name, but would there be a situation where that's needed and you don't also know the pod's cluster? If you had supercluster-unique hostnames, it would still be possible to look up the pod's FQDN via SRV records

Member

Two thoughts:

  1. Careful with "cluster ID" vs "cluster name". There is/was a proposal to formalize ClusterID, and I don't know if that is what you mean. A "group-unique cluster identifier or name" is probably sufficient.

  2. Elsewhere I suggested MAYBE <svc>.s.<ns>.<suffix> to leave room for other kinds of names. E.g. DNS for Gateway is not fully locked down.

Member Author

  1. I mean the "group-unique cluster identifier or name" - I'm adding cluster_name to the terminology section and will use that moving forward so it's clear

  2. I think that structure generally makes more sense. My take is that this needs to be driven from Service, and MCS can be co-developed to adopt improvements to Service. I'm worried about trying to independently improve DNS for MCS and diverging from Service. .svc. is not ideal, but it's consistent and there's no room for confusion coming from single cluster services

these will be considered the same service and will be combined at the
supercluster level.

This requires that within a supercluster, a given namespace is governed by a

I think we should mention somewhere that we can expand this later to support explicit control over mapping, but that the proposal is to start with identity mapping and carefully add more configurability as needed. For example we could support the arbitrary name mapping as I noted above, but that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names). Similarly we may want to offer control over what namespace a particular cluster is allowed to export services from. So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.

Anyway too much detail but maybe a little wording here that this proposal allows for those types of things to come in the future.

Member Author

> that may require some notion of RBAC control over the mapped-into names (otherwise it would be easy for malicious users to inject themselves into others' names)

Yeah, this is one of the big draws for name-mapped exports. As soon as you allow cross-namespace exporting (either as source or target), we have to start creating custom RBAC. With name-mapping it's easy enough to disallow creation of ServiceExport resources.

> we may want to offer control over what namespace a particular cluster is allowed to export services from

This is a good point. The way I've been thinking about this would be namespace level control over whether or not exports are allowed, but there could be a cluster level control over whether or not that namespace is allowed in the cluster. Would there be a case that you'd want to actively use a namespace in all clusters, but only allow exports from certain clusters?

> So cluster A and cluster B are in a supercluster together, and cluster A only trusts namespace Foo from cluster B, and not namespace Bar. So even if cluster B has an ExportedService in namespace Bar cluster A would not import those endpoints.

In my model, this would mean that cluster A does not trust namespace Bar and doesn't have anything specifically to do with cluster B. Since a given namespace should have a single owner throughout the supercluster, I take this to mean cluster A doesn't trust ns Bar's owner and as a result nothing in cluster A is allowed to access anything in (or deploy anything to) namespace Bar. This seems reasonable to me, so long as namespace Bar isn't used for a different purpose/by a different team in cluster A.

What if it was specified that so long as namespace sameness is respected, an implementation could offer to exclude subsets of clusters from importing services in a given namespace? I wonder though, in this scenario, does anyone trust Bar across cluster boundaries? If no, then maybe this comes back to just prohibiting Bar from creating exports.

-->
#### Terminology

- **supercluster** - a placeholder name for a group of clusters with a high

Not sure if I saw it in this doc, but we need to define exclusivity semantics. Specifically is a supercluster an exclusive group, or does each cluster decide what other clusters are in its supercluster? Is asymmetry allowed or is symmetry enforced? (if cluster A thinks cluster B is in its supercluster, cluster B thinks cluster A is in its supercluster)

Member Author

Good point. We've been assuming it's symmetric and transitive. I've made that explicit. If cluster A thinks it's in a supercluster with B and C, then all three clusters agree.

When a `ServiceExport` is created, this will cause a domain name for the
multi-cluster service to become accessible from within the supercluster. The
domain name will be
`<service-export-name>.<service-export-namespace>.svc.supercluster.local`.

Need to determine:

  1. If the supercluster.local suffix can be overridden, and if so on a per-cluster or supercluster basis.
  2. What happens if the user happens to have used "supercluster.local" as their suffix for their cluster.

For #2, maybe it should not be .svc. and instead something else? Then there would be no potential for conflict. .ssvc.cluster.local? .super.cluster.local?

Member Author

Yeah this is a tough one. I think this needs further discussion. In an ideal world we would be able to reserve a name here, but that is not our world :). First thoughts:

  1. I think ideally we'd have some consistent standard name here, and you could create an alias if needed. I don't see why this couldn't be per-cluster as with the VIP - you'd likely want it to be supercluster level, but that's up to the implementation since each cluster consumes a service independently based solely on local information.

  2. Maybe something.cluster.local is a solution. It would avoid conflicts, but semantically it feels like a big stretch for cluster.local as it's explicitly not cluster local.

Member

I was dreading coming to this.

First, we should not perpetuate .local for this. There ARE some reserved TLDs that we can use: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 calls them "user-assigned" and sets aside x[a-z]. So we could choose "supercluster.xx" or "mesh.xl" or something more creative.

Nothing we can do, short of using cluster.local, can prevent us picking a name someone has used. So we either go forth cautiously with a name that we think is exceedingly unlikely to be in use (as we did with cluster.local) or we plan to allow this to be configurable. It's already obnoxious to find the suffix of your local cluster. This will be even worse, unless sig-multi-cluster can normalize cluster-ID and group-membership metadata (which seems useful to me).

We don't really need the "svc" literal in there do we? Why not <svc>.<ns>.<suffix> ? Do we really believe we'll have supercluster names for non-services? If so, I'll propose <svc>.s.<ns>.<suffix> (because I have proposed the same structure for a reboot of in-cluster DNS).

See other comments regarding endpoint names.

We will need to come up with a carefully worded spec for supercluster DNS. Can you please put that as a graduation criteria?

Member Author

Agree with all points. I'll absolutely add DNS spec as a graduation criteria.

In the meantime, it seems like a few questions have come up repeatedly in discussions around this KEP:

  • Should the name be fixed or configurable? fixed makes life easier, but users may want customization.
  • Should supercluster services use a new suffix or live under cluster.local (e.g. super.cluster.local)?
  • What should the structure of a supercluster address be?

I don't think we need the svc literal in there, but I don't think we can say at this point that there will never be supercluster names for non-services (though I don't know what they would be). The main reason this proposal has kept the literal is for familiarity with cluster.local names, which seems like a good thing to maintain. I'm all for restructuring in-cluster DNS and supercluster DNS should follow suit, but it seems risky to develop supercluster DNS improvements independently

Member

The main failing of svc.<suffix> is that it is used in search paths (which I do not expect here), which makes for 2 user-controlled tokens which can be ambiguous. It's also WRONG from a hierarchy point of view (services are smaller than namespaces), but I think I'd be OK to keep it for consistency.

Member Author

Completely agree. Consistency is the main driver, but fixing the hierarchy in general would be great and MCS should adopt any improvements

`<service-export-name>.<service-export-namespace>.svc.supercluster.local`.
Requests to this domain name from within the supercluster will resolve to the
supercluster VIP, which points to the endpoint addresses for pods within the
underlying `Service`(s) across the supercluster.
Member

I know that it says that existing Service routing will not change, but can we please add a note here explicitly stating:

  • clients will need to use the supercluster-specific DNS zone to receive multi-cluster routing
  • the existing default svc.cluster.local zone behavior is unchanged even when ImportedServices and matching EndpointSlices exist

I became concerned about suggesting we exclude default/kubernetes and kube-system/kube-dns from being exported, because I incorrectly assumed exporting them would cause undesirable changes to existing behavior until I read the DNS section more carefully.

It may also be beneficial to mention that exporting these special/canonical services is allowed and may have valid use-cases such as dns-aggregation or apiserver-monitoring. (perhaps here or earlier on when discussing use-cases or export/RBAC behavior)

Member Author

Great point - I've added a note to make that clear

What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
- Define specific implementation details beyond general API behavior.
Member

Thanks for the excellent proposal.

This design helps me understand how we could use the existing in-cluster instances of CoreDNS or a supercluster specific deployment of CoreDNS to fulfill the DNS requirements for MCSD.

This is very feasible from cluster-local CoreDNS instances since the ExportedService state is imported/replicated throughout the API servers of the supercluster.

Writing a new CoreDNS plugin that modifies the kubernetes / kubernetai plugin behavior in the style of k8s_external and federation should work well.

Member Author

That's great to hear! Thank you for your thorough comments and feedback!


The Lighthouse plugin for CoreDNS currently has a design that is a very close fit for this proposal. Over the next few weeks we're planning to come up with a PoC implementation of this proposal in Lighthouse.

Member Author

awesome!

`EndpointSlice` is associated with its `Service` in a single cluster. Each
imported `EndpointSlice` will also have a
`multicluster.kubernetes.io/source-cluster` label with a registry-scoped unique
identifier for the cluster.
Member
@stealthybox Apr 5, 2020

Do any existing core cluster-components such as kube-proxy/CoreDNS/kube-dns need to be modified to ignore these new EndpointSlices?
What about ingress-controllers, prometheus / kube-prometheus, and CNI related daemons?
Will using EndpointSlices in this way potentially affect users' scripts, automation, or integrations?

Have we considered also cloning the API into a CRD like ImportedEndpointSlices in the same way we're doing for Services?

This would make the UX for kubectl get ep,importedendpoints match kubectl get svc,serviceexports,importedservices instead of conflating one but not the other.

Member Author

Existing components should not require any modification to ignore these new slices - that was one driver for using ImportedService instead of trying to extend Service. One of the nice features of EndpointSlice compared to Endpoints is that they are not strictly coupled to a Service, so as long as we don't use the kubernetes.io/service-name label, our slices should have no impact on existing components that discover them by association with a Service.
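A hedged sketch of what such an imported slice could look like: it carries the proposal's `multicluster.kubernetes.io/source-cluster` label, deliberately omits the `kubernetes.io/service-name` label, and uses a hypothetical `multicluster.kubernetes.io/service-name` label for association (the actual mechanism is left to the design details):

```yaml
# Sketch: an EndpointSlice synced from cluster "cluster-a"; existing
# controllers ignore it because kubernetes.io/service-name is absent.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-svc-cluster-a-1
  namespace: my-ns
  labels:
    multicluster.kubernetes.io/source-cluster: cluster-a
    multicluster.kubernetes.io/service-name: my-svc   # hypothetical association label
addressType: IPv4
ports:
- protocol: TCP
  port: 80
endpoints:
- addresses:
  - "10.1.2.3"
```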

Member

Modulo scalability aspects (which in general we haven't yet solved), because things like DNS, kube-proxy, etc. watch all endpoints (so they may OOM or the like...)

Member Author

Good point - though existing logic shouldn't be impacted by these new unlabelled slices (at least from the Service API's POV), they will still trigger events and could potentially have noticeable side effects at scale.

`Service` owner who controls whether or not a given `Service` is exported. Using
`ObjectReference` instead would also open the possibility of having multiple
exports acting on a single service and would require more effort to determine if
a given service has been exported.
Member

It may be worth noting somewhere here that allowing users to create ServiceExports in the default namespace will allow people to export the default kubernetes Service.

This is more of an administrator action, but it is very common for app teams and non admins to have wide RBAC in the default namespace.

We should encourage people intending to use an MCSD implementation to be wary of this.
One suggested solution could be to not use the default namespace for apps or create specific RBAC to guard against it.
Otherwise, this seems like a set of defaults that could compose into mistakes and security issues.

Member Author

I agree - default and kube-system both need to be used cautiously. There may be good reasons to allow exports from either, but it would be a best practice to restrict exports without clear justification. I've added a "Restricting Exports" section to clarify that
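Because RBAC is purely additive, "restricting exports" in practice means not granting the create verb broadly. A minimal sketch, assuming the `ServiceExport` CRD lives in a `multicluster.k8s.io` API group (an assumption) and that only a dedicated group may export, and only from an approved namespace:

```yaml
# Sketch: grant ServiceExport creation only to the "svc-exporters" group in
# the "prod-apps" namespace; grant nothing in default or kube-system.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: serviceexport-editor
  namespace: prod-apps
rules:
- apiGroups: ["multicluster.k8s.io"]   # assumed API group for the CRD
  resources: ["serviceexports"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: serviceexport-editors
  namespace: prod-apps
subjects:
- kind: Group
  name: svc-exporters
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: serviceexport-editor
  apiGroup: rbac.authorization.k8s.io
```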


#### Endpoint TTL

To prevent stale endpoints from persisting in the event that a cluster becomes
Member

Another important problem that tends to happen in these scenarios (commonly with BGP peering) is instability or flapping due to problems with the peers, for example because of network hiccups or other connectivity problems. If a peer connects and reconnects constantly it creates a lot of pressure on the clusters, creating and deleting resources. Maybe we should indicate that some protection should be added to avoid this?

I don't think implementation details are necessary right now, but just for reference and completeness: BGP solves this using flap damping (https://tools.ietf.org/html/rfc2439). Any mechanism that guarantees a certain level of stability would be valid.

Member Author

We've purposefully avoided specifying exactly how the supercluster reconciles services between clusters beyond what's necessary to implement the API, but here's an example of how I see this working:

The controller periodically validates connectivity to each cluster in the supercluster through some mechanism, then in each cluster it manages (in a centralized model it would be all clusters, in a distributed model maybe just the local cluster) it would renew the source-cluster-specific lease for each currently connected cluster. The TTL would be some multiple of the validation interval and allow for some flakiness. This way we'd only cleanup endpoints if a source cluster was unreachable for a sustained period.

For the reasons you suggest, I don't think a supercluster controller would want to explicitly delete endpoints for a cluster because it was unreachable - rather let the lease expire in that case. This way there should be no action if there's a temporary disconnect so long as it recovered before the TTL.

What if this instead read:

To prevent stale endpoints from persisting in the event that a cluster becomes
unreachable to the supercluster controller for a sustained period of time, ...

Does that make more sense?
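Purely as an illustration of the lease idea sketched above (the KEP intentionally leaves the mechanism to implementations), a per-source-cluster `Lease` that the controller renews while its connectivity checks pass; imported `EndpointSlices` for a cluster would only be garbage-collected once its lease went unrenewed past the TTL. Names and namespace are assumptions:

```yaml
# Sketch: one Lease per source cluster, renewed by the mcsd-controller.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: source-cluster-a
  namespace: supercluster-system        # hypothetical namespace
spec:
  holderIdentity: mcsd-controller
  leaseDurationSeconds: 300             # TTL: several multiples of the check interval
  renewTime: "2020-05-20T17:30:00.000000Z"
```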

Member

Yeah, I was more concerned about this problem possibly being overlooked, but it's obvious you have it very clear and well designed 😅. The current wording works for me too. Thanks for the clarification.

Member

My big concern here is to quickly weed out endpoints that are not alive. Absent real health-checking, this is hard.

Member

We should figure out how to do better than TTL (see DNS caching)

Member Author

I've been thinking about this, and maybe TTL doesn't need to be part of this spec and instead made a responsibility of the implementation. Even as described, TTL only handles the worst case scenario should a cluster or supercluster controller become unavailable or otherwise stop functioning. That's clearly necessary, but there are various ways to go about it.

An implementation should strive to do better than this - for example, quickly cleaning up endpoints that have been removed from their source cluster and/or actually performing health checking. I don't think it's necessary (or desirable) to define that behavior, as preferences may depend on the environment or prioritize other goals.

Member

Absent a kubernetes-level healthcheck spec, implementations are free to do whatever they want, but there's no consistency. Adding an HC block to Services may be the "real" solution, but that's probably out-of-scope for this KEP.

Member Author

Agreed. That said, what are your thoughts on keeping/removing TTL as part of the spec?

  • We keep it to bound worst-case endpoint cleanup behavior for all implementations, but with the understanding that implementations should strive to do better. This still means inconsistent behavior but at least guarantees that endpoints won't live forever.
  • We remove it and leave cleanup entirely to the implementation. There's even less consistency and we open up the possibility of technically compliant implementations that leave dead endpoints forever.

It seems like a useful thing to have right now, but if we expect e.g. an HC block to be added to Services, TTL may become redundant quickly as MCS adopts that instead


We propose a new CRD called `ServiceExport`, used to specify which services
should be exposed across all clusters in the supercluster. `ServiceExports` must
be created in each cluster that the underlying `Service` resides in. Creation of
Member

The Service object is a key piece of this KEP, but I didn't find any mention of the Service Type. Does this KEP address every Type of Service?

  • ClusterIP: headless and non-headless
  • NodePort
  • LoadBalancer
  • ExternalName

My impression after reading it, and from the mentions of EndpointSlices and the SuperclusterIP, is that it targets ClusterIP services only - is that correct? 😅

Member Author

You're right, this needs some more clarification. This is how I see it:

The resulting SuperclusterIP service that's imported into each cluster is like ClusterIP (no node port, no external LB IP), but the source services don't have to be. The supercluster endpoints are the union of endpoints attached to services exported from each cluster, regardless of how those endpoints are used in-cluster.

Let's start with exporting ClusterIP - this is the straightforward case most of the doc assumes. Each EndpointSlice associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.

Headless - Today's headless services likely don't want a VIP and may not function properly behind one - there's a note a little further down about stateful sets and complications that come with them. We haven't answered what that could look like. It probably doesn't make sense to export a current headless service to the supercluster, it would work, but likely not the way you want.

NodePort and LoadBalancer type services both create ClusterIP services and that's what would sync. For example, if you export a NodePort service, the resulting cross-cluster service will still be a SuperclusterIP type. You could use node ports to access the cluster-local service in the source cluster, but not in any other cluster, and it would only route to local endpoints.

ExternalName really is the outlier here - it doesn't make sense to export an ExternalName service. They can't be merged with other exports, and it seems like it would only complicate deployments by even attempting to stretch them across clusters. Why not just create regular ExternalName services in each cluster that needs them? Perhaps we should just disallow exporting them. What do you think?
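For the straightforward ClusterIP case, a hedged sketch of the pairing in a source cluster: an ordinary `Service` plus a `ServiceExport` that opts it in by sharing its namespace and name (the export carries no spec of its own here; the API group/version are assumptions):

```yaml
# Sketch: exporting an existing ClusterIP Service to the supercluster.
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  namespace: my-ns
spec:
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: multicluster.k8s.io/v1alpha1   # assumed API group/version
kind: ServiceExport
metadata:
  name: my-svc        # must match the Service's name
  namespace: my-ns    # and its namespace
```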

Member

Why not make headless cluster services become headless supercluster services?

Member Author

That would make sense, but mechanically what does that mean? If "headless" is derived from cluster-level services, is a supercluster service headless if any component service is headless? Only if all are? I was thinking along these lines, e.g. we create pod A/AAAA records based on the exporting service's headlessness, but at which point should we switch the service record from VIP to pod IPs? I don't really see a case for mixing and matching headless/non-headless services, so maybe we can just say a supercluster service is headless if all component services are headless

Member

I think we can define resolution, along the lines of "If any same-name exported cluster service has a VIP, the supercluster service also has a VIP. If all same-name exported cluster services are headless, the supercluster service is headless".

You need to be prepared to allocate a supercluster IP if one member service changes, though.

Member

And every ServiceExport that isn't correct needs to get a status, and maybe events.

Member Author

Yeah, that makes sense to me. Adding to the KEP


> Let's start with exporting ClusterIP - this is the straightforward case most of the doc assumes. Each EndpointSlice associated with the exported service is combined with slices from other clusters to make up the supercluster service. They will be imported to the cluster behind the supercluster IP.

As an alternative, why not have one EndpointSlice map to one ClusterIP in one cluster? For a service that exists in N clusters we would have N EndpointSlices, and we'd only need to route packets to each cluster's ClusterIP for the service. From there the target cluster's kube-proxy can route to the pods locally. Client clusters don't need to be aware of pod IPs, which helps at scale.
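To picture this alternative (it is the commenter's suggestion, not what the KEP proposes), the importing cluster would carry one slice per exporting cluster whose only address is that cluster's own ClusterIP, leaving pod-level fan-out to the remote kube-proxy; the IP shown is hypothetical:

```yaml
# Sketch of the suggested alternative: a single endpoint per exporting
# cluster, pointing at that cluster's ClusterIP instead of its pod IPs.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-svc-cluster-b
  namespace: my-ns
  labels:
    multicluster.kubernetes.io/source-cluster: cluster-b
addressType: IPv4
ports:
- protocol: TCP
  port: 80
endpoints:
- addresses:
  - "10.96.14.7"   # cluster-b's ClusterIP for my-svc (hypothetical)
```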


Is there a built-in assumption that all clusters participating in the supercluster are able to route to each other's ClusterIP services (not exposed as NodePort or LB services) and/or pod IPs? Can member clusters continue to have separate network overlays but allow private NodePort/LB service routing?

Member Author

The current assumption is that all participating clusters have mutual connectivity at the Pod level - clusters do not need to resolve each other's clusterIPs, but do need to access each others Pods. The SuperclusterIP introduced here is a VIP, local to the importing cluster, that routes to Pods from exporting clusters.

Connecting directly to a Service LB would start to break down if multiple clusters exported the same service. You'd have multiple LB IPs, one for each cluster, but since there's no concept of weight in kube-proxy, each IP would get an even share of traffic regardless of the number of backing pods. However, I could see configurations where only the specific pods behind an exported service are exposed through a gateway - the work to update Service APIs should help with that

supercluster service DNS implementation must expose not-ready addresses for
slices from that cluster.

### Test Plan
Member
@aojea Apr 9, 2020

You can emulate these scenarios for testing with KIND
cc: @BenTheElder
I've described an example here https://github.com/kubernetes-sigs/kind/pull/1337/files#diff-7906da116e09ff32d7a6c92aaf1113abR63-R69

Member Author

KIND is great :) - I plan to work on this section over the next week

Member Author

Following up here- I've been playing with this for local testing and it works great. Thanks for the tip!

Member

FYI that in v0.8.0 (unreleased but soon) nodes will move from the default bridge to a kind bridge (to get docker user defined network features, namely the embedded DNS which fixes many problems)

shouldn't affect this much hopefully, all nodes / clusters are still peers on the same docker network. kubernetes-sigs/kind#1337 may need some updates.

Member Author

Thanks, good to know - I'll be following the release

cluster required to make on upgrade in order to make use of the enhancement?
-->

### Version Skew Strategy
Member

What will the version skew strategy be here?
A Supercluster x will support individual clusters x-1, x-2, ...?

Member Author

At a high level:

This comes down to the controller. The controller should be upgraded first and support current and N (TBD) previous API versions. It will need to understand how to read exports/services/endpoints from source clusters of any supported version and sync an aggregated service to each cluster in a way that remains as consistent and predictable as possible. Clusters that have not been upgraded yet won't be able to support new capabilities but they will still be in sync retaining previous behavior.

Member

I would just pick a reasonable minimal N and strategy for now. Allowing for greater than minimal required skew post Beta stage seems like asking for trouble.

Member Author

agreed

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Apr 21, 2020
@bowei
Member

bowei commented Apr 24, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 24, 2020
Create a provisional KEP for MC services based on SIG conversations and
this original doc: http://bit.ly/k8s-mc-svc-api-proposal
- Adds a note on the use of ObjectReference to Alternatives
- Clarifies the meaning of MCSD
- Note on intention to use generic conditions once the relevant KEP is
implemented.
- Unresolved: scalability is still an open question
Importing some discussion from the public doc, adds notes on an
unresolved alternative approach using label selectors instead of name
mapping.
note ImportedService uniqueness
Rename based on PR feedback
@JeremyOT
Member Author

Thanks! rebased
/assign @justaugustus

@JeremyOT
Member Author

/unassign @justaugustus

@JeremyOT
Member Author

/assign @pmorie

// +optional
Reason *string `json:"reason,omitempty"`
// +optional
Message *string `json:"message,omitempty"`

@JeremyOT Should Message contain the list of clusters that were successfully synced? Wouldn't it be better to have a list of clusters synced and yet to be synced?

Member Author

The not-yet-synced clusters would make sense to list in the message, as a human operator might use them to help track down a failure. For synced clusters, the implementation/registry have the list of available clusters, and a condition representing a successful sync would mean that all clusters are synced so I'm not sure what added benefit would come from listing them. I'd worry that we could end up having to carry a large number of cluster names on every export status in the happy path.
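As a sketch of that trade-off, a hypothetical `ServiceExport` status (condition type, reason, and API group are illustrative) where the condition stays small and only the clusters still pending sync appear in `message`:

```yaml
# Sketch: a compact failure condition listing only the not-yet-synced clusters.
apiVersion: multicluster.k8s.io/v1alpha1   # assumed API group/version
kind: ServiceExport
metadata:
  name: my-svc
  namespace: my-ns
status:
  conditions:
  - type: Synced               # illustrative condition type
    status: "False"
    reason: PendingClusters
    message: "awaiting sync to clusters: cluster-b, cluster-d"
    lastTransitionTime: "2020-05-20T17:30:00Z"
```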

port: 80
endpoints:
- addresses:
- "10.1.2.3"


Is there an expectation that the clusters should have non-overlapping Pod CIDR? or this endpoint address could be something assigned by the mcsd-controller?

Member Author

The basic case that we've been working with so far is that this would be used in a flat network with non-overlapping Pod CIDR ranges, but these could be alternative addresses assigned by the controller. The important part is that, at least until kube-proxy has some concept of weighting, traffic will be spread evenly between them so they should represent equivalent backends

@pmorie
Member

pmorie commented May 20, 2020

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeremyOT, pmorie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2020
@pmorie
Member

pmorie commented May 20, 2020

This is a great start. Let's merge so we can iterate with the numerous follow-ups that have been discussed. Thanks a lot @JeremyOT

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit a8b2056 into kubernetes:master May 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 20, 2020
@mangelajo

Congratulations @JeremyOT, good work.

Contributor
@deads2k left a comment

just keeping track of thoughts for later.

-->
- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.
Contributor

I think you need to describe what happens with existing policies so that namespace admins who use network policy and then export their services are not surprised by the outcome. I think it's also important to describe whether network policy can be applied to imported services.

I see the following questions

  1. if I export a service with a restrictive network policy, will that policy...
    1. be honored by remote clusters
    2. create a deny-by-default behavior (this would appear safe)
    3. do something else
  2. if I import a service can I apply network policy to it in my local cluster

Even if changes are not proposed, being able to describe what happens is important to deciding whether or not to enable the feature.

expected common behavior.

We propose a new CRD called `ServiceExport`, used to specify which services
should be exposed across all clusters in the supercluster. `ServiceExports` must
Contributor

Please link from here to a description of what happens when this doesn't happen. I see a couple cases

  1. ServiceExport exists in clusterA, but not clusterB. ClusterB does not have a conflicting service
  2. ServiceExport exists in clusterA, but not clusterB. ClusterB does have a conflicting service.

Another CRD called `ServiceImport` will be introduced to store information
about the services exported from each cluster, e.g. topology. This is analogous
to the traditional `Service` type in Kubernetes. Each cluster will have a
corresponding `ServiceImport` for each uniquely named `Service` that has been
Contributor

Is it possible for a cluster to import a service exported by another cluster, but not export that same service? I'm imagining a migration case where I've moved a namespace to a different cluster and no longer provide a local endpoint.

-->
#### Terminology

- **supercluster** - A placeholder name for a group of clusters with a high
Contributor

Can a single cluster be part of multiple superclusters? For instance, can I have one supercluster for my DB2 admins, a different supercluster for my frontend developers, and a third one for corporate IT?

-->
### Exporting Services

Services will not be visible to other clusters in the supercluster by default.
Contributor

If a single cluster is allowed to be part of multiple superclusters, are there selection criteria for which superclusters it is a part of? I would expect not, and that it would be up to which clusters choose to import it (if I have understood it correctly so far).

#### Restricting Exports ####

Cluster administrators may use RBAC rules to prevent creation of
`ServiceExports` in select namespaces. While there are no general restrictions
Contributor

Some more detail here would be appreciated. I'm not sure RBAC is fully sufficient, but recommendations and expectations would be helpful.


#### SuperclusterIP

When a `ServiceExport` is created, an IP address is reserved and assigned to
Contributor

Can you walk through this with an example? I'm having some trouble working out what this means.

When a `ServiceExport` is created, this will cause `EndpointSlice` objects for
the underlying `Service` to be created in each cluster within the supercluster.
One or more `EndpointSlice` resources will exist for each cluster that exported
the `Service`, with each `EndpointSlice` containing only endpoints from its
Contributor

What happens if multiple constituent clusters in the supercluster share IP address space with each other on their pod, service, or node IP ranges?

there is no clear way to determine how a service should be accessed. **If any
global properties have conflicts that can not be resolved, a condition will be
set on the `ServiceExport` with a description of the conflict. The service will
not be synced, and an error will be set on the status of each affected
Contributor

If there are some endpoints that are already exported, will they be forced out by a new and incompatible set of endpoints?

global properties have conflicts that can not be resolved, a condition will be
set on the `ServiceExport` with a description of the conflict. The service will
not be synced, and an error will be set on the status of each affected
`ServiceExport` and any previously-derived `ServiceImports` will be deleted
Contributor

this bit in particular makes it appear as though a new bad actor could bring down a previously functional service. Is there a way to contain the damage instead?
