diff --git a/keps/sig-multicluster/1645-multi-cluster-services-api/README.md b/keps/sig-multicluster/1645-multi-cluster-services-api/README.md
new file mode 100644
index 00000000000..eeb2ad8b8a2
--- /dev/null
+++ b/keps/sig-multicluster/1645-multi-cluster-services-api/README.md
@@ -0,0 +1,925 @@

# KEP-1645: Multi-Cluster Services API

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Terminology](#terminology)
  - [User Stories (optional)](#user-stories-optional)
    - [Different Services Each Deployed to Separate Cluster](#different-services-each-deployed-to-separate-cluster)
    - [Single Service Deployed to Multiple Clusters](#single-service-deployed-to-multiple-clusters)
  - [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Exporting Services](#exporting-services)
    - [Restricting Exports](#restricting-exports)
  - [Exported Service Behavior Expectations](#exported-service-behavior-expectations)
    - [SuperclusterIP](#superclusterip)
    - [DNS](#dns)
    - [EndpointSlice](#endpointslice)
    - [Endpoint TTL](#endpoint-ttl)
    - [Service Types](#service-types)
  - [Consumption of EndpointSlice](#consumption-of-endpointslice)
- [Constraints and Conflict Resolution](#constraints-and-conflict-resolution)
  - [Global Properties](#global-properties)
    - [Service Port](#service-port)
    - [IP Family](#ip-family)
  - [Component Level Properties](#component-level-properties)
    - [Session Affinity](#session-affinity)
    - [TopologyKeys](#topologykeys)
    - [Publish Not-Ready Addresses](#publish-not-ready-addresses)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [ObjectReference in ServiceExport.Spec to directly map to a Service](#objectreference-in-serviceexportspec-to-directly-map-to-a-service)
  - [Export services via label selector](#export-services-via-label-selector)
  - [Export via annotation](#export-via-annotation)
- [Infrastructure Needed (optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

- [ ] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] KEP approvers have approved the KEP status as `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

There is currently no standard way to connect or even think about Kubernetes
services beyond the cluster boundary, but we increasingly see users deploy
applications
across multiple clusters designed to work in concert. This KEP proposes a new
API to extend the service concept across multiple clusters. It aims for minimal
additional configuration, making multi-cluster services as easy to use as
in-cluster services, and leaves room for multiple implementations.

*Converted from this [original proposal doc](http://bit.ly/k8s-mc-svc-api-proposal).*

## Motivation

There are [many
reasons](http://bit.ly/k8s-multicluster-conversation-starter-doc) why a K8s
user may want to split their deployments across multiple clusters, but still
retain mutual dependencies between workloads running in those clusters. Today
the cluster is a hard boundary, and a service is opaque to a remote K8s
consumer that would otherwise be able to make use of metadata (e.g. endpoint
topology) to better direct traffic. To support failover, or temporarily during
a migration, users may want to consume services spread across clusters, but
today that requires non-trivial bespoke solutions.

The Multi-Cluster Services API aims to fix these problems.

### Goals

- Define a minimal API to support service discovery and consumption across
  clusters.
  - Consume a service in another cluster.
  - Consume a service deployed in multiple clusters as a single service.
- When a service is consumed from another cluster, its behavior should be
  predictable and consistent with how it would be consumed within its own
  cluster.
- Create building blocks for multi-cluster tooling.
- Support multiple implementations.
- Leave room for future extension and new use cases.

### Non-Goals

- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.
- Solve mechanics of multi-cluster service orchestration.

## Proposal

### Terminology

- **supercluster** - A placeholder name for a group of clusters with a high
  degree of mutual trust and shared ownership that share services amongst
  themselves. Membership in a supercluster is symmetric and transitive. Member
  clusters are mutually aware, and agree about their collective association.
- **mcsd-controller** - A controller that syncs services across clusters and
  makes them available for multi-cluster service discovery (MCSD) and
  connectivity. There may be multiple implementations; this doc describes
  expected common behavior.

We propose a new CRD called `ServiceExport`, used to specify which services
should be exposed across all clusters in the supercluster. `ServiceExports`
must be created in each cluster that the underlying `Service` resides in.
Creation of a `ServiceExport` in a cluster will signify that the `Service` with
the same name and namespace as the export should be visible to other clusters
in the supercluster.

Another CRD called `ServiceImport` will be introduced to store information
about the services exported from each cluster, e.g. topology. This is analogous
to the traditional `Service` type in Kubernetes. Each cluster will have a
corresponding `ServiceImport` for each uniquely named `Service` that has been
exported within the supercluster, referenced by namespaced name.

If multiple clusters export a `Service` with the same namespaced name, they
will be recognized as a single combined service.
For example, if 5 clusters export
`my-svc.my-ns`, there will be one `ServiceImport` named `my-svc` in the
`my-ns` namespace and it will be associated with endpoints from all exporting
clusters. Properties of the `ServiceImport` (e.g. ports, topology) will be
derived from a merger of component `Service` properties.

Existing implementations of the Kubernetes Service API (e.g. kube-proxy) can be
extended to present `ServiceImports` alongside traditional `Services`.

### User Stories (optional)

#### Different Services Each Deployed to Separate Cluster

I have 2 clusters, each running different services managed by different teams,
where services from one team depend on services from the other team. I want to
ensure that a service from one team can discover a service from the other team
(via DNS resolving to VIP), regardless of the cluster that they reside in. In
addition, I want to make sure that if a service is migrated to another cluster,
the services that depend on it are not impacted.

#### Single Service Deployed to Multiple Clusters

I have deployed my stateless service to multiple clusters for redundancy or
scale. Now I want to propagate topologically-aware service endpoints (local,
regional, global) to all clusters, so that other services in my clusters can
access instances of this service in priority order based on availability and
locality. Requests to my replicated service should seamlessly transition
(within SLO for dropped requests) between instances of my service in case of
failure or removal without action by or impact on the caller. Routing to my
replicated service should optimize for a cost metric (e.g. prioritize traffic
local to zone, then region).

```
<<[UNRESOLVED]>>
Due to additional constraints that apply to stateful services (e.g. each
cluster potentially having pods with the conflicting hostnames `set-name-0`,
`set-name-1`, etc.), we are only targeting stateless services for the
multi-cluster backed use case for now.
<<[/UNRESOLVED]>>
```

### Notes/Constraints/Caveats (optional)

### Risks and Mitigations

## Design Details

### Exporting Services

Services will not be visible to other clusters in the supercluster by default.
They must be explicitly marked for export by the user. This allows users to
decide exactly which services should be visible outside of the local cluster.

Tooling may (and likely will, in the future) be built on top of this to
simplify the user experience. Some initial ideas are to allow users to specify
that all services in a given namespace, all services matching a namespace
selector, or even all services in a cluster should be automatically exported by
default. In that case, a `ServiceExport` could be automatically created for all
`Services`. This tooling will be designed in a separate doc, and is secondary
to the main API proposed here.

To mark a service for export to the supercluster, a user will create a
ServiceExport CR:

```golang
// ServiceExport declares that the associated service should be exported to
// other clusters.
type ServiceExport struct {
  metav1.TypeMeta `json:",inline"`
  // +optional
  metav1.ObjectMeta `json:"metadata,omitempty"`
  // +optional
  Status ServiceExportStatus `json:"status,omitempty"`
}

// ServiceExportStatus contains the current status of an export.
type ServiceExportStatus struct {
  // +optional
  // +patchStrategy=merge
  // +patchMergeKey=type
  // +listType=map
  // +listMapKey=type
  Conditions []ServiceExportCondition `json:"conditions,omitempty"`
}

// ServiceExportConditionType identifies a specific condition.
type ServiceExportConditionType string

const (
  // ServiceExportInitialized means the service export has been noticed
  // by the controller, has passed validation, has appropriate finalizers
  // set, and any required supercluster resources like the IP have been
  // reserved.
  ServiceExportInitialized ServiceExportConditionType = "Initialized"
  // ServiceExportExported means that the service referenced by this
  // service export has been synced to all clusters in the supercluster.
  ServiceExportExported ServiceExportConditionType = "Exported"
)

// ServiceExportCondition contains details for the current condition of this
// service export.
//
// Once [#1624](https://github.com/kubernetes/enhancements/pull/1624) is
// merged, this will be replaced by metav1.Condition.
type ServiceExportCondition struct {
  Type ServiceExportConditionType `json:"type"`
  // Status is one of {"True", "False", "Unknown"}.
  Status corev1.ConditionStatus `json:"status"`
  // +optional
  LastTransitionTime *metav1.Time `json:"lastTransitionTime,omitempty"`
  // +optional
  Reason *string `json:"reason,omitempty"`
  // +optional
  Message *string `json:"message,omitempty"`
}
```
```yaml
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-svc
  namespace: my-ns
status:
  conditions:
    - type: Initialized
      status: "True"
      lastTransitionTime: "2020-03-30T01:33:51Z"
    - type: Exported
      status: "True"
      lastTransitionTime: "2020-03-30T01:33:55Z"
```

`ServiceExports` will be created within the cluster and namespace that the
service resides in and are name-mapped to the service for export - that is,
they reference the `Service` with the same name as the export. If multiple
clusters within the supercluster have `ServiceExports` with the same name and
namespace, these will be considered the same service and will be combined at
the supercluster level.

This requires that within a supercluster, a given namespace is governed by a
single authority across all clusters. It is that authority’s responsibility to
ensure that a name is shared by multiple services within the namespace if and
only if they are instances of the same service.

Most information about the service, including ports, backends and topology,
will continue to be stored in the `Service` object, which is name-mapped to the
service export.

#### Restricting Exports

Cluster administrators may use RBAC rules to prevent creation of
`ServiceExports` in select namespaces. While there are no general restrictions
on which namespaces are allowed, administrators should be especially careful
about permitting exports from `kube-system` and `default`. As a best practice,
admins may want to tightly restrict or completely prevent exports from these
namespaces unless there is a clear use case.

### Exported Service Behavior Expectations

#### SuperclusterIP

When a `ServiceExport` is created, an IP address is reserved and assigned to
this supercluster `Service`. This IP may be supercluster-wide, or assigned on a
per-cluster basis. Requests to the corresponding IP from within a given cluster
will route to endpoint addresses for the aggregated Service.
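
For illustration, a rough sketch of how that reserved VIP might surface in a
consuming cluster through the derived `ServiceImport` (the full type is defined
under [Consumption of EndpointSlice](#consumption-of-endpointslice); the
address shown is hypothetical):

```yaml
# Sketch only: a derived ServiceImport in one consuming cluster. spec.ip is
# the reserved supercluster VIP; requests to it from within this cluster
# route to the aggregated endpoints across the supercluster.
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-svc
  namespace: my-ns
spec:
  ip: 42.42.42.42
```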

Note: this doc does not discuss `NetworkPolicy`, which cannot currently be used
to describe a policy that applies to a multi-cluster service.

#### DNS

When a `ServiceExport` is created, this will cause a domain name for the
multi-cluster service to become accessible from within the supercluster. The
domain name will be `<service>.<ns>.svc.supercluster.local`. Requests to this
domain name from within the supercluster will resolve to the supercluster VIP,
which points to the endpoint addresses for pods within the underlying
`Service`(s) across the supercluster. All service consumers must use the
`*.svc.supercluster.local` name to enable supercluster routing, even if there
is a matching `Service` with the same namespaced name in the local cluster.
There will be no change to existing behavior of the `svc.cluster.local` zone.

#### EndpointSlice

When a `ServiceExport` is created, this will cause `EndpointSlice` objects for
the underlying `Service` to be created in each cluster within the supercluster.
One or more `EndpointSlice` resources will exist for each cluster that exported
the `Service`, with each `EndpointSlice` containing only endpoints from its
source cluster. These `EndpointSlice` objects will be marked as managed by the
supercluster service controller, so that the endpoint slice controller doesn’t
delete them.

```
<<[UNRESOLVED]>>
We have not yet sorted out scalability impact here. We hope the upper bound for
imported endpoints + in-cluster endpoints will be ~= the upper bound for
in-cluster endpoints today, but this remains to be determined.
<<[/UNRESOLVED]>>
```

#### Endpoint TTL

To prevent stale endpoints from persisting in the event that a cluster becomes
unreachable to the supercluster controller, each `EndpointSlice` is associated
with a lease representing connectivity with its source cluster. The
supercluster service controller is responsible for periodically renewing the
lease so long as the connection with the source cluster is confirmed alive. A
separate controller, which may run inside each cluster, is responsible for
watching each lease and removing all remaining `EndpointSlices` associated with
a cluster when that cluster’s lease expires.

#### Service Types

- `ClusterIP`: This is the straightforward case most of the proposal assumes.
  Each `EndpointSlice` associated with the exported service is combined with
  slices from other clusters to make up the supercluster service. They will be
  imported to the cluster behind the supercluster IP.

```
<<[UNRESOLVED re:stateful sets]>>
  Today's headless services likely don't want a VIP and may not function
  properly behind one. It probably doesn't make sense to export a current
  headless service to the supercluster; it would work, but likely not the way
  you want.
<<[/UNRESOLVED]>>
```
- `NodePort` and `LoadBalancer`: These create `ClusterIP` services that would
  sync as expected. For example, if you export a `NodePort` service, the
  resulting cross-cluster service will still be a supercluster IP type (see the
  sketch after this list). You could use node ports to access the cluster-local
  service in the source cluster, but not in any other cluster, and it would
  only route to local endpoints.
- `ExternalName`: It doesn't make sense to export an `ExternalName` service.
  They can't be merged with other exports, and it seems like it would only
  complicate deployments by even attempting to stretch them across clusters.
  Instead, regular `ExternalName` type `Services` should be created in each
  cluster individually.
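
To make the `NodePort` case above concrete, a minimal sketch (the node port
value and app label are hypothetical): exporting the `Service` below requires
only the name-mapped `ServiceExport`; the node port remains a cluster-local
access path and plays no part in the derived supercluster service.

```yaml
# Sketch only: a NodePort Service and its name-mapped export. The derived
# supercluster service is still consumed via the supercluster VIP/DNS;
# nodePort 30080 (hypothetical) only reaches this cluster's local endpoints.
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  namespace: my-ns
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      nodePort: 30080
---
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-svc
  namespace: my-ns
```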

### Consumption of EndpointSlice

To consume a supercluster service, users will use the domain name associated
with their `ServiceExport`. When the mcsd-controller sees a `ServiceExport`, a
`ServiceImport` will be created, which can be largely ignored by the user.

A `ServiceImport` is a service that may have endpoints in other clusters.
This includes 3 scenarios:

1. This service is running entirely in different cluster(s).
1. This service has endpoints in other cluster(s) and in this cluster.
1. This service is running entirely in this cluster, but is exported to other
   cluster(s) as well.

For each exported service, one `ServiceExport` will exist in each cluster that
runs the service. The mcsd-controller will create and maintain a derived
`ServiceImport` in each cluster within the supercluster (see: [constraints and
conflict resolution](#constraints-and-conflict-resolution)). If all
`ServiceExport` instances are deleted, each `ServiceImport` will also be
deleted from all clusters.

Since a given `ServiceImport` may be backed by multiple `EndpointSlices`, a
given `EndpointSlice` will reference its `ServiceImport` using the label
`multicluster.kubernetes.io/imported-service-name`, similarly to how an
`EndpointSlice` is associated with its `Service` in a single cluster. Each
imported `EndpointSlice` will also have a
`multicluster.kubernetes.io/source-cluster` label with a registry-scoped unique
identifier for the cluster.

```golang
// ServiceImport describes a service imported from other clusters in a
// supercluster.
type ServiceImport struct {
  metav1.TypeMeta `json:",inline"`
  // +optional
  metav1.ObjectMeta `json:"metadata,omitempty"`
  // +optional
  Spec ServiceImportSpec `json:"spec,omitempty"`
}

// ServiceImportSpec contains the current status of an imported service and the
// information necessary to consume it.
type ServiceImportSpec struct {
  // +patchStrategy=merge
  // +patchMergeKey=port
  // +listType=map
  // +listMapKey=port
  // +listMapKey=protocol
  Ports []ServicePort `json:"ports"`
  // +optional
  // +patchStrategy=merge
  // +patchMergeKey=cluster
  // +listType=map
  // +listMapKey=cluster
  Clusters []ClusterSpec `json:"clusters"`
  // +optional
  IPFamily corev1.IPFamily `json:"ipFamily"`
  // +optional
  IP string `json:"ip,omitempty"`
}

// ClusterSpec contains service configuration mapped to a specific cluster.
type ClusterSpec struct {
  Cluster string `json:"cluster"`
  // +optional
  // +listType=set
  TopologyKeys []string `json:"topologyKeys"`
  // +optional
  PublishNotReadyAddresses bool `json:"publishNotReadyAddresses"`
  // +optional
  SessionAffinity corev1.ServiceAffinity `json:"sessionAffinity"`
  // +optional
  SessionAffinityConfig *corev1.SessionAffinityConfig `json:"sessionAffinityConfig"`
}
```
```yaml
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-svc
  namespace: my-ns
spec:
  ipFamily: IPv4
  ip: 42.42.42.42
  ports:
    - name: http
      protocol: TCP
      port: 80
  clusters:
    - cluster: us-west2-a-my-cluster
      topologyKeys:
        - topology.kubernetes.io/zone
      sessionAffinity: None
---
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: imported-my-svc-cluster-b-1
  namespace: my-ns
  labels:
    multicluster.kubernetes.io/source-cluster: us-west2-a-my-cluster
    multicluster.kubernetes.io/imported-service-name: my-svc
  ownerReferences:
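    # The imported slice is owned by the derived ServiceImport, so it is
    # cleaned up by garbage collection if the import is deleted.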
    - apiVersion: multicluster.k8s.io/v1alpha1
      controller: false
      kind: ServiceImport
      name: my-svc
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 80
endpoints:
  - addresses:
      - "10.1.2.3"
    conditions:
      ready: true
    topology:
      topology.kubernetes.io/zone: us-west2-a
```

The `ServiceImport.Spec.IP` (VIP) can be used to access this service from
within this cluster.

## Constraints and Conflict Resolution

Exported services are derived from the properties of each component service and
their respective endpoints. However, some properties combine across exports
better than others. They generally fall into two categories: global properties
and component-level properties.

### Global Properties

These properties describe how the service should be consumed as a whole. They
directly impact service consumption and must be consistent across all child
services. If these properties are out of sync for a subset of exported
services, there is no clear way to determine how a service should be accessed.
**If any global properties have conflicts that cannot be resolved, a condition
will be set on the `ServiceExport` with a description of the conflict. The
service will not be synced, an error will be set on the status of each affected
`ServiceExport`, and any previously-derived `ServiceImports` will be deleted
from each cluster in the supercluster.**

#### Service Port

A derived service will be accessible with the supercluster IP at the ports
dictated by child services. If the external properties of service ports for a
set of exported services don’t match, we won’t know which port is the correct
choice for a service. For example, if two exported services use different ports
with the name “http”, which port is correct? What if a service uses the same
port with different names? As long as there are no conflicts (different ports
with the same name), the supercluster service will expose the superset of
service ports declared on its component services. If a user wants to change a
service port in a conflicting way, we recommend deploying a new service or
making the change in non-conflicting phases.

#### IP Family

Because IPv4 and IPv6 addresses cannot be safely intermingled (e.g. iptables
rules cannot mix IPv4 and IPv6), all component exported services making up a
supercluster service must use the same `IPFamily`. As in the single-cluster
case, a service’s `IPFamily` is immutable, so changing families will require a
new service to be created.

### Component Level Properties

These properties are export-specific and pertain only to the subset of
endpoints backed by a single instance of each exported service. They may be
safely carried throughout the supercluster without risk of conflict. We
propagate these properties forward with no attempt to merge or alter them.

#### Session Affinity

Session affinity affects a service as a whole for a given consumer. What would
it mean for a service to have e.g. client IP session affinity set for half its
backends? Would sessions only be sticky for those backends, or would there be
no affinity? If sessions are selectively sticky, we’d expect to see traffic
skew toward the sticky subset of endpoints. That said, there’s nothing
preventing us from applying affinity on a per-slice basis, so we will carry it
forward.
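
For example, a sketch of how per-cluster affinity settings could be carried on
a derived `ServiceImport` (cluster names and affinity values are hypothetical);
each entry applies only to `EndpointSlices` from its source cluster:

```yaml
# Sketch only: component-level properties are recorded per exporting cluster
# and are not merged. Cluster names and affinity settings are hypothetical.
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-svc
  namespace: my-ns
spec:
  ipFamily: IPv4
  ip: 42.42.42.42
  ports:
    - name: http
      protocol: TCP
      port: 80
  clusters:
    - cluster: us-west2-a-my-cluster
      sessionAffinity: ClientIP    # sticky sessions for this cluster's slices
      sessionAffinityConfig:
        clientIP:
          timeoutSeconds: 10
    - cluster: us-east1-b-my-cluster
      sessionAffinity: None        # no affinity for this cluster's slices
```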

#### TopologyKeys

A `Service`’s `topologyKeys` dictate how endpoints in all `EndpointSlices`
associated with a given service should be applied to each node. While a single
`Service` may have multiple `EndpointSlices`, each `EndpointSlice` will only
ever originate from a single `Service`. `ServiceImport` will contain
label-mapped lists of `topologyKeys` synced from each originating exported
service. Kube-proxy will filter endpoints in each slice based only on the
`topologyKeys` defined on the slice’s specific source `Service`.

#### Publish Not-Ready Addresses

Like `topologyKeys` above, we can apply `publishNotReadyAddresses` at the
per-slice level based on the originating cluster. This will allow incremental
rollout of changes without any risk of conflict. When true for a cluster, the
supercluster service DNS implementation must expose not-ready addresses for
slices from that cluster.

### Test Plan

### Graduation Criteria

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

## Drawbacks

## Alternatives

### `ObjectReference` in `ServiceExport.Spec` to directly map to a Service

Instead of name mapping, we could use an explicit `ObjectReference` in a
`ServiceExport.Spec`. This feels familiar and more explicit, but fundamentally
changes certain characteristics of the API. Name mapping means that the export
must be in the same namespace as the `Service` it exports, allowing existing
RBAC rules to restrict export rights to current namespace owners. We are
building on the concept that a namespace belongs to a single owner, and it
should be the `Service` owner who controls whether or not a given `Service` is
exported. Using `ObjectReference` instead would also open the possibility of
having multiple exports acting on a single service and would require more
effort to determine if a given service has been exported.

The above issues could also be solved via controller logic, but we would risk
differing implementations. Name mapping enforces behavior at the API level.

### Export services via label selector

Instead of name mapping, `ServiceExport` could have a
`ServiceExport.Spec.ServiceSelector` to select matching services for export.
This approach would make it easy to simply export all services with a given
label applied, and would still scope exports to a namespace, but shares other
issues with the `ObjectReference` approach above:

- Multiple `ServiceExports` may export a given `Service`; what would that mean?
- Determining whether or not a service is exported means searching
  `ServiceExports` for a matching selector.

Though multiple services may match a single export, the act of exporting would
still be independent for individual services. A report of status for each
export seems like it belongs on a service-specific resource.

With name mapping it should be relatively easy to build generic or custom logic
to automatically ensure a `ServiceExport` exists for each `Service` matching a
selector - perhaps by introducing something like a `ServiceExportPolicy`
resource (out of scope for this KEP). This would solve the above issues but
retain the flexibility of selectors.

### Export via annotation

`ServiceExport` as described has no spec and seems like it could just be
replaced with an annotation, e.g. `multicluster.kubernetes.io/export`. When a
service is found with the annotation, it would be considered marked for export
to the supercluster.
The controller would then create `EndpointSlices` and a
`ServiceImport` in each cluster exactly as described above. Unfortunately,
`Service` does not have an extensible status and there is no way to represent
the state of the export on the annotated `Service`. We could extend
`Service.Status` to include `Conditions` and provide the flexibility we need,
but requiring changes to `Service` makes this a much more invasive proposal to
achieve the same result. As the use of a multi-cluster service implementation
would be an optional addon, it doesn't warrant a change to such a fundamental
resource.

## Infrastructure Needed (optional)

diff --git a/keps/sig-multicluster/1645-multi-cluster-services-api/kep.yaml b/keps/sig-multicluster/1645-multi-cluster-services-api/kep.yaml
new file mode 100644
index 00000000000..7222ce0c64b
--- /dev/null
+++ b/keps/sig-multicluster/1645-multi-cluster-services-api/kep.yaml
@@ -0,0 +1,14 @@
title: Multi-Cluster Services API
kep-number: 1645
authors:
  - "@jeremyot"
owning-sig: sig-multicluster
participating-sigs:
  - sig-network
status: provisional
creation-date: 2020-03-30
reviewers:
  - TBD
approvers:
  - "@pmorie"
  - "@thockin"