Add support for service topology RFC #23
Conversation
Signed-off-by: Matei David <[email protected]>
Just a couple questions from me, @adleong want to take a look?
> To implement support for this feature, my main assumption is that most of the work will be carried out in the `destination service` on the control plane side. Although the feature does not add a lot of changes in terms of the Kubernetes API, several changes have to be made. The first and most obvious to me is supporting the `endpointSlices` feature that was introduced in Kubernetes `v1.16`, since this will enable endpoints to hold topology metadata.
> All meshed pods have a proxy that will first establish all outbound endpoints of a service the pod is targeting. Through this change, we will introduce a new weight to each endpoint based on locality, which can afterwards be used to add subsequent features that build on locality-awareness. I propose that the weights be added to each endpoint on the proxy side. To ensure backwards compatibility, there should be a separate gRPC call that can be made to the destination service to get a service's topology preference, and the host's (node) topology labels. The call would include the FQDN/IP of a service, and a topology identifier (similar to the context namespace used for service profiles).
Should this be an extra call or just a field addition to the existing response?
You mention that weights should be added on the proxy side, what actually decides the weight itself?
I started up the destination service locally and had a play with the test destination-client (whose existence eluded me until today). To address the questions:
> Should this be an extra call or just a field addition to the existing response?
After I had a look at what the response looks like in practice, I realised I was confused about how endpoint updates work.
I think it would make more sense to add an additional field (or fields) to the existing response. In my mind, when the connection is first made as part of `Destination.Get`, we should include, along with the service FQDN, the (source) node identifier I previously mentioned. In the response, we can add an additional `topology` field that can be populated with the service preference and the source node topology label. Each address can include its topology as part of its labels.
For example, based on the output structure of the test destination client, something like this could work:
```
Add:
  labels: <addressSet-metrics-labels>
  svcTopology: <service topology labels>
  nodeTopology: <node topology labels>
  - 172.17.0.19:8080
    labels: <control_plane_ns, deployment (...), kubernetes.io/hostname, topology.zone>
```
What I am unsure of is the implications for performance.
> You mention that weights should be added on the proxy side, what actually decides the weight itself?
I think that, according to the inner workings of the service topology feature, the weight should depend on the preference listed in the spec. If the weights are to be calculated based on a cost function, it would make the most sense to me to have them calculated on the proxy side; otherwise, we would have to keep track of each endpoint that is added/removed. For example, if the topology preference of a service were:
- host
- zone
and for that same service we have two endpoints on the same host and one endpoint in the same zone, then the weights would be calculated based on a cost function which would minimise, in this case, the network hops/distance. The weights would then be relative to:
a) the topology preference, as previously mentioned
b) the number of endpoints in each topology.
If we add another endpoint on the same host, all of the weights would have to be rescaled to account for the change, which would probably be harder to do on the destination service side.
I'm not sure if I am overcomplicating this; as a result, I'm having a bit of difficulty coming up with a way to assign weights.
If it's just a case of priorities, then the whole task becomes simpler and can be done on the destination service side. Using the same example, all endpoints on the same host would have a priority of 1, and those in the same zone a priority of 2.
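For illustration, here is a minimal Go sketch of that priority assignment, assuming we already have the ordered topology keys from the service spec and the source node's labels; the names (`Endpoint`, `assignPriorities`) are hypothetical and not part of the destination service:

```go
// Hypothetical sketch only: assign a priority to each endpoint based on how
// early in the service's topology preference its locality matches the source node.
package main

import "fmt"

type Endpoint struct {
	Addr     string
	Topology map[string]string // e.g. "kubernetes.io/hostname": "node-a"
}

// assignPriorities returns a map from address to priority, where 1 is the most
// preferred (e.g. same host) and larger numbers are further away (e.g. same zone).
func assignPriorities(topologyKeys []string, nodeLabels map[string]string, endpoints []Endpoint) map[string]int {
	priorities := make(map[string]int)
	for _, ep := range endpoints {
		for i, key := range topologyKeys {
			v, ok := nodeLabels[key]
			if key == "*" || (ok && ep.Topology[key] == v) {
				priorities[ep.Addr] = i + 1
				break
			}
		}
	}
	return priorities
}

func main() {
	keys := []string{"kubernetes.io/hostname", "topology.kubernetes.io/zone"}
	node := map[string]string{"kubernetes.io/hostname": "node-a", "topology.kubernetes.io/zone": "east-1a"}
	eps := []Endpoint{
		{Addr: "172.17.0.19:8080", Topology: map[string]string{"kubernetes.io/hostname": "node-a", "topology.kubernetes.io/zone": "east-1a"}},
		{Addr: "172.17.0.20:8080", Topology: map[string]string{"kubernetes.io/hostname": "node-b", "topology.kubernetes.io/zone": "east-1a"}},
	}
	// 172.17.0.19:8080 gets priority 1 (same host), 172.17.0.20:8080 gets priority 2 (same zone).
	fmt.Println(assignPriorities(keys, node, eps))
}
```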
As a side note, your question made me dig a bit deeper into how other service meshes handle load balancing based on locality.
What caught my eye was the priority level associated with each locality. I'm not sure if I'm on the right track here, but it seemed related to the problem as I understood it.
> All meshed pods have a proxy that will first establish all outbound endpoints of a service the pod is targeting. Through this change, we will introduce a new weight to each endpoint based on locality, which can afterwards be used to add subsequent features that build on locality-awareness. I propose that the weights be added to each endpoint on the proxy side. To ensure backwards compatibility, there should be a separate gRPC call that can be made to the destination service to get a service's topology preference, and the host's (node) topology labels. The call would include the FQDN/IP of a service, and a topology identifier (similar to the context namespace used for service profiles).
> The destination service will resolve the FQDN/IP in its usual fashion; it will now also hold in cache the endpoint slices of the service (which have topology metadata). As part of every service that is cached, the destination service should add (if they exist) the topology keys in the order of preference of the owner. Based on the locality identifier of the source, the topology preference of the service, and the available endpoints, each endpoint can have a topology weight assigned to it. The proxy receives its endpoint(s) as usual and can assign weights based on the information it receives from the separate topology call and the topology metadata associated with each endpoint(s).
Do you have a handle on how `EndpointSlice` will impact our discovery? What does the watch look like for the proxy and destination service?
From what I currently understand, this is the whole process (including the watch) for service discovery:
Based on the FQDN/IP of the service the pod is trying to access, the proxy will build a tower stack for the outbound traffic. The stack includes the load balancing layer, which wraps around a discovery crate and a resolver. A connection is established to the destination service via the discovery layer; the layer contains a buffer and a watchdog to poll for updates. (Apologies if any of this is wrong so far; I'm still getting to grips with Rust.)
Meanwhile, in the destination service, the server accepts calls from the proxy in the form of `Destination.Get`. The service name is first resolved, and a weighted address set is sent back to the proxy. Every time a new address is discovered for a service, a new weighted set is created with its endpoint and sent to the proxy. All CRUD-type updates are propagated from the destination service to the proxy.
The destination service actively watches and caches all services and endpoints (all except whatever is in kube-system ns). Each service keeps track of its ports by identifying them through their hostname and port number.
Endpoint slices differ from endpoints through a few spec changes:
- Hostnames are no longer included as part of an address entry; instead, they have been replaced by the topology label map;
- An Endpoints resource contains ALL network endpoints in a single resource, whereas endpoint slices contain at most 100 addresses each by default. Because of this, there is no longer a 1:1 relation between an Endpoints resource and a service. Instead, a service can have many endpoint slices, so each endpoint slice name (id) is randomly generated based on the service name, and the service it belongs to has to be inferred from its labels.
- Endpoint slices support dual-stack endpoints, so an address entry can, in theory, contain both an IPv4 and an IPv6 address.
If I understand the question correctly, by introducing `EndpointSlice`, most of the work on the discovery watcher will be geared around making it work with the current implementation of the port/service publishers. Beyond that, at the moment I can't see anything else that will have to be changed, but in case I have it wrong I can keep looking and explore it a bit more in depth.
I haven't explored much (yet) how the `portPublisher` will be affected by these changes, but my assumption is that the subscription methods will not be affected; even though addresses are split between many resources, it is still possible to consolidate them into a single `portPublisher` when they are mapped to the same port.
tl;dr: at the moment I think the impact will mainly be on integrating the `EndpointSlices` discovery informer and its handler functions to work with all of the code in the `endpoint_watcher`. Instead of a 1:1 mapping between services and endpoints, we (can) now have many endpoint slice resources mapped to the same service, so they would all have to be kept in sync and consolidated.
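To make that consolidation concrete, here is a rough Go sketch assuming the `k8s.io/api/discovery/v1beta1` types; `serviceID` and `ownerOf` are hypothetical names, not linkerd2 code. The owning service has to be read from the `kubernetes.io/service-name` label because slice names are generated:

```go
// Rough sketch (not linkerd2 code): group EndpointSlice resources back to the
// Service that owns them, so several slices can be consolidated per service/port.
package main

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serviceID is a hypothetical identifier for the owning Service.
type serviceID struct{ Namespace, Name string }

// ownerOf infers the owning Service from the slice's labels; the slice name
// itself is generated, so it cannot be used for this.
func ownerOf(slice *discovery.EndpointSlice) (serviceID, bool) {
	name, ok := slice.Labels["kubernetes.io/service-name"]
	return serviceID{Namespace: slice.Namespace, Name: name}, ok
}

func main() {
	slice := &discovery.EndpointSlice{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "web-abc12", // generated name
			Namespace: "default",
			Labels:    map[string]string{"kubernetes.io/service-name": "web"},
		},
	}
	if svc, ok := ownerOf(slice); ok {
		fmt.Printf("slice %s belongs to service %s/%s\n", slice.Name, svc.Namespace, svc.Name)
	}
}
```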
Thank you @Matei207 for this incredibly well thought out and well researched proposal! This is really fantastic work. I've proposed a simplification of the design; please take a look and let me know if that makes sense or if there are any complications that I have missed.
> To implement support for this feature, my main assumption is that most of the work will be carried out in the `destination service` on the control plane side. Although the feature does not add a lot of changes in terms of the Kubernetes API, several changes have to be made. The first and most obvious to me is supporting the `endpointSlices` feature that was introduced in Kubernetes `v1.16`, since this will enable endpoints to hold topology metadata.
> All meshed pods have a proxy that will first establish all outbound endpoints of a service the pod is targeting. Through this change, we will introduce a new weight to each endpoint based on locality, which can afterwards be used to add subsequent features that build on locality-awareness. I propose that the weights be added to each endpoint on the proxy side. To ensure backwards compatibility, there should be a separate gRPC call that can be made to the destination service to get a service's topology preference, and the host's (node) topology labels. The call would include the FQDN/IP of a service, and a topology identifier (similar to the context namespace used for service profiles).
The word "weight" appears frequently in this RFC and I think it may help avoid confusion if we use the term "priority" instead. Linkerd already has the concept of endpoint weighting: endpoints with higher weights receive a higher proportion of traffic (this is used to do traffic splitting). In contrast, if we introduced topology aware routing and priorities based on the topology labels, traffic should be sent ONLY to the endpoints with the highest priority.
It took me a while to arrive at the conclusion that weights should, in fact, be priorities, so I completely agree with this. My initial assumption was that we would have two separate sets of weights that would be used first to do a topological sort and then to do traffic splitting. After more research on the matter, I concluded that weights would be unfeasible. I will make all necessary changes to reflect in the RFC that we are talking of priorities and not weights. Thank you for your comment, it helped me dispel some doubts that I had!
> * Both `service topology` and `endpoint slices` are not active in a cluster by default. Since they are alpha (and beta) features respectively, they need to be activated by a feature gate. Earlier in the design section I mentioned that we shouldn't use an endpoint watcher and an endpoint slices watcher at the same time. My suggestion is to include a feature flag in the control plane config map. We can access the config map when we create a new server, based on this, when we fire up the watchers we can choose based on the flag which watcher to create.
What about doing something similar to this https://github.com/linkerd/linkerd2/blob/master/pkg/k8s/authz.go#L89 to determine if the EndpointSlices resource kind exists on the cluster?
Thank you for the pointer! That's a sensible way to do it and won't introduce any configuration.
I think an `EndpointSlices` watcher should be used instead of an `Endpoints` watcher if the former is activated in a cluster. However, this approach has a major drawback if the `Endpoints` are created manually for a service (without the use of a pod selector); in this case, the `EndpointSlices` will not be created and it's possible the endpoints for a service will not be registered. My question is: would this edge case be of concern to us and do we need to account for it, or would it be up to the adopter to ensure the `EndpointSlices` are created if the service does not include a pod selector?
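For reference, a minimal sketch of what such a capability check could look like with the plain client-go discovery client; this is an assumption on my part, not the actual `pkg/k8s/authz.go` helper:

```go
// Minimal sketch using the plain client-go discovery client.
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// endpointSlicesExist reports whether the cluster serves the EndpointSlice
// resource kind under discovery.k8s.io/v1beta1.
func endpointSlicesExist(dc discovery.DiscoveryInterface) (bool, error) {
	resources, err := dc.ServerResourcesForGroupVersion("discovery.k8s.io/v1beta1")
	if apierrors.IsNotFound(err) {
		return false, nil // the API group/version is not served at all
	}
	if err != nil {
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == "EndpointSlice" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}
	ok, err := endpointSlicesExist(dc)
	fmt.Println("EndpointSlices supported:", ok, "err:", err)
}
```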
This is a good question. I'd be curious to know how kube-proxy handles this situation. Does it watch both Endpoints and EndpointSlices and how does it reconcile differences between the two?
I've made all recommended changes to the RFC. I've made a note to look into this specific case and how the `EndpointSlices` are implemented in the service topology PR; I will report back once I know a bit more about it.
> * **Service topology**: extend the current `servicePublisher` implementation to include the topology preference as described in the service spec. This implies making changes to the handler functions of the watcher.
> * **Adding weights to addresses**: to add weights to an address, we need access to the service topology, the source node (its topology labels) and the endpoint topology. If the source node name is sent by the proxy, the node's topology values can be trivially accessed using the `K8s API`. There are two edge cases that exist under my assumptions however, which makes weighting the addresses a bit harder than it should be on the destination service side. First, two source proxies can listen on the same service, because of this, I think it is infeasible to add the weighting logic on subscription. Second, we have to also consider updates -- while on update we can weight a endpoint based on the current endpoints that already exist in a stream, there is a high chance there might be no available endpoints, so inferring weights from existing endpoints when updating also does not sound feasible (even if there is a 1:1 relation between a source and a service stream).
I'm not sure I totally understand your points here about not being able to make topology decisions on subscription or update.
Here's my first instinct for how I would approach this:
If we add a `nodeName` context token field to the Destination.Get method, then the destination controller should have all the information it needs to make topology aware routing decisions and return the correct list of endpoints to the proxy (i.e. the endpoints with the highest priority for that node).
I think this decision can be made in the EndpointTranslator. If the EndpointTranslator has the list of endpoints (including their topology metadata), the topology keys from the service, and the node name of the subscribing proxy, it can look up the topology labels for that node and use that to filter down the list of endpoints to only the ones that match the node's label for the first topology key.
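As a rough sketch of that filtering step (hypothetical names, not the actual EndpointTranslator), assuming we already have the service's ordered topology keys, the subscribing node's labels, and each endpoint's topology map:

```go
// Hypothetical sketch: keep only the endpoints that match the node's value for
// the first topology key that yields any match; "*" means "any endpoint".
package main

import "fmt"

type endpoint struct {
	addr     string
	topology map[string]string
}

func filterByTopology(keys []string, nodeLabels map[string]string, eps []endpoint) []endpoint {
	for _, key := range keys {
		if key == "*" {
			return eps // wildcard: everything is eligible
		}
		nodeValue, ok := nodeLabels[key]
		if !ok {
			continue
		}
		matched := []endpoint{}
		for _, ep := range eps {
			if ep.topology[key] == nodeValue {
				matched = append(matched, ep)
			}
		}
		if len(matched) > 0 {
			return matched // first key with any match wins, per the service topology spec
		}
	}
	return nil // no match and no "*" fallback: nothing to route to
}

func main() {
	keys := []string{"kubernetes.io/hostname", "topology.kubernetes.io/zone", "*"}
	node := map[string]string{"kubernetes.io/hostname": "node-a", "topology.kubernetes.io/zone": "east-1a"}
	eps := []endpoint{
		{addr: "10.0.0.1:8080", topology: map[string]string{"kubernetes.io/hostname": "node-b", "topology.kubernetes.io/zone": "east-1a"}},
		{addr: "10.0.0.2:8080", topology: map[string]string{"kubernetes.io/hostname": "node-a", "topology.kubernetes.io/zone": "east-1a"}},
	}
	fmt.Println(filterByTopology(keys, node, eps)) // only the node-a endpoint remains
}
```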
I think this approach simplifies a lot of things:
- the only proxy change is that it needs to send its node name (discovered via the downward API and an env variable). The proxy does not need any topology aware routing logic since the destination controller will just give it the correct set of endpoints to route to.
- no need for a new separate API to discover topology keys
@adleong thank you so much for your comments! Disregard the inability to make topology decisions on subscription or update, that was before I realised `weights` are to be `priorities`. My whole design focused on the ability to assign weights on the fly based on availability and the topological domain; assigning priorities instead makes this redundant and I don't see any reason why it should be done on the proxy side, so thank you for clarifying.
It makes a lot of sense to change the proposal based on your comments and guidelines, and I do agree that it brings a major simplification on the design. I have started to re-work the proposal. There is one thing that I want to address before I push any changes, however.
> filter down the list of endpoints to only the ones that match the node's label for the first topology key
I have thought about this and I realise it might sound a bit nitpicky, but what would happen if, by chance, all source pods were placed on the same node whereas the destination service's pods are spread out evenly across the cluster? I realise this might be a product of randomness and might not happen easily, but it could also be a product of affinity in certain cases.
Going by my previous example, if service A (whose pods are on the same node) calls service B (whose pods are spread out evenly), then all of the calls would be routed to the same service B pod(?). Would it make sense to account for such cases by also sending endpoints from the second topology key that matches the node's label?
This is a valid concern but I think we should stick to the service topology specification: https://kubernetes.io/docs/concepts/services-networking/service-topology/
> Traffic will be directed to a Node whose value for the first label matches the originating Node's value for that label. If there is no backend for the Service on a matching Node, then the second label will be considered, and so forth, until no labels remain.
So, in the case that you describe, all traffic from A to B should go to the B pod which is on the same node as the A pods.
Alright, that makes sense to me.
Replaced 'weight' with 'priority' to clear confusion, renamed file, added reviewers, RFC PR link, changed a major part on the proxy to reflect new ideas, and changed how 'weights' are thought of and computed on the destination service to be priorities that are filtered in the listener
Signed-off-by: Matei David <[email protected]>
This looks fantastic!
Signed-off-by: Matei David <[email protected]>
[Link to RFC](linkerd/rfc#23)

### What

* PR that puts together all past pieces of the puzzle to deliver topology-aware service routing, as specified in the [Kubernetes docs](https://kubernetes.io/docs/concepts/services-networking/service-topology/) but with a much better load balancing algorithm and all the coolness of linkerd :)
* The first piece of this PR is focused on adding topology metadata: topology preference for services and topology `<k,v>` pairs for endpoints.
* The second piece of this PR puts together the new context format and fetching the source node topology metadata in order to allow for endpoints filtering.
* The final part is doing the filtering -- passing all of the metadata to the listener and, on every `Add`, filtering endpoints based on the topology preference of the service, the topology `<k,v>` pairs of the endpoints, and the topology of the source (again `<k,v>` pairs).

### How

* **Collecting metadata**:
  - Services do not have values for topology keys -- the topological keys defined in a service's spec are only there to dictate locality preference for routing; as such, I decided to store them in an array. They will be taken exactly as they are found in the service spec; this ensures we respect the preference order.
  - For EndpointSlices, we are using a map -- an EndpointSlice has locality information in the form of `<k,v>` pairs, where the key is a topological key (similar to what's listed in the service) and the value is the locality information -- e.g. `hostname: minikube`. For each address we now have a map of topology values which gets populated when we translate the endpoints to an address set. Because normal Endpoints do not have any topology information, we create each address with an empty map which is subsequently populated ONLY for slices in the `endpointSliceToAddressSet` function.
* **Filtering endpoints**:
  - This was a tricky part and filled me with doubts. I think there are a few ways to do this, but this is how I "envisioned" it. First, the `endpoint_translator.go` should be the one to do the filtering; this means that on subscription, we need to feed all of the relevant metadata to the listener. To do this, I created a new function `AddTopologyFilter` as part of the listener interface.
  - To complement the `AddTopologyFilter` function, I created a new `TopologyFilter` struct in `endpoints_watcher.go`. I then embedded this structure in all listeners that implement the interface. The structure holds the source topology (source node), a boolean to tell if slices are activated in case we need to double check (or write tests for the function), and the service preference. We create the filter on Subscription -- we have access to the k8s client here as well as the service, so it's the best point to collect all of this data together. Addresses all have their own topology added to them, so they do not have to be collected by the filter.
  - When we add a new set of addresses, we check to see if slices are enabled -- chances are if slices are enabled, service topology might be too. This lets us skip this step if the latest version is not adopted. Prior to sending an `Add` we filter the endpoints -- if the preference is registered by the filter we strictly enforce it, otherwise nothing changes.

And that's pretty much it.

Signed-off-by: Matei David <[email protected]>
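As an illustration of the metadata-collection step described above, a rough sketch of how per-endpoint topology could be carried onto each translated address might look like the following; `Address` and `sliceToAddresses` are stand-ins, not the actual `endpointSliceToAddressSet` code:

```go
// Illustrative sketch only: carry each EndpointSlice endpoint's topology map
// onto the translated address, so the listener can filter later.
package main

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1beta1"
)

// Address is a stand-in for the watcher's address type, extended with topology.
type Address struct {
	IP       string
	Port     int32
	Topology map[string]string // empty for plain Endpoints, populated for slices
}

func sliceToAddresses(slice *discovery.EndpointSlice, port int32) []Address {
	addrs := []Address{}
	for _, ep := range slice.Endpoints {
		for _, ip := range ep.Addresses {
			topo := map[string]string{}
			for k, v := range ep.Topology {
				topo[k] = v // e.g. "kubernetes.io/hostname": "minikube"
			}
			addrs = append(addrs, Address{IP: ip, Port: port, Topology: topo})
		}
	}
	return addrs
}

func main() {
	slice := &discovery.EndpointSlice{
		Endpoints: []discovery.Endpoint{{
			Addresses: []string{"172.17.0.19"},
			Topology: map[string]string{
				"kubernetes.io/hostname":      "minikube",
				"topology.kubernetes.io/zone": "zone-a",
			},
		}},
	}
	fmt.Printf("%+v\n", sliceToAddresses(slice, 8080))
}
```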