service topologies: topology-aware service routing #4780

Merged (2 commits, Aug 18, 2020)
Conversation

@mateiidavid (Member) commented Jul 22, 2020:

Link to RFC
Updated (29 Jul): Initially, this PR was supposed to be only for adding topology preference to servicePublisher and for collecting topology information from slices. However, the PR (now marked as a draft) covers the prioritisation of endpoints for topology-aware service routing.

What


  • PR that puts together all past pieces of the puzzle to deliver topology-aware service routing, as specified in the Kubernetes docs but with a much better load balancing algorithm and all the coolness of linkerd :)
  • The first piece of this PR is focused on adding topology metadata: topology preference for services and topology <k,v> pairs for endpoints.
  • The second piece of this PR puts together the new context token format and fetches the source node's topology metadata in order to allow for endpoint filtering.
  • The final part does the filtering -- passing all of the metadata to the listener and, on every Add, filtering endpoints based on the topology preference of the service, the topology <k,v> pairs of the endpoints, and the topology of the source (again <k,v> pairs).

How


  • Collecting metadata:

    • Services do not have values for topology keys -- the topological keys defined in a service's spec are only there to dictate locality preference for routing. As such, I decided to store them in an array; they are taken exactly as they appear in the service spec, which ensures we respect the preference order.

    • For EndpointSlices, we are using a map -- an EndpointSlice carries locality information in the form of <k,v> pairs, where the key is a topological key (similar to what's listed in the service) and the value is the locality information, e.g. hostname: minikube. Each address now has a map of topology values which gets populated when we translate the endpoints to an address set. Because plain Endpoints do not have any topology information, we create each address with an empty map, which is subsequently populated ONLY for slices in the endpointSliceToAddressSet function (a rough sketch of these shapes follows this list).

  • Filtering endpoints:

    • This was a tricky part and filled me with doubts. I think there are a few ways to do this, but this is how I "envisioned" it. First, the endpoint_translator.go should be the one to do the filtering; this means that on subscription, we need to feed all of the relevant metadata to the listener. To do this, I created a new function AddTopologyFilter as part of the listener interface.

    • To complement the AddTopologyFilter function, I created a new TopologyFilter struct in endpoints_watcher.go. I then embedded this structure in all listeners that implement the interface. The structure holds the source topology (source node), a boolean to tell if slices are activated in case we need to double check (or write tests for the function) and the service preference. We create the filter on Subscription -- we have access to the k8s client here as well as the service, so it's the best point to collect all of this data together. Addresses all have their own topology added to them so they do not have to be collected by the filter.

    • When we add a new set of addresses, we check to see if slices are enabled -- chances are that if slices are enabled, service topology might be too. This lets us skip this step when the newer API is not adopted. Prior to sending an Add we filter the endpoints -- if a preference is registered by the filter we strictly enforce it; otherwise nothing changes.
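
To make these shapes concrete, here is a rough sketch -- Address, AddressSet, and the Topology field follow the description above; the package name and everything else is purely illustrative:

package watcher // illustrative package name

// Topology preference from the service spec, kept in the order it appears
// there so the preference order is respected.
type topologyPreference []string // e.g. ["kubernetes.io/hostname", "*"]

// Each address carries its own topology <k,v> pairs, e.g.
// "kubernetes.io/hostname" -> "minikube". Plain Endpoints leave the map
// empty; only endpointSliceToAddressSet populates it.
type Address struct {
	IP       string
	Port     uint32
	Topology map[string]string
}

// AddressSet groups addresses; keyed here by an opaque ID for illustration.
type AddressSet struct {
	Addresses map[string]Address
}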

And that's pretty much it.

Points of concern


  • I have been thinking about this for a while. I think this architectural choice should work fine, but is there any cause for concern when it comes to scaling? What would happen when a cluster has 20 services with 5 EndpointSlices each -- would adding the extra metadata (topology map) degrade performance? I would appreciate some rigorous feedback in this area to figure out if this is the right choice.

  • Again, related to the architecture, I am not sure what other ways we have to collect all of the data together other than having a dedicated structure or storing the data in the listener structure. There are a couple of pain points here: we could pass everything as an argument to Add, but there is no point where we both send an Add AND have access to the k8s client and the service in question. We could initialise a client and fetch all of this data, but I think that would slow everything down. Doing everything in Subscribe seems to me the simpler and more efficient way, but as always I look forward to finding a better and cleaner way to do this.

  • Finally, I left some scattered comments throughout the PR -- this is intentional; they capture alternative approaches and concerns over performance or behaviour that I have. I will clean them up when the draft is in a good state; feel free to ignore them or debate them. I wanted to show my reasoning and ask indirect questions of the reviewer.

  • At the moment, if a service changes its topology, all pods have to be re-subscribed (i.e. deleted) in order for the change to be enforced. This isn't good (?).

Tests


  • Testing is not very straightforward, so I'll detail how I tested everything. I'm collating more cases; for now I have only tested that the preference is enforced.

  • To test this, minikube won't help much -- we need at least two nodes with different hostnames, regions and zones. Kind was the best option for me, since it's bootstrapped with kubeadm -- you can label each node as you want on creation so you can test multiple scenarios. You can find my kind config here.

  • Next, I decided to use emojivoto as my test services. web-svc feeds into emoji-svc, so to test topology I edited the emoji deployment: scaled replicas to 2 (I only used 2 nodes; 3+ kind nodes make my laptop take off) -- there is no need for affinity in this case because Kubernetes will automatically spread the pods for resilience. For 3 nodes, scale to 3 replicas, etc. You can use anti-affinity to space the pods out if you wish. Next is adding the topological preference to the service; in emoji-svc add:

topologyKeys:
  - "kubernetes.io/hostname"
  - <other keys>

Supported labels are ["kubernetes.io/hostname", "topology.kubernetes.io/region", "topology.kubernetes.io/zone","*"]

Note: * has to go last. Topologies can be re-edited, but to change the preference I had to first delete the topology and then re-write it (I edited the service using kubectl edit).

  • The final step is to get the logs for the destination service; to kick everything into gear I deleted the web pod to force re-subscription. I have added quite a few debug statements so you can observe what's going on -- these will not stay in, but they help to figure out what the call trace looks like. Make sure you start the controllers with debug log level.

For my test, this is what we get for 2 replicas of emoji spread across 2 nodes, and 1 replica of web connecting with a topology preference of node-local:

time="2020-07-29T16:11:10Z" level=info msg="Establishing endpoint topology filter with preference: [kubernetes.io/hostname]" addr=":8086" component=endpoints-watcher
time="2020-07-29T16:11:10Z" level=debug msg="Adding topology filter with preference [kubernetes.io/hostname] and node topology map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]" addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:11:10Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:11:10Z" level=debug msg="Locality [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:11:10Z" level=debug msg="Address [10.244.2.29:emoji-7dc976587b-qfgfr]: Topology is map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a] " addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:11:10Z" level=debug msg="Address [10.244.1.29:emoji-7dc976587b-nb2rc]: Topology is map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c] " addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:11:10Z" level=debug msg="Filtered 1 from a total of 2" addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster
time="2020-07-29T16:11:10Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763229}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-7dc976587b-nb2rc\"}  metric_labels:{key:\"pod_template_hash\"  value:\"7dc976587b\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  metric_labels:{key:\"namespace\"  value:\"emojivoto\"}  metric_labels:{key:\"service\"  value:\"emoji-svc\"}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:37800" service="emoji-svc.emojivoto.svc.cluster.local:8080"

To show it also works as normal with no preference specified, I edited the service, deleted the preference, and restarted the web pod:

time="2020-07-29T16:16:41Z" level=info msg="Establishing endpoint topology filter with preference: []" addr=":8086" component=endpoints-watcher
time="2020-07-29T16:16:41Z" level=debug msg="Adding topology filter with preference [] and node topology map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]" addr=":8086" component=endpoint-translator remote="127.0.0.1:47182" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:16:41Z" level=debug msg="Filtering through address set with preference []" addr=":8086" component=endpoint-translator remote="127.0.0.1:47182" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:16:41Z" level=debug msg="Did not filter!" addr=":8086" component=endpoint-translator remote="127.0.0.1:47182" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-07-29T16:16:41Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763485}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-7dc976587b-qfgfr\"}  metric_labels:{key:\"pod_template_hash\"  value:\"7dc976587b\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  addrs:{addr:{ip:{ipv4:183763229}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-7dc976587b-nb2rc\"}  metric_labels:{key:\"pod_template_hash\"  value:\"7dc976587b\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  metric_labels:{key:\"namespace\"  value:\"emojivoto\"}  metric_labels:{key:\"service\"  value:\"emoji-svc\"}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:47182" service="emoji-svc.emojivoto.svc.cluster.local:8080"

@mateiidavid requested a review from a team as a code owner, July 22, 2020 17:22
@adleong (Member) left a comment:

This looks good to me but I'd like to see it in the context of the full topology aware routing feature. It's a small enough change that we might consider just rolling it into a more feature complete PR rather than reviewing and merging it on its own. What do you think?

@mateiidavid marked this pull request as draft, July 23, 2020 13:05
@mateiidavid (Member Author) replied:

This looks good to me but I'd like to see it in the context of the full topology aware routing feature. It's a small enough change that we might consider just rolling it into a more feature complete PR rather than reviewing and merging it on its own. What do you think?

I think this sounds perfect. I've marked this PR as a draft, once #4771 is merged I can start adding more to this PR and discuss things further.

@mateiidavid changed the title from "service topologies: add topology labels to services and endpoints" to "service topologies: topology-aware service routing", Jul 29, 2020
@mateiidavid (Member Author) commented Jul 29, 2020:

Updated!

@adleong if you'd be so kind 😄

Also, unit tests are failing because I updated RBAC for the destination service to be able to get nodes. I won't update the golden files because the diff would get spammed; I can do that after the PoC is done.

@adleong (Member) left a comment:

Nice to see this coming together!

I have a suggestion for a restructuring of this that I think could simplify things. What if, instead:

  • the endpointTranslator struct holds the topology map of the source node. In other words newEndpointTranslator would add a k8s client parameter and a nodeName parameter, look up the topology label map for that node, and store the resulting map in the endpointTranslator struct. I think it is a fair assumption that these node labels will not change during the lifetime of the node.
  • Add the array of topology key preferences to the AddressSet struct. I think this makes sense because these preferences come from the service and this will make our life easier.
  • This allows the endpoints watcher to be entirely unaware of the filtering, and the endpoint translator has all the information it needs to do the filtering inline. It knows the source node's topology values, it knows the topology key preference from the Added AddressSet, and it knows the topology values for each address.
  • This lets us get rid of the TopologyFilter type and the AddTopologyFilter function.
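
A minimal sketch of this shape -- only TopologyPreference and the node-label lookup come from the points above; everything else, including field names, is illustrative:

// AddressSet carries the service's topology key preference alongside the
// addresses; the translator holds the source node's labels, looked up once
// in newEndpointTranslator and assumed stable for the node's lifetime.
type AddressSet struct {
	Addresses          map[string]Address // keyed by an opaque ID (illustrative)
	TopologyPreference []string           // copied from the service's topologyKeys
}

type endpointTranslator struct {
	nodeTopology map[string]string // source node labels, fetched once via the k8s client
	// ... existing fields (gRPC stream, logger, etc.)
}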

Here's where things get complicated, though:

Adding filtering introduces some new complexity to our diff-based (Add, Remove) API. Let me illustrate with an example: suppose that you're subscribed to a service with 1 endpoint local to your node and 3 endpoints on other nodes in your zone. If your topology preference says that you prefer node then zone, then the Destination API will return you an Add with the 1 node local endpoint. If that endpoint is later deleted, however, the endpoint watcher will call Remove to remove this endpoint, but we also need to somehow trigger an Add of the 3 zone local endpoints in order to fall back to these once no node local endpoints exist.

This is tricky. One approach would be to have the endpoint translator become more stateful and keep track of the total set of endpoints it knows about AND the filtered set of endpoints it has told the client about. When it receives an update (an Add, Remove, or NoEndpoints), it does the following:

  1. Updates the total set of endpoints based on this update
  2. Using the topology preference, calculates the filtered set of endpoints from the total set
  3. Diffs the calculated filtered set against the previously stored filtered set
  4. Transmits this diff to the client and updates the stored filtered set
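
A minimal sketch of that stateful approach -- the helper names filterByTopology, sendAdd, and sendRemove are hypothetical, not the actual implementation:

// The translator keeps the total set of known endpoints and the filtered set
// it has already sent; every update becomes a diff of filtered sets.
type translator struct {
	nodeTopology     map[string]string
	available        map[string]Address // total set of endpoints it knows about
	filteredSnapshot map[string]Address // what the client has been told about
}

func (t *translator) Add(set AddressSet) {
	for id, a := range set.Addresses {
		t.available[id] = a // 1. update the total set
	}
	t.publish(set.TopologyPreference)
}

func (t *translator) Remove(set AddressSet) {
	for id := range set.Addresses {
		delete(t.available, id) // 1. update the total set
	}
	t.publish(set.TopologyPreference)
}

func (t *translator) publish(pref []string) {
	// 2. recompute the filtered set from the total set
	filtered := filterByTopology(pref, t.nodeTopology, t.available)
	// 3. diff against the previously stored filtered set
	added := map[string]Address{}
	removed := map[string]Address{}
	for id, a := range filtered {
		if _, ok := t.filteredSnapshot[id]; !ok {
			added[id] = a
		}
	}
	for id, a := range t.filteredSnapshot {
		if _, ok := filtered[id]; !ok {
			removed[id] = a
		}
	}
	// 4. transmit the diff to the client and store the new snapshot
	t.sendAdd(added)      // hypothetical: builds and sends a destination Add
	t.sendRemove(removed) // hypothetical: builds and sends a destination Remove
	t.filteredSnapshot = filtered
}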

Let me know what you think.

}

// If we reach this point it means we either have no satisfied topologies
// or we encountered the wildcard. Return in both cases what we previously had
Review comment (Member):

I think if we reach the end of the list but do not encounter the wildcard, we should return NoEndpoints rather than the full set.

If this field is specified and all entries have no backends that match the topology of the client, the service has no backends for that client and connections should fail.

Reply from @mateiidavid (Member Author):

That's a very valid point of view. I thought similarly, but I didn't see any easy way of sending a NoEndpoints message since the filtering happens in Add, based on how I have done things. I think in this scenario we could either do the filtering before calling Add or return an empty set from the filtering method. Hope I didn't miss anything.
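
For illustration, here's a sketch of how the filtering loop could end so the caller can send NoEndpoints -- filterByTopology and the Address map are hypothetical names, not the actual code:

// Walk the service's topologyKeys in order; the first key under which at least
// one endpoint matches the source node wins. "*" matches everything. If the
// list is exhausted with no match and no wildcard, return an empty set.
func filterByTopology(pref []string, nodeTopology map[string]string, addrs map[string]Address) map[string]Address {
	for _, key := range pref {
		if key == "*" {
			return addrs // wildcard: keep every remaining endpoint
		}
		matched := map[string]Address{}
		for id, a := range addrs {
			if v, ok := nodeTopology[key]; ok && a.Topology[key] == v {
				matched[id] = a
			}
		}
		if len(matched) > 0 {
			return matched // strictly enforce the first satisfied preference
		}
	}
	return map[string]Address{} // nothing satisfied -> caller can send NoEndpoints
}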

@@ -109,7 +109,13 @@ func (s *server) Get(dest *pb.GetDestination, stream pb.Destination_GetServer) e
return status.Errorf(codes.InvalidArgument, "Invalid authority: %s", dest.GetPath())
}

var token contextToken
Review comment (Member):

if we move this block up, we can pass the node name into the newEndpointTranslator

Reply from @mateiidavid (Member Author):

Sounds great!

@@ -49,6 +51,7 @@ type (
OwnerKind string
Identity string
AuthorityOverride string
Topology map[string]string
Review comment (Member):

Topology is a slightly weird name for this field since these are more like topology coordinates. Maybe "TopologyLabels"?

Reply from @mateiidavid (Member Author):

Agreed, will switch it around.

@mateiidavid (Member Author) commented:

@adleong thank you for your thoroughness, as always. Your observations are excellent and I don't have much to add to them.

* the endpointTranslator struct holds the topology map of the source node.  In other words `newEndpointTranslator` would add a k8s client parameter and a nodeName parameter, look up the topology label map for that node, and store the resulting map in the endpointTranslator struct. 

I thought of the same approach at first but decided to go with the filter structure. I didn't know if it would be preferred to make another client and keep the node information in the translator -- I took it as a separation of concerns sort of situation where the listener should not know anything about the k8s API (incl clientset & idea of a node) and just deal with diffing addresses. On the other hand, having the concept of a filter to me seemed like it still enforces this "separation". Seeing as there is no cause for concern I'm more than happy to change it though, I think it's a better idea!

I think it is a fair assumption that these node labels will not change during the lifetime of the node.

Precisely. I think the only way this would be affected is if the labels are manually changed.

* Add the array of topology key preferences to the `AddressSet` struct.  I think this makes sense because these preferences come from the service and this will make our life easier.

...I completely missed this 🤕. An amazing suggestion I think.

* This allows the endpoints watcher to be totally unaware of the filtering entirely and the endpoint translator has all the information it needs to do filtering inline.  It knows the source nodes topology values, it knows the topology key preference from the Added AddressSet, and it knows the topology values for each address.

* This lets us get rid of the TopologyFilter type and the AddTopologyFilter function.

Here's where things get complicated, though:

Adding filtering introduces some new complexity to our diff-based (Add, Remove) API. Let me illustrate with an example: suppose that you're subscribed to a service with 1 endpoint local to your node and 3 endpoints on other nodes in your zone. If your topology preference says that you prefer node then zone, then the Destination API will return you an Add with the 1 node local endpoint. If that endpoint is later deleted, however, the endpoint watcher will call Remove to remove this endpoint, but we also need to somehow trigger an Add of the 3 zone local endpoints in order to fall back to these once no node local endpoints exist.

I think I really missed how simply this could have been done, and needlessly complicated everything.

then the Destination API will return you an Add with the 1 node local endpoint. If that endpoint is later deleted, however, the endpoint watcher will call Remove to remove this endpoint, but we also need to somehow trigger an Add of the 3 zone local endpoints in order to fall back to these once no node local endpoints exist.

Hm, this is a design choice that we haven't really covered. I think my initial assumption was a bit different.

  1. based on your scenario, my understanding is that when we have a list of preferences, we Add the endpoints for the first preference -- should the endpoint(s) fail, and no remaining endpoints satisfy our preference, we fall back on the next.

  2. my assumption was a more contract-oriented way (using the word contract very loosely) -- once our preference is satisfied we only stick to that preference, if the endpoints go down we don't fall back on the next preference.

I suppose the second approach has the advantage of fewer side effects, since we know the preference that has been satisfied will be enforced throughout the duration of the pod's life (as an ephemeral object this shouldn't be an issue; pods are short-lived). It invites a bit more interaction from the operator: if endpoints for the first preference come up later, you have to delete the (source) pods to make the subscription pick them up again. This also means less code complexity and (what I completely assume to be) better performance (??). The simplicity here stems from not having to deal with alternatives -- for example, what happens if we initially fall back to our second choice but then the first choice magically has a pod spin up: do we terminate the connection and switch back to the first preference?

The first approach that you have described is what I like best. It comes closer to what I imagined this feature to be. Would we only fall back, or also switch back to the first preference (as described above)? I think the main advantage of this is that it provides actual topology-aware routing and fallback mechanisms, and just overall makes everything better and more fun. The main downside I can think of is having the listener be more stateful; there are more things that can potentially go wrong.

This is all just at first glance, I'm operating on surface knowledge here so not sure if I'm right about it, but thought it'd be cool to talk about it.

This is tricky. One approach would be to have the endpoint translator become more stateful and keep track of the total set of endpoints it knows about AND the filtered set of endpoints it has told the client about. When it receives an update (an Add, Remove, or NoEndpoints), it does the following:

1. Updates the total set of endpoints based on this update

2. Using the topology preference, calculates the filtered set of endpoints from the total set

3. Diffs the calculated filtered set against the previously stored filtered set

4. Transmits this diff to the client and updates the stored filtered set

Let me know what you think.

To be honest, I think this is a very good way to do it. I will make all of the changes so you can see how it would look in practice and we can take it from there? Perhaps more things will become apparent once the code and my inconsistencies are out in public view, heh.

@adleong self-assigned this, Jul 30, 2020
@mateiidavid (Member Author) commented Aug 5, 2020:

@adleong (and whoever else feels like going through this PR) I have compiled a list of basic scenarios to ensure locality and fallback functionality are enforced when filtering. I have also uncovered a nasty bug where on update, slices wouldn't have any topology preferences assigned to them. The issue here was that the set in Add(set) came from the diff function, so I did not pass the topology pref to that particular set.

Tests:


When testing, I sought to:

  • ensure diff based API works (endpoint_translator.go)
  • ensure locality is enforced (i.e service topology)
  • ensure fallback works as expected (falling back, switching back, etc. check Alex's post above where he describes this w/ a scenario)
  • ensure wildcard is enforced & NoEndpoints sent when no suitable preference is found.

For these tests scenarios, my preference has been one of the three:

  1. [kubernetes.io/hostname]
  2. [kubernetes.io/hostname, topology.kubernetes.io/region]
  3. [kubernetes.io/hostname, *]
  • For a quick kind configuration for node labels (topology-labels) check the main post.
  • To "simulate" different spreads I used affinity, here are two helpful snippets to force pods to have a regional spread and to force all pods to be local:
# add this to deployment (I used emoji-svc which is accessed by web pods)

# affinity to force local, replace kind-worker2 with your preferred node
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
               - key: kubernetes.io/hostname
                 operator: In
                 values:
                 - kind-worker2

# affinity to allow local + region
# replace kind-worker2 with regional node
# replace kind-worker with local node

       affinity:
         nodeAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:
           - preference:
               matchExpressions:
               - key: kubernetes.io/hostname
                 operator: In
                 values:
                 - kind-worker
             weight: 1
           - preference:
               matchExpressions:
               - key: kubernetes.io/hostname
                 operator: In
                 values:
                 - kind-worker2
             weight: 99

# Note regional and local are according to your web pod(s) if using emoji svc

Test scenarios


Based on the above, here are my scenarios (along w/ logs)

  1. Prefer local
    1.a. on subscribe, for a set of endpoints {A:local,B:local,C:regional}
    translator recv: Add(A,B,C)
    client recv: Add(A,B)
		time="2020-08-05T16:08:03Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.1.68] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.1.70] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Filtered 2 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
*Send dest add*

NAME                        READY   STATUS        RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
emoji-788fdcd8-2h846        2/2     Running       0          63m   10.244.2.74   kind-worker    <none>           <none>
emoji-788fdcd8-fcmss        2/2     Running       0          65m   10.244.1.70   kind-worker2   <none>           <none>
emoji-788fdcd8-j5wqv        2/2     Running       0          65m   10.244.1.68   kind-worker2   <none>           <none>

1.b remove a local endpoint from {A, B, C}
translator recv: Remove(B)
client recv: Remove(B)

time="2020-08-05T16:11:03Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Considering address [10.244.1.70] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Filtered 1 from a total of 2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763268} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"

# (Note: filtered 1 from a total of 2 because at the start, we remove B from the list of total endpoints which is used for filtering)

1.c add a new local endpoint to {A,C}
translator recv: Add(X)
client recv: Add(X)

  time="2020-08-05T16:19:18Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:19:18Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.73] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.74] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.72] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Filtered 3 from a total of 4" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763274} port:8080}

NAME                        READY   STATUS        RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
emoji-785f4cbb66-g29kx      2/2     Running       0          47s   10.244.1.72   kind-worker2   <none>           <none>
emoji-785f4cbb66-hmkxr      2/2     Running       0          38s   10.244.1.73   kind-worker2   <none>           <none>
emoji-785f4cbb66-sbjpz      2/2     Running       0          26s   10.244.1.74   kind-worker2   <none>           <none>
emoji-788fdcd8-2h846        2/2     Terminating   0          74m   10.244.2.74   kind-worker    <none>           <none>
emoji-788fdcd8-fcmss        1/2     Terminating   0          76m   10.244.1.70   kind-worker2   <none>           <none>

# (Note: for the last one used affinity which is why we had so many pods; without affinity the pod kept popping up on worker2)
  2. Prefer local but fall back to region
    To have this work, update service topology to [local, region] then add affinity rules as described at the top of this post.
    2.a remove the local endpoint from {A: local, B: region, C: region}
    translator recv: Remove(A)
    client recv: Remove(A), Add(B,C)
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
emoji-78ffc6d99f-2zmdp      2/2     Running   0          3m5s    10.244.1.80   kind-worker2   <none>           <none>
emoji-78ffc6d99f-548pg      2/2     Running   0          3m17s   10.244.2.81   kind-worker    <none>           <none>
emoji-78ffc6d99f-86wq6      2/2     Running   0          2m51s   10.244.1.81   kind-worker2   <none>           <none>
vote-bot-6b5f6d9b8f-xbpms   2/2     Running   0          112m    10.244.1.67   kind-worker2   <none>           <none>
voting-755f879fd7-99tpt     2/2     Running   0          112m    10.244.1.69   kind-worker2   <none>           <none>
web-76868b75cb-m8594        2/2     Running   0          13m     10.244.2.79   kind-worker    <none>           <none>

time="2020-08-05T16:56:18Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:56:18Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Preferring endpoints with same topology topology.kubernetes.io/region as source west" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Filtered 2 from a total of 2" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763281} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-86wq6\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763280} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-2zmdp\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763537} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"

2.b add new local endpoint and switch traffic to local instead of region in {B: region, C: region}
translator recv: Add(X: local)
client recv: Remove(B,C) add (X)
This was done automatically as soon as the other case ended, since I deleted the pod and a new one came up

time="2020-08-05T16:56:33Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:56:33Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.2.82] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Filtered 1 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763538} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-nt79c\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763280} port:8080} addrs:{ip:{ipv4:183763281} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"

NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
emoji-78ffc6d99f-2zmdp      2/2     Running   0          6m12s   10.244.1.80   kind-worker2   <none>           <none>
emoji-78ffc6d99f-86wq6      2/2     Running   0          5m58s   10.244.1.81   kind-worker2   <none>           <none>
emoji-78ffc6d99f-nt79c      2/2     Running   0          2m44s   10.244.2.82   kind-worker    <none>           <none>
vote-bot-6b5f6d9b8f-xbpms   2/2     Running   0          115m    10.244.1.67   kind-worker2   <none>           <none>
voting-755f879fd7-99tpt     2/2     Running   0          115m    10.244.1.69   kind-worker2   <none>           <none>
web-76868b75cb-m8594        2/2     Running   0          16m     10.244.2.79   kind-worker    <none>           <none>
  3. NoEndpoints -- this works well when endpoints don't fit the preference, but it does not seem to work correctly all the time. Mostly it sends a Remove when deleting the only viable endpoint, but never a NoEndpoints. Needs a closer look on my side.

Update: NoEndpoints is no longer necessary. A NoEndpoints message tells the proxy to remove all endpoints; fortunately, the new diff-based API removes all stale endpoints according to preference anyway.

  4. Use wildcard *
    Add topology [local, *]
    When deleting the local pod (same as 2.a) then we will get the rest of the pods available regardless of their locality
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
emoji-78ffc6d99f-2zmdp      2/2     Running   0          24m     10.244.1.80   kind-worker2   <none>           <none>
emoji-78ffc6d99f-6f8dj      2/2     Running   0          2m13s   10.244.2.84   kind-worker    <none>           <none>
emoji-78ffc6d99f-86wq6      2/2     Running   0          23m     10.244.1.81   kind-worker2   <none>           <none>
vote-bot-6b5f6d9b8f-xbpms   2/2     Running   0          133m    10.244.1.67   kind-worker2   <none>           <none>
voting-755f879fd7-99tpt     2/2     Running   0          133m    10.244.1.69   kind-worker2   <none>           <none>
web-76868b75cb-m8594        2/2     Running   0          34m     10.244.2.79   kind-worker    <none>           <none>


time="2020-08-05T17:17:41Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T17:17:41Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname *]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763280} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-2zmdp\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763281} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-86wq6\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763540} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"

# Note: no debug message for when we choose *; this works -- it removed the deleted pod and added 2 pods (both on kind-worker2)

@mateiidavid requested a review from adleong, August 5, 2020 18:06
@adleong (Member) left a comment:

This looks fantastic. Going to do some manual testing...

(5 review comments on controller/api/destination/endpoint_translator.go, since resolved)
@mateiidavid (Member Author) commented:

@adleong resolved feedback comments, changed some unmentioned code comments for clarity and moved function blocks around to make the file more readable. Let me know what you think.

@zaharidichev self-requested a review, August 7, 2020 10:35
@zaharidichev (Member) left a comment:

Really fantastic work! Great feedback from @adleong. Will do some manual testing as well. +1 for the testing instructions!

@mateiidavid (Member Author) commented:

Update 11th of August: I have updated the test fixtures so unit tests can pass (modified RBAC). I spoke briefly to @zaharidichev today and thought it would be a good idea to get this out of the draft stage.

@adleong and @zaharidichev please let me know when you have time how the testing went and if there is anything that should be changed for this PR!

@mateiidavid marked this pull request as ready for review, August 11, 2020 13:15
@adleong (Member) commented Aug 11, 2020:

It doesn't seem like endpoint slices can be enabled during an upgrade:

bin/linkerd --context=kind-kind upgrade --enable-endpoint-slices=true --controller-log-level=debug | grep "endpointSliceEnabled"
    {"linkerdNamespace":"linkerd","cniEnabled":false,"version":"git-68d7c62b","identityContext":{"trustDomain":"cluster.local","trustAnchorsPem":"-----BEGIN CERTIFICATE-----\nMIIBvzCCAWagAwIBAgIQdU7b86RL5QlGUvONIdoMKTAKBggqhkjOPQQDAjApMScw\nJQYDVQQDEx5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjAwNTE1\nMjEwNzQyWhcNMzAwNTEzMjEwNzQyWjApMScwJQYDVQQDEx5pZGVudGl0eS5saW5r\nZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAAQZ+JdN\ng0VpwxYdfzuxJKA0rw3GOT71Ywf5zy4boMlGvvPoUzme31yxPD0H0oSJcC7VRuJI\nP4QEj/snUSpyIgVBo3AwbjAOBgNVHQ8BAf8EBAMCAQYwEgYDVR0TAQH/BAgwBgEB\n/wIBATAdBgNVHQ4EFgQUKa3lmDnBeD2Jd2RZAXrTtIuL76UwKQYDVR0RBCIwIIIe\naWRlbnRpdHkubGlua2VyZC5jbHVzdGVyLmxvY2FsMAoGCCqGSM49BAMCA0cAMEQC\nIFbNkryVCJOikJnpXJXmo6msX9i84bwzyxq3XuPQ7ybVAiA7eKeksZCCqUmvoLss\nKwkCnm9z772sm+5zaHJ3ewIcIQ==\n-----END CERTIFICATE-----\n","issuanceLifetime":"86400s","clockSkewAllowance":"20s","scheme":"linkerd.io/tls"},"autoInjectContext":null,"omitWebhookSideEffects":false,"clusterDomain":"cluster.local","endpointSliceEnabled":false}

@adleong (Member) commented Aug 12, 2020:

After playing with this for a little bit it seems to work great! The only issue I was able to find was that if you update a service and change its topology preference, that change doesn't take effect. I think this is because when we update a service in the endpointsWatcher, we don't propagate the topology preferences from the service to the current pp.addresses. Additionally, I think we would need to issue an empty Add to the endpoint translator to get it to pick up the new preferences.

@mateiidavid (Member Author) commented:

@adleong thank you for both of your comments!

I have opened up a PR to resolve the upgrade issue.

For the service update in the watcher, I will have a closer look and try to solve it ASAP. I made the mistake of thinking that when we update a service we automatically update all ports (which in turn re-adds all of the slices => new pp.addresses). I was wrong, though, because those updates happen only when a port is changed in the service.

I suppose we would need to send an empty Add to re-trigger the filtering rather than update the preference? Either way, I think sending an empty Add with the updated preference is less expensive than re-adding all of the slices.

@mateiidavid (Member Author) commented:

@adleong @zaharidichev I have pushed some new changes to fix the issue with services not updating. I would appreciate it if you could give me some feedback in terms of the code quality & style, as well as running through some test scenarios to ensure this works as expected.

  • Note: for tests, if you are using kind and replicating the scenario(s) detailed below, you will need to edit one of the nodes to make sure both (or all) worker nodes belong to the same region.

  • Note 2: commits are squashed now :) thought it was about time to clean up all those messy commit messages.

Changes


I have made two or three changes that I will detail here so you are aware of what my rationale has been.

  • endpoints_watcher.go: when we update a service, for each portPublisher in that structure, we first make sure that it holds the latest preference. Each port then updates its respective AddressSets, done through the new port.updateTopologyPreference() method. When we update the topology preference, for all listeners of a portPublisher we send an empty Add with a newly created AddressSet that holds the latest topology pref but no actual addresses or labels.

  • endpoints_translator.go:

      1. I moved up the blocks that add/remove endpoints from the set of total available endpoints. This is done to make sure that if we start with no topological preference in a service, we can still filter endpoints when the topology is later updated. Put plainly, if we never hold any state in the translator and then switch to filtering based on the current state, there won't be any state to filter on.
      2. I added a bit more logic to the if block for sets with no topological preference. Before, if the set had no topology preference we simply sent all of its addresses. This is not enough for updates, however, since a service that previously held a preference will send an empty add to the client but will not update the endpoints that have already been sent until a slice is edited.
        E.g. from a total set [A,B,C], if we previously had only sent A due to topology, then once we switch to no preference the client will still only have A as an endpoint until the slice is edited or a new slice is added.
        To address this, we now check how many addresses the set to be added has -- when it has more than 0, we treat it as before. If it has 0 addresses, it means it's a topology update, so we diff based on the previously filtered endpoints and only send endpoints that have not been previously sent. This is again to make sure that if we go from a preference to none, all endpoints are made known to the client, so we simply send all of the endpoints that had previously been filtered out.
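
A sketch of this flow -- portPublisher, updateTopologyPreference, and the empty Add come from the description above; the listeners field and the body are illustrative, not the actual code:

// On a service update, each portPublisher refreshes its preference and pushes
// an empty Add (no addresses, no labels) carrying the new preference, so each
// endpoint translator re-runs filtering against the state it already holds.
func (pp *portPublisher) updateTopologyPreference(topologyKeys []string) {
	pp.topologyPreference = topologyKeys
	for _, listener := range pp.listeners {
		listener.Add(AddressSet{
			Addresses:          map[string]Address{}, // zero addresses signals a topology-only update
			TopologyPreference: topologyKeys,
		})
	}
}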

Tests:


  • Caveat: to change a service's topology preference, I had to first delete the existing topology (e.g. from local to no_pref) and then edit the service again and add the new keys. If I simply edited the service, I got this error:
A copy of your changes has been stored to "/var/folders/4j/vjq4hnmn6tlf89n_kl55c9zh0000gn/T/kubectl-edit-wv0fo.yaml"
error: unable to find api field in struct ServiceSpec for the json field "topologyKeys"

Not sure if this is because I edited with kubectl edit svc instead of kubectl apply -f

  • To test I ran a long scenario where we go from no_pref to region to local and finally to no_pref again without deleting any pods or forcing the slice itself to update.

Initial pod state

# Pods
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
emoji-b79947575-2c54l       2/2     Running   0          3m24s   10.244.1.95   kind-worker2   <none>           <none>
emoji-b79947575-b7xb2       2/2     Running   0          2m52s   10.244.1.97   kind-worker2   <none>           <none>
emoji-b79947575-f9s65       2/2     Running   0          45s     10.244.2.74   kind-worker    <none>           <none>
vote-bot-6cf96b6b76-nnql2   2/2     Running   0          3m24s   10.244.2.73   kind-worker    <none>           <none>
voting-79f4b5ccd4-5lt2f     2/2     Running   0          3m24s   10.244.2.72   kind-worker    <none>           <none>
web-66689dbb48-6mfw2        2/2     Running   0          3m24s   10.244.1.94   kind-worker2   <none>           <none>
  • Based on our pod state, we have:
    • 3 regional -- the emoji pods
    • 2 local -- the emoji pods on kind-worker2

The pods are local and regional relative to the source pod web-66689dbb48-6mfw2 running on kind-worker2.
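Below is a rough sketch of the per-key matching that produces this 3-regional / 2-local split. It follows the Kubernetes topology-keys semantics this feature mirrors: walk the service's topologyKeys in order, keep the endpoints whose value for that key matches the source node's value, and stop at the first key that yields a non-empty set ("*" matches everything). Function and parameter names are illustrative rather than the actual destination-service code.

```go
// endpoints maps an endpoint ID to its topology labels (hostname, zone, region, ...);
// srcNode holds the corresponding labels of the node the source pod runs on.
func filterByTopology(endpoints map[string]map[string]string, srcNode map[string]string, topologyKeys []string) []string {
	for _, key := range topologyKeys {
		if key == "*" {
			all := make([]string, 0, len(endpoints))
			for id := range endpoints {
				all = append(all, id)
			}
			return all // "*" matches every endpoint
		}
		var matches []string
		for id, topo := range endpoints {
			if topo[key] != "" && topo[key] == srcNode[key] {
				matches = append(matches, id)
			}
		}
		if len(matches) > 0 {
			return matches // first key with any matches wins
		}
	}
	return nil // no key matched: nothing routable under this preference
}
```

With the pod state above, a source on kind-worker2 and only kubernetes.io/hostname as a key returns the two emoji pods on kind-worker2, while topology.kubernetes.io/region returns all three.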

From no_pref to [region] --> because nothing had ever been filtered before, we re-send an Add here. I think the easiest way to fix this would be to add the set to the filterSnapshot when we don't have a topology preference. Let me know and I will add it in asap if this extra Add is not preferred.

time="2020-08-13T13:53:09Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:53:09Z" level=debug msg="Filtering through address set with preference [topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:53:09Z" level=debug msg="Filtered 3 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:53:09Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763297}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-b79947575-b7xb2\"}  metric_labels:{key:\"pod_template_hash\"  value:\"b79947575\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  addrs:{addr:{ip:{ipv4:183763295}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-b79947575-2c54l\"}  metric_labels:{key:\"pod_template_hash\"  value:\"b79947575\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  addrs:{addr:{ip:{ipv4:183763530}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-b79947575-f9s65\"}  metric_labels:{key:\"pod_template_hash\"  value:\"b79947575\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"

From [region] to [local] --> should have 1 pod removed

# See caveat (first have to go region - empty)
time="2020-08-13T13:53:34Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:53:34Z" level=debug msg="Sending destination add: add:{}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080" 
# (no diff) all 3 pods were regional, when we go to empty we don't add anything new

# edit svc and go to [local] pods
time="2020-08-13T13:54:05Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:54:05Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:54:05Z" level=debug msg="Filtered 2 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:54:05Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763530}  port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# removed the extra pod that was region but not local!

From [local] to no_pref --> when we switched to local we had 2 pods left; we should now have all 3 again

time="2020-08-13T13:54:49Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:54:49Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763530}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"emoji\"}  metric_labels:{key:\"pod\"  value:\"emoji-b79947575-f9s65\"}  metric_labels:{key:\"pod_template_hash\"  value:\"b79947575\"}  metric_labels:{key:\"serviceaccount\"  value:\"emoji\"}  tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"

Let me know what you think! :)

In order to implement service topology, the destination service needs to filter all
available endpoints of a service based on its topological preference. This PR introduces
endpoint filtering in the translator component of the destination service. To achieve this,
the endpoint translator (listener) has been made stateful. Based on the topology preference
of a service, its endpoints are filtered; updates are propagated through a diff-based API
on the translator side, similar to how addresses are handled in the EndpointsWatcher.

Fixes #4503

Signed-off-by: Matei David <[email protected]>
mateiidavid commented Aug 14, 2020

@adleong resolved the comments but will reply here, just because there are some other things to consider when removing that check from inside the Add and Remove functions.

When we move the len(set.TopologyPref) check into the filterAddresses() method, we introduce a couple of side effects.

  • first, if we return et.availableEndpoints we are essentially returning a reference to all available endpoints, which in turn gets assigned to filterSnapshot. This means that when we later have to remove an address while no topological preference is set, the previously filtered snapshot will no longer include the address we want to remove, so the change will never actually be propagated. It's easier to demonstrate with an example:

suppose we have a set {A,B,C} of available endpoints, with no topological preference. This means that the following will be true:

     total = {A,B,C}  (Struct addr A1234)
     filterSnapshot = {A,B,C} (Same addr as total, A1234)

the addresses (memory addresses, not IPs) are the same because we returned et.availableEndpoints when we did the len check => filtered = total;
the filterSnapshot, at the end of the filtering function, is replaced with the new set of filtered endpoints => snapshot = filtered = total.

when we Remove(C), the following will happen:

    total = {A, B} (we remove C)
    filterSnapshot = {A,B} (same address as total, any change in total = change here)
    filter = {A, B} (`et.availableEndpoints`)
    => C is removed from the destination service's available endpoints, but the change is never propagated to the proxy: there is no diff

Dealing with this is not particularly hard, but it comes at the expense of having to copy the map of addresses every time we don't use a topological preference. The advantage is that the code looks much cleaner now: more lines were removed from both Add(set) and Remove(set) in favour of replacing the topology preference of et.availableEndpoints in the filtered update method. (A short illustration of the aliasing issue follows this list.)

tl;dr code looks nice and clean but we have to copy the map of addresses.

  • second, I realised we never replaced the labels in the diff: we were returning only the AddressSets with addresses, so we never received any metricLabels. Also, we never overwrote the labels of the et.availableEndpoints set with the ones from the new set; this is now done in the filtered update. Crisis averted :)

  • third, going this route meant I had to change some tests. The endpoint_translator is truly stateful now, and this had to be reflected in the tests. For example, we can't remove an endpoint that was not there to begin with; the expected values and arguments to some functions had to be changed to reflect the diff-based API.
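A tiny self-contained illustration of the aliasing problem from the first point (stand-in types, not the real ones): assigning an AddressSet copies the struct, but the Addresses map inside it is shared, so removing C from the available endpoints silently removes it from the snapshot as well and the diff comes out empty.

```go
package main

import "fmt"

type Address struct{ IP string }

type AddressSet struct {
	Addresses map[string]Address
}

func main() {
	total := AddressSet{Addresses: map[string]Address{
		"A": {IP: "10.0.0.1"}, "B": {IP: "10.0.0.2"}, "C": {IP: "10.0.0.3"},
	}}
	snapshot := total // copies the struct header, but Addresses is the same map

	delete(total.Addresses, "C") // Remove(C) from availableEndpoints...
	_, stillThere := snapshot.Addresses["C"]
	fmt.Println(stillThere) // false: the snapshot "lost" C as well, so diffing the
	// snapshot against the available endpoints yields nothing and the removal
	// never reaches the proxy
}
```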

My conclusion is that although we pay a slight performance cost, the API is more unified and consistent, and the tests better reflect what is actually done. Before, for example, the tests did not reveal that the diff did not include the metric labels. The code also looks cleaner.

}

func (et *endpointTranslator) sendFilteredUpdate(set watcher.AddressSet) {
	et.availableEndpoints = watcher.AddressSet{
mateiidavid (Member Author):

would it be more concise here to write

et.availableEndpoints.Labels, et.availableEndpoints.TopologicalPref = set.Labels, set.TopologicalPref

instead of creating a new struct?

Member:

I don't think it makes much of a difference either way. I think the way it's written right now is slightly clearer.

Update tests to reflect diff based API

Copy over metric labels to availEndpoints and to diffedAddresses

Signed-off-by: Matei David <[email protected]>
mateiidavid requested a review from adleong August 17, 2020 16:47
adleong (Member) left a comment:

This looks great to me! Nice work.

zaharidichev (Member) left a comment:

This works well! Fantastic piece of work @Matei207 !

adleong merged commit f797ab1 into linkerd:main Aug 18, 2020
adleong pushed a commit that referenced this pull request Aug 24, 2020
## What/How
@adleong pointed out in #4780 that when enabling slices during an upgrade, the new value does not persist in the `linkerd-config` ConfigMap. I took a closer look and it seems that we never overwrote the stored values when they differed.

* To fix this, I added an if block when validating and building the upgrade options -- if the current flag value differs from what we have in the ConfigMap, the ConfigMap value is changed (a rough sketch follows after this commit message).
* When doing so, I made sure to check that if the cluster does not support `EndpointSlices` but the flag is set to true, we error out. This is done similarly (essentially copy & paste) to what's done in the install path.
* Additionally, I noticed that the helm ConfigMap template stored the flag value under the `enableEndpointSlices` field name. I assume this was not updated in the initial PR to reflect the changes made in the protocol buffer. The API (and thus the CLI) uses the field name `endpointSliceEnabled` instead. I have changed the config template so that helm installations use the same field, which can then be used in the destination service or other components that may implement slice support in the future.

Signed-off-by: Matei David <[email protected]>
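As a rough, hypothetical sketch of the persistence check described in the first bullet above (the option, config, and helper names are made up and not the actual linkerd CLI code):

```go
package cmd

import "errors"

// reconcileSliceFlag decides which value should end up in the linkerd-config ConfigMap
// after an upgrade. flagValue is what the user passed on the CLI, storedValue is what
// the ConfigMap currently holds, and clusterSupportsSlices reflects whether the API
// server serves the EndpointSlice resource.
func reconcileSliceFlag(flagValue, storedValue, clusterSupportsSlices bool) (bool, error) {
	if flagValue == storedValue {
		return storedValue, nil // nothing changed, nothing to overwrite
	}
	if flagValue && !clusterSupportsSlices {
		// Mirror the install-time validation: refuse to enable slices on a cluster
		// that cannot serve them.
		return false, errors.New("the EndpointSlice API is not available in this cluster")
	}
	return flagValue, nil // persist the new flag value in the ConfigMap
}
```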