service topologies: topology-aware service routing #4780
Conversation
This looks good to me, but I'd like to see it in the context of the full topology-aware routing feature. It's a small enough change that we might consider just rolling it into a more feature-complete PR rather than reviewing and merging it on its own. What do you think?
I think this sounds perfect. I've marked this PR as a draft; once #4771 is merged I can start adding more to this PR and discuss things further.
Updated! @adleong if you'd be so kind 😄 Also, unit tests are failing because I updated RBAC for the destination service to be able to get
Nice to see this coming together!
I have a suggestion for a restructuring of this that I think could simplify things. What if, instead:
- The `endpointTranslator` struct holds the topology map of the source node. In other words, `newEndpointTranslator` would add a k8s client parameter and a nodeName parameter, look up the topology label map for that node, and store the resulting map in the `endpointTranslator` struct. I think it is a fair assumption that these node labels will not change during the lifetime of the node. (A sketch of this constructor shape follows below.)
- Add the array of topology key preferences to the `AddressSet` struct. I think this makes sense because these preferences come from the service, and this will make our life easier.
- This allows the endpoints watcher to be totally unaware of the filtering, and the endpoint translator has all the information it needs to do filtering inline. It knows the source node's topology values, it knows the topology key preference from the added `AddressSet`, and it knows the topology values for each address.
- This lets us get rid of the `TopologyFilter` type and the `AddTopologyFilter` function.
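A minimal sketch of what that constructor shape could look like, assuming a recent client-go; the parameter list, field names, and error handling here are illustrative, not the actual linkerd2 signature:

```go
package destination

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointTranslator is a simplified stand-in for the real struct; only
// the field relevant to this suggestion is shown.
type endpointTranslator struct {
	nodeTopologyLabels map[string]string
}

// newEndpointTranslator looks up the source node once and caches its
// topology labels, assuming they are stable for the node's lifetime.
func newEndpointTranslator(k8sAPI kubernetes.Interface, nodeName string) (*endpointTranslator, error) {
	node, err := k8sAPI.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return &endpointTranslator{nodeTopologyLabels: node.Labels}, nil
}
```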
Here's where things get complicated, though:
Adding filtering introduces some new complexity to our diff-based (Add, Remove) API. Let me illustrate with an example: suppose that you're subscribed to a service with 1 endpoint local to your node and 3 endpoints on other nodes in your zone. If your topology preference says that you prefer node then zone, then the Destination API will return you an `Add` with the 1 node-local endpoint. If that endpoint is later deleted, however, the endpoint watcher will call `Remove` to remove this endpoint, but we also need to somehow trigger an `Add` of the 3 zone-local endpoints in order to fall back to these once no node-local endpoints exist.
This is tricky. One approach would be to have the endpoint translator become more stateful and keep track of the total set of endpoints it knows about AND the filtered set of endpoints it has told the client about. When it receives an update (an Add, Remove, or NoEndpoints), it does the following:
- Updates the total set of endpoints based on this update
- Using the topology preference, calculates the filtered set of endpoints from the total set
- Diffs the calculated filtered set against the previously stored filtered set
- Transmits this diff to the client and updates the stored filtered set
Let me know what you think.
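For illustration, here is a minimal, self-contained sketch of that four-step flow; the types are simplified stand-ins for the real `watcher.AddressSet` and translator types:

```go
package translatorsketch

// Simplified stand-ins for the real watcher/translator types.
type Address struct{ IP string }
type AddressSet map[string]Address

type endpointTranslator struct {
	available AddressSet                  // every endpoint the watcher has reported
	sent      AddressSet                  // the filtered view already sent to the client
	filter    func(AddressSet) AddressSet // topology-preference filter
}

// Add folds an update into the total set, recomputes the filtered view,
// and returns the diff to transmit. Remove would follow the same steps,
// deleting from et.available first instead.
func (et *endpointTranslator) Add(update AddressSet) (added, removed AddressSet) {
	for id, a := range update {
		et.available[id] = a // 1. update the total set
	}
	filtered := et.filter(et.available) // 2. filter using the topology preference
	added, removed = AddressSet{}, AddressSet{}
	for id, a := range filtered { // 3a. diff: newly visible endpoints
		if _, ok := et.sent[id]; !ok {
			added[id] = a
		}
	}
	for id, a := range et.sent { // 3b. diff: endpoints that dropped out
		if _, ok := filtered[id]; !ok {
			removed[id] = a
		}
	}
	et.sent = filtered // 4. remember what the client now knows
	return added, removed
}
```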
```go
}

// If we reach this point it means we either have no satisfied topologies
// or we encountered the wildcard. Return in both cases what we previously had
```
I think if we reach the end of the list but do not encounter the wildcard, we should return NoEndpoints rather than the full set.
If this field is specified and all entries have no backends that match the topology of the client, the service has no backends for that client and connections should fail.
That's a very valid point of view. I thought similarly, but I didn't see any easy way of sending a `NoEndpoints` message, since the filtering happens in `Add` based on how I have done things. I think in this scenario we could either have the filtering done before calling `Add`, or return an empty set from the filtering method. Hope I didn't miss anything.
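A tiny sketch of the second option — all helper names here are hypothetical, not the PR's actual code — where the caller checks the filter result and converts an empty set into a NoEndpoints message:

```go
package translatorsketch

type AddressSet map[string]struct{ IP string }

// sendUpdate filters first, then translates an empty result into a
// NoEndpoints message; sendNoEndpoints/sendAdd are hypothetical callbacks.
func sendUpdate(filter func(AddressSet) AddressSet, set AddressSet,
	sendNoEndpoints func(), sendAdd func(AddressSet)) {
	filtered := filter(set)
	if len(filtered) == 0 {
		sendNoEndpoints() // no backends match the client's topology
		return
	}
	sendAdd(filtered)
}
```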
controller/api/destination/server.go (Outdated)
```go
@@ -109,7 +109,13 @@ func (s *server) Get(dest *pb.GetDestination, stream pb.Destination_GetServer) e
	return status.Errorf(codes.InvalidArgument, "Invalid authority: %s", dest.GetPath())
}

var token contextToken
```
If we move this block up, we can pass the node name into `newEndpointTranslator`.
Sounds great!
```go
@@ -49,6 +51,7 @@ type (
	OwnerKind         string
	Identity          string
	AuthorityOverride string
	Topology          map[string]string
```
Topology is a slightly weird name for this field since these are more like topology coordinates. Maybe "TopologyLabels"?
Agreed, will switch it around.
@adleong thank you for your thoroughness, as always. Your observations are excellent and I don't have much to add to them.
I thought of the same approach at first but decided to go with the filter structure. I didn't know if it would be preferred to make another client and keep the node information in the translator -- I took it as a separation-of-concerns sort of situation, where the listener should not know anything about the k8s API (including the clientset and the idea of a node) and should just deal with
Precisely. I think the only way this would be affected is if the labels are manually changed.
...I completely missed this 🤕. An amazing suggestion I think.
I think I really missed how simply this could have been done, and I for sure needlessly complicated everything.
Hm, this is a design choice that we haven't really covered. I think my initial assumption was a bit different.
I suppose the second approach has the advantage of fewer side effects since we know the preference that has been satisfied will be enforced throughout the duration of the pod's life (as an ephemeral object this shouldn't an issue, pods are short-lived). This invites a bit more interaction from the operator; in the case that the first preference has endpoints come up you have to delete the (source) pods to make the subscription work again. This also means less code complexity and (what I completely assume to be) better performance (??). The simplicity here stems from having to deal with no alternatives -- for example, what happens if in the first instance we fallback to our second choice but then the first choice magically has a pod spin up -- do we terminate the connection and switch back to the first preference? The first approach that you have described is what I like best. It comes closer to what I thought this feature to be. Would we only fallback or also switch back to the first preference (like described above)? I think the main advantage of this is that it provides actual topology-aware routing, fallback mechanisms and just overall makes everything better and more fun. The main downside I can think of is having the listener be more stateful, there are more things that can potentially go wrong. This is all just at first glance, I'm operating on surface knowledge here so not sure if I'm right about it, but thought it'd be cool to talk about it.
To be honest, I think this is a very good way to do it. I will make all of the changes so you can see how it would look like in practice and we can take it from there? Perhaps more things will become more apparent once the code and my inconsistencies are out in public view, heh. |
@adleong (and whoever else feels like going through this PR) I have compiled a list of basic scenarios to ensure locality and fallback functionality are enforced when filtering. I have also uncovered a nasty bug where, on update, slices wouldn't have any topology preferences assigned to them; the issue was that the address set built on update did not carry the preferences.

Tests:

When testing, I sought to:
For these test scenarios, my preference has been one of the three:
# add this to deployment (I used emoji-svc which is accessed by web pods)
# affinity to force local, replace kind-worker2 with your preferred node
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- kind-worker2
# affinity to allow local + region
# replace kind-worker2 with regional node
# replace kind-worker with local node
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- kind-worker
weight: 1
- preference:
matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- kind-worker2
weight: 99
# Note regional and local are according to your web pod(s) if using emoji svc

Test scenarios

Based on the above, here are my scenarios (along with logs):
time="2020-08-05T16:08:03Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.1.68] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.1.70] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:08:03Z" level=debug msg="Filtered 2 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
*Send dest add*
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-788fdcd8-2h846 2/2 Running 0 63m 10.244.2.74 kind-worker <none> <none>
emoji-788fdcd8-fcmss 2/2 Running 0 65m 10.244.1.70 kind-worker2 <none> <none>
emoji-788fdcd8-j5wqv 2/2 Running 0 65m 10.244.1.68 kind-worker2 <none> <none>

1.b remove a local endpoint from {A, B, C}

time="2020-08-05T16:11:03Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Considering address [10.244.1.70] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Filtered 1 from a total of 2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:11:03Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763268} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# (Note: filtered 1 from a total of 2 because at the start, we remove B from the list of total endpoints which is used for filtering)

1.c add a new local endpoint to {A,C}

time="2020-08-05T16:19:18Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:19:18Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker2" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.73] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.74] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.2.74] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Considering address [10.244.1.72] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:east topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Filtered 3 from a total of 4" addr=":8086" component=endpoint-translator remote="127.0.0.1:58348" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:19:18Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763274} port:8080}
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-785f4cbb66-g29kx 2/2 Running 0 47s 10.244.1.72 kind-worker2 <none> <none>
emoji-785f4cbb66-hmkxr 2/2 Running 0 38s 10.244.1.73 kind-worker2 <none> <none>
emoji-785f4cbb66-sbjpz 2/2 Running 0 26s 10.244.1.74 kind-worker2 <none> <none>
emoji-788fdcd8-2h846 2/2 Terminating 0 74m 10.244.2.74 kind-worker <none> <none>
emoji-788fdcd8-fcmss 1/2 Terminating 0 76m 10.244.1.70 kind-worker2 <none> <none>
# (Note: for the last one I used affinity, which is why we had so many pods; without affinity the pod kept popping up on worker2)
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-78ffc6d99f-2zmdp 2/2 Running 0 3m5s 10.244.1.80 kind-worker2 <none> <none>
emoji-78ffc6d99f-548pg 2/2 Running 0 3m17s 10.244.2.81 kind-worker <none> <none>
emoji-78ffc6d99f-86wq6 2/2 Running 0 2m51s 10.244.1.81 kind-worker2 <none> <none>
vote-bot-6b5f6d9b8f-xbpms 2/2 Running 0 112m 10.244.1.67 kind-worker2 <none> <none>
voting-755f879fd7-99tpt 2/2 Running 0 112m 10.244.1.69 kind-worker2 <none> <none>
web-76868b75cb-m8594 2/2 Running 0 13m 10.244.2.79 kind-worker <none> <none>
time="2020-08-05T16:56:18Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:56:18Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Preferring endpoints with same topology topology.kubernetes.io/region as source west" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Filtered 2 from a total of 2" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763281} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-86wq6\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763280} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-2zmdp\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:18Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763537} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080" 2.b add new local endpoint and switch traffic to local instead of region in {B: region, C: region} time="2020-08-05T16:56:33Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T16:56:33Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Considering address [10.244.2.82] with topology labels [map[kubernetes.io/hostname:kind-worker topology.kubernetes.io/region:west topology.kubernetes.io/zone:west-1a]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Filtered 1 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763538} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-nt79c\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T16:56:33Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763280} port:8080} addrs:{ip:{ipv4:183763281} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-78ffc6d99f-2zmdp 2/2 Running 0 6m12s 10.244.1.80 kind-worker2 <none> <none>
emoji-78ffc6d99f-86wq6 2/2 Running 0 5m58s 10.244.1.81 kind-worker2 <none> <none>
emoji-78ffc6d99f-nt79c 2/2 Running 0 2m44s 10.244.2.82 kind-worker <none> <none>
vote-bot-6b5f6d9b8f-xbpms 2/2 Running 0 115m 10.244.1.67 kind-worker2 <none> <none>
voting-755f879fd7-99tpt 2/2 Running 0 115m 10.244.1.69 kind-worker2 <none> <none>
web-76868b75cb-m8594 2/2 Running 0 16m 10.244.2.79 kind-worker <none> <none>
Update:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-78ffc6d99f-2zmdp 2/2 Running 0 24m 10.244.1.80 kind-worker2 <none> <none>
emoji-78ffc6d99f-6f8dj 2/2 Running 0 2m13s 10.244.2.84 kind-worker <none> <none>
emoji-78ffc6d99f-86wq6 2/2 Running 0 23m 10.244.1.81 kind-worker2 <none> <none>
vote-bot-6b5f6d9b8f-xbpms 2/2 Running 0 133m 10.244.1.67 kind-worker2 <none> <none>
voting-755f879fd7-99tpt 2/2 Running 0 133m 10.244.1.69 kind-worker2 <none> <none>
web-76868b75cb-m8594 2/2 Running 0 34m 10.244.2.79 kind-worker <none> <none>
time="2020-08-05T17:17:41Z" level=debug msg="Updating EndpointSlice for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-05T17:17:41Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname *]" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Preferring endpoints with same topology kubernetes.io/hostname as source kind-worker" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Considering address [10.244.1.80] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Considering address [10.244.1.81] with topology labels [map[kubernetes.io/hostname:kind-worker2 topology.kubernetes.io/region:west topology.kubernetes.io/zone:east-1c]] for filtering" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763280} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-2zmdp\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763281} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-78ffc6d99f-86wq6\"} metric_labels:{key:\"pod_template_hash\" value:\"78ffc6d99f\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-05T17:17:41Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763540} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:38562" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# Note: no debug message for when we choose *; this works. Removed the deleted pod and added 2 pods (both on kind-worker2)
This looks fantastic. Going to do some manual testing...
@adleong resolved feedback comments, changed some unmentioned code comments for clarity, and moved function blocks around to make the file more readable. Let me know what you think.
Really fantastic work! Great feedback from @adleong. Will do some manual testing as well. +1 for the testing instructions!
Update 11th of August: I have updated the test fixtures so unit tests can pass (modified RBAC). I spoke briefly to @zaharidichev today and thought it would be a good idea to get this out of the draft stage. @adleong and @zaharidichev, please let me know when you have time how the testing went and if there is anything that should be changed for this PR!
It doesn't seem like endpoint slices can be enabled during an upgrade:
After playing with this for a little bit, it seems to work great! The only issue I was able to find was that if you update a service and change its topology preference, that change doesn't take effect. I think this is because when we update a service in the endpointsWatcher, we don't propagate the topology preferences from the service to the current
@adleong thank you for both of your comments! I have opened up a PR to resolve the upgrade issue. For the service update in the watcher, I will have a closer look and try to solve it ASAP. I made the mistake of thinking that when we update a service we automatically update all ports (which in turn re-adds all of the slices => new pp.addresses). I was wrong, though, because the updates happen only when the port is changed in the service. I suppose we would need to send an empty Add to re-trigger the filtering rather than update the preference? Either way, I think sending an empty Add with the updated preference is a less expensive operation than re-adding all of the slices.
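A hedged sketch of that idea -- the type shapes and names below are hypothetical stand-ins, not the committed code: on a service update, store the new preference and push an empty `Add` so each stateful translator re-runs filtering against the addresses it already knows.

```go
package translatorsketch

type AddressSet struct {
	Addresses          map[string]struct{ IP string }
	TopologyPreference []string
}

type Listener interface{ Add(AddressSet) }

type portPublisher struct {
	addresses AddressSet
	listeners []Listener
}

// updateTopologyPreference records the service's new preference and sends
// an empty diff; the translators recompute their filtered view from it.
func (pp *portPublisher) updateTopologyPreference(keys []string) {
	pp.addresses.TopologyPreference = keys
	empty := AddressSet{Addresses: map[string]struct{ IP string }{}, TopologyPreference: keys}
	for _, l := range pp.listeners {
		l.Add(empty)
	}
}
```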
@adleong @zaharidichev I have pushed some new changes to fix the issue with services not updating. I would appreciate it if you could give me some feedback on the code quality and style, as well as run through some test scenarios to ensure this works as expected.

Changes

I have made two or three changes that I will detail here, just so you are aware of what my rationale has been.
Tests:

```
A copy of your changes has been stored to "/var/folders/4j/vjq4hnmn6tlf89n_kl55c9zh0000gn/T/kubectl-edit-wv0fo.yaml"
error: unable to find api field in struct ServiceSpec for the json field "topologyKeys"
```

Not sure if this is because I edited with `kubectl edit`.
# Pods
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
emoji-b79947575-2c54l 2/2 Running 0 3m24s 10.244.1.95 kind-worker2 <none> <none>
emoji-b79947575-b7xb2 2/2 Running 0 2m52s 10.244.1.97 kind-worker2 <none> <none>
emoji-b79947575-f9s65 2/2 Running 0 45s 10.244.2.74 kind-worker <none> <none>
vote-bot-6cf96b6b76-nnql2 2/2 Running 0 3m24s 10.244.2.73 kind-worker <none> <none>
voting-79f4b5ccd4-5lt2f 2/2 Running 0 3m24s 10.244.2.72 kind-worker <none> <none>
web-66689dbb48-6mfw2 2/2 Running 0 3m24s 10.244.1.94 kind-worker2 <none> <none>
The pods are local and regional to the source pod.
time="2020-08-13T13:53:09Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:53:09Z" level=debug msg="Filtering through address set with preference [topology.kubernetes.io/region]" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:53:09Z" level=debug msg="Filtered 3 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:53:09Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763297} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-b79947575-b7xb2\"} metric_labels:{key:\"pod_template_hash\" value:\"b79947575\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763295} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-b79947575-2c54l\"} metric_labels:{key:\"pod_template_hash\" value:\"b79947575\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} addrs:{addr:{ip:{ipv4:183763530} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-b79947575-f9s65\"} metric_labels:{key:\"pod_template_hash\" value:\"b79947575\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# See caveat (first have to go region - empty)
time="2020-08-13T13:53:34Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:53:34Z" level=debug msg="Sending destination add: add:{}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# (no diff) all 3 pods were regional, when we go to empty we don't add anything new
# edit svc and go to [local] pods
time="2020-08-13T13:54:05Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:54:05Z" level=debug msg="Filtering through address set with preference [kubernetes.io/hostname]" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:54:05Z" level=debug msg="Filtered 2 from a total of 3" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
time="2020-08-13T13:54:05Z" level=debug msg="Sending destination remove: remove:{addrs:{ip:{ipv4:183763530} port:8080}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080"
# removed the extra pod that was region but not local!
time="2020-08-13T13:54:49Z" level=debug msg="Updating service for emojivoto/emoji-svc" addr=":8086" component=service-publisher ns=emojivoto svc=emoji-svc
time="2020-08-13T13:54:49Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183763530} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"emoji\"} metric_labels:{key:\"pod\" value:\"emoji-b79947575-f9s65\"} metric_labels:{key:\"pod_template_hash\" value:\"b79947575\"} metric_labels:{key:\"serviceaccount\" value:\"emoji\"} tls_identity:{dns_like_identity:{name:\"emoji.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}}}" addr=":8086" component=endpoint-translator remote="127.0.0.1:60626" service="emoji-svc.emojivoto.svc.cluster.local:8080" Let me know what you think! :) |
In order to implement service topology, the destination service needs to filter all available endpoints of a service based on its topological preference. This PR introduces endpoint filtering in the translator component of the destination service. To achieve this, the endpoint translator (listener) has now been made stateful. Based on the topology preference of a service, its endpoints are filtered; updates are propagated through a diff-based API on the translator side, similar to how addresses are handled in the EndpointsWatcher.

Fixes #4503

Signed-off-by: Matei David <[email protected]>
@adleong resolved the comments, but will reply here because there are some other things to consider when removing that check. When we move it:
suppose we have a set {A,B,C} of available endpoints, with no topological preference. This means that the following will be true:
the addresses (memory address, not IP) are the same because we returned the set unmodified. When we Remove(C), the following will happen:
Dealing with this is not particularly hard, but it comes at the expense of having to copy the map of addresses every time we don't use a topological preference. The advantage is that the code looks much cleaner now, and more lines were removed from both files.

tl;dr: the code looks nice and clean, but we have to copy the map of addresses.
My conclusion is that although we have a slight performance cost, the API is more unified and consistent, and the tests better reflect what is done. Before, the tests did not point out that the diff did not include the metric labels, for example. The code also looks cleaner.
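To make the aliasing hazard concrete, here is a minimal, self-contained illustration under the simplified assumption that addresses live in a plain map keyed by ID: returning the input map unmodified makes the "available" and "sent" views share storage, so a later delete mutates both; copying fixes it at the cost described.

```go
package translatorsketch

// filterNoPreference returns an independent copy so that callers can
// delete from one view without mutating the other. Returning `in`
// directly would alias the two maps.
func filterNoPreference(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for id, addr := range in {
		out[id] = addr // copy so each caller owns its own map
	}
	return out
}
```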
```go
}

func (et *endpointTranslator) sendFilteredUpdate(set watcher.AddressSet) {
	et.availableEndpoints = watcher.AddressSet{
```
Would it be more concise here to write `et.availableEndpoints.Labels, et.availableEndpoints.TopologicalPref = set.Labels, set.TopologicalPref` instead of creating a new struct?
I don't think it makes much of a difference either way. I think the way it's written right now is slightly clearer.
Update tests to reflect diff-based API
Copy over metric labels to availEndpoints and to diffedAddresses

Signed-off-by: Matei David <[email protected]>
This looks great to me! Nice work.
This works well! Fantastic piece of work, @Matei207!
## What/How

@adleong pointed out in #4780 that when enabling slices during an upgrade, the new value does not persist in the `linkerd-config` ConfigMap. I took a closer look and it seems that we were never overwriting the values in case they were different.

* To fix this, I added an if block when validating and building the upgrade options -- if the current flag value differs from what we have in the ConfigMap, then change the ConfigMap value.
* When doing so, I made sure to check that if the cluster does not support `EndpointSlices` yet the flag is set to true, we will error out. This is done similarly (copy & paste similarly) to what's in the install part.
* Additionally, I have noticed that the Helm ConfigMap template stored the flag value under the `enableEndpointSlices` field name. I assume this was not changed in the initial PR to reflect the changes made in the protocol buffer. The API (and thus the CLI) uses the field name `endpointSliceEnabled` instead. I have changed the config template so that Helm installations will use the same field, which can then be used in the destination service or other components that may implement slice support in the future.

Signed-off-by: Matei David <[email protected]>
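A hedged sketch of that reconciliation logic; the function and parameter names are illustrative stand-ins for the actual linkerd2 upgrade code:

```go
package upgradesketch

import "errors"

// reconcileSliceFlag returns the value to persist in linkerd-config.
// It overwrites the stored value when the flag differs, but refuses to
// enable slices on a cluster that cannot serve the EndpointSlice API.
func reconcileSliceFlag(flagValue, storedValue, clusterSupportsSlices bool) (bool, error) {
	if flagValue == storedValue {
		return storedValue, nil // nothing to overwrite
	}
	if flagValue && !clusterSupportsSlices {
		return false, errors.New("EndpointSlice API not supported in this cluster")
	}
	return flagValue, nil // persist the new value into the ConfigMap
}
```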
Link to RFC
Updated (29 Jul): Initially, this PR was supposed to be only for adding topology preference to `servicePublisher` and for collecting topology information from slices. However, the PR (now marked as a draft) covers the prioritisation of endpoints for topology-aware service routing.

What

- Collect topology `<k,v>` pairs for endpoints.
- Add filtering of endpoints based on the topology preference of the service, the topology `<k,v>` pairs of endpoints, and the topology of the source (again, `<k,v>` pairs).

How
Collecting metadata:

Services do not have values for topology keys -- the topological keys defined in a service's spec are only there to dictate locality preference for routing. As such, I decided to store them in an array; they will be taken exactly as they are found in the service spec, which ensures we respect the preference order.

For EndpointSlices, we are using a map -- an EndpointSlice has locality information in the form of a `<k,v>` pair, where the key is a topological key (similar to what's listed in the service) and the value is the locality information -- e.g. `hostname: minikube`. For each address we now have a map of topology values which gets populated when we translate the endpoints to an address set. Because normal Endpoints do not have any topology information, we create each address with an empty map, which is subsequently populated ONLY for slices in the `endpointSliceToAddressSet` function.

Filtering endpoints:
This was a tricky part and filled me with doubts. I think there are a few ways to do this, but this is how I "envisioned" it. First, `endpoint_translator.go` should be the one to do the filtering; this means that on subscription, we need to feed all of the relevant metadata to the listener. To do this, I created a new function `AddTopologyFilter` as part of the listener interface.

To complement the `AddTopologyFilter` function, I created a new `TopologyFilter` struct in `endpoints_watcher.go`. I then embedded this structure in all listeners that implement the interface. The structure holds the source topology (source node), a boolean to tell if slices are activated in case we need to double-check (or write tests for the function), and the service preference. We create the filter on Subscribe -- we have access to the k8s client there as well as the service, so it's the best point to collect all of this data together. Addresses all have their own topology added to them, so it does not have to be collected by the filter.

When we add a new set of addresses, we check to see if slices are enabled -- chances are that if slices are enabled, service topology might be too. This lets us skip this step if the latest version is not adopted. Prior to sending an `Add`, we filter the endpoints -- if a preference is registered by the filter we strictly enforce it, otherwise nothing changes.

And that's pretty much it.
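A hedged sketch of the shapes described above; the field and method names are illustrative reconstructions rather than the exact committed code:

```go
package watchersketch

// TopologyFilter collects everything the listener needs in order to
// filter: the source node's topology, whether slices (and thus topology)
// are enabled, and the service's ordered key preference.
type TopologyFilter struct {
	SourceTopology map[string]string // e.g. {"kubernetes.io/hostname": "kind-worker2"}
	SlicesEnabled  bool              // lets callers (and tests) short-circuit
	Preference     []string          // ordered topology keys from the service spec
}

// EndpointUpdateListener mirrors the interface extension described above;
// the pre-existing methods (Add, Remove, NoEndpoints, ...) are elided.
type EndpointUpdateListener interface {
	AddTopologyFilter(f *TopologyFilter)
}
```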
Points of concern
I have been thinking about this for a while. I think this architectural choice should work fine, but is there any cause for concern when it comes to scaling? What would happen when a cluster has 20 services with 5 EndpointSlices each; would adding the extra metadata (the topology map) degrade performance? I would appreciate some rigorous feedback in this area to figure out if this is the right choice.
Again, related to the architecture, I am not sure what other ways we have to collect all of the data together, other than having a dedicated structure or storing the data in the listener structure. There are a couple of pain points here: we could pass everything as an argument to `Add`, but there is no point where we both send an `Add` AND have access to the k8s client and the service in question. We could initialise a client and get all of this data, but I find that it would slow everything down. Doing everything in `Subscribe` seems to me the simpler and more efficient way, but as always I look forward to finding a better and cleaner way to do this.

Finally, I left some scattered comments throughout the PR -- this is intentional; there are alternative concerns over performance or behaviour that I have. I will clean them up when the draft is in a good state; feel free to ignore them or debate them. I wanted to show what my reasoning is and to ask indirect questions of the reviewer.

At the moment, if a service changes its topology, in order for it to be enforced all pods have to be re-subscribed (so, deleted). This isn't good (?).
Tests
Testing is not very straightforward, so I'll detail how I tested everything. I'm collating more cases; for now I have just tested that the preference is enforced.
To test this, minikube won't help much -- we need at least two nodes with different hostnames, regions, and zones. Kind was the best option for me, since it's bootstrapped with `kubeadm` -- you can label each node as you want on creation, so you can test multiple scenarios. You can find my kind config here.

Next, I decided to use `emojivoto` as my test services. `web-svc` feeds into `emoji-svc`, so to test topology I edited the `emoji` deployment: scaled replicas to 2 (I only used 2 nodes; 3+ kind nodes make my laptop take off) -- there is no need for affinity in this case because Kubernetes will automatically spread the pods for resilience. For 3 nodes, scale to 3 replicas, etc. You can use anti-affinity to space the pods out if you wish. Next is adding the topological preference to the service; in `emoji-svc`, add the snippet below:
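The snippet itself was lost in extraction; assuming it used the standard (alpha) `topologyKeys` field of the Service spec, it presumably looked something like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: emoji-svc
  namespace: emojivoto
spec:
  # ...ports/selector unchanged...
  topologyKeys:
    - "kubernetes.io/hostname"
    - "topology.kubernetes.io/region"
    - "*"
```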
["kubernetes.io/hostname", "topology.kubernetes.io/region", "topology.kubernetes.io/zone","*"]
Note: `*` has to go last. You can re-edit topologies! To change the preference you must first delete the topology and then re-write it; this was the case with me, though I edited the service using `kubectl edit`.

Restart the `web` pod to force re-subscriptions. I have added quite a few debug statements so you can observe what's going on -- these will not stay in, but they help to figure out what the call trace looks like. Make sure you start the controllers with debug log level.

For my test, this is what we get for 2 replicas of `emoji` spread across 2 nodes, and 1 `web` pod connecting with a node-local topology preference:

To show it also works as normal with no preference specified, I edited the svc, deleted the preference, and re-started the web pod: