Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy: Serve EgressNetwork responses #13206

Merged
merged 11 commits into from
Oct 27, 2024
Merged

Conversation

zaharidichev
Copy link
Member

@zaharidichev zaharidichev commented Oct 21, 2024

This PR adds a few notable changes associated with the egress functionality of Linkerd:

  • EgressNetwork objects are indexed into the outbound index
  • outbound policy lookups are classfieid as either in-cluster or egress based on the ip:port combination
  • TCPRoute, TLSRoute, GRPCRoute and HTTPRoute attachments are reflected for both EgressNetwork and Service targets
  • the default traffic policy for EgressNetwork is honored by returning the appropriate default (failure/success) routes for all protocols

Note that this PR depends on an unreleased version of the linkerd2-proxy-api repo.

Signed-off-by: Zahari Dichev [email protected]

Signed-off-by: Zahari Dichev <[email protected]>
@zaharidichev zaharidichev requested a review from a team as a code owner October 21, 2024 14:02
@@ -33,7 +33,7 @@ impl Collector for Instrumented {
None,
MetricType::Gauge,
)?;
let service_infos = ConstGauge::new(this.service_info.len() as u32);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will come back and address metrics correctness in a follow-up if that is a non-blocker for now.

pub enum Kind {
EgressNetwork {
original_dst: SocketAddr,
traffic_policy: TrafficPolicy,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised to see TrafficPolicy here in the discover target. I think this means that the TrafficPolicy is looked up once at the beginning of the discovery lookup but then never updated. If the EgressNetwork resource is edited to change the TrafficPolicy, I don't think the discovery watches will see that change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually leads to a more difficult question about EgressNetwork mutability and ip address.

For services, a service's IP address will never change for the life of that services which means we can look up the service by IP once at the beginning of the discovery stream and never again.

For EgressNetworks, on the other hand, they encompass a range of IP addresses and that range can change if the EgressNetwork resource is edited. What happens when a discovery stream is started on an IP address that is in an EgressNetwork, but then the EgressNetwork is edited to no longer include that IP in it's network range? Or vice versa, what if an EgressNetwork is edited to include an IP address that was used for an already started discovery stream?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really seems like we need to take a different approach for EgressNetworks compared to the approach we have for services in the outbound index. Perhaps we can take advantage of the fact that cluster networks are immutable. When a discovery lookup comes in, we can look at if it's in the cluster network and send it to the service index or if it's out of the cluster network, then we can send it to the egress index.

The egress index will likely need to have watches indexed by ip address so that we can update the appropriate watches whenever an EgressNetwork resource changes and may encompass watches that it did not before or stop encompassing watches that it used to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points.. Will think about it a bit more.

Comment on lines 268 to 269
// if we have no explicit routes attached to the parent, always attempt protocol detection
if no_explicit_routes(&outbound) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this compare to the current approach of just starting with the most specific route type and falling back to less and less specific? ie if there are grpcRoutes, use those, else if there are httproutes, use those, else if there are tls routes, use those, else if there are tcp routes, use those, ELSE do http detect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. My approach is essentially the same but more complicated. Will incorporate your suggestion

fn timestamp_then_name<LM, LR, RM, RR>(
(left_id, left_route): &(GroupKindNamespaceName, OutboundRoute<LM, LR>),
(right_id, right_route): &(GroupKindNamespaceName, OutboundRoute<RM, RR>),
fn timestamp_then_name<L: Route, R: Route>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now we never compare routes of different types, so we can simplify this to only use one type parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will enforce, at the type level, that routes cannot be mixed.

Signed-off-by: Zahari Dichev <[email protected]>
@zaharidichev zaharidichev force-pushed the zd/egress-network-policy-3 branch from bf26d69 to 4144b19 Compare October 23, 2024 14:38
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help myself follow all the logic here and see how it all fits together, I drew a picture as I went.
EgressNetwork responses

}
}

fn update_resource(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function needs to accept the traffic policy so that it can update its ParentMeta if the traffic policy changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will add a test as well

Comment on lines 12 to 15
pub enum OutboundPolicyKind {
Fallback(SocketAddr),
Resource(ResourceOutboundPolicy),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a long time to fully understand the mapping of OutboundPolicy to OutboundPolicyKind and then finally to proxy-api protobuf. It seems to me that OutboundPolicy + orig_dst already contains enough information to construct the proxy-api response (since the OutboundPolicy contains ParentMeta which stores TrafficPolicy).

Can we simplify this all by skipping the step of transforming OutboundPolicy into OutboundPolicyKind and instead just directly transform OutboundPolicy into proxy-api response? This will require passing in the orig_dst as supplemental information, but I think that's okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put this another way, OutboundPolicyKind::Fallback and OutboundPolicyKind::Resource will never be co-mingled in a single stream. Fallback will only be created if the OutboundDiscoverTarget is Fallback and Resource will only be created if the OutboundDiscoverTarget is Resource. So I think these can be total separate codepaths.

Comment on lines 34 to 36
// ParentMeta carries information resource-specific
// information about the parent to which outbound policy
// is associated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the purpose of this is that it's the information necessary to construct the default backend (minus the orig_dst since that needs to be filled in outside of the index). It might be helpful to have a comment explaining that that is it's purpose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might consider calling this something other than "Meta" too since, metadata makes me think of the resource metadata (i.e. group, kind, name)

#[derive(Clone, Debug)]
pub enum OutboundDiscoverTarget {
Resource(ResourceTarget),
Fallback(SocketAddr),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A name like "External" might be more descriptive than Fallback.

}),

protocol: Some(outbound::ProxyProtocol {
kind: Some(outbound::proxy_protocol::Kind::Opaque(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that for external targets that are not covered by an EgressNetwork, we'll return Opaque.

Does this differ from the current behavior? I assume that today we return NotFound for these and then the proxy uses the profile resolution. Does this amount to being Detect? Should we return Detect here instead of Opaque?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens today is:

  • for addresses that we cannot find here and are potentially in the cluster we return NotFound. The proxy may then use profiles information if available. So this behavior is preserved with this logic.
  • for addresses that are outside of the cluster we return NotFound and the proxy will not be able to find a profile. So what happens is that we will synthesize a default Detect policy that uses forward backends. So here we need to do the same as opposed to returning an opaque.

So you are right. Good catch. For context, here is where we do the synthesizing in the proxy: https://github.com/linkerd/linkerd2-proxy/blob/ef779d09191713c5f25682fc7a61212bb6752d21/linkerd/app/outbound/src/discover.rs#L144

),
};

// This encoder sets deprecated timeouts for older proxies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is accurate for TLSRoute

),
};

// This encoder sets deprecated timeouts for older proxies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is accurate for TCPRoute

// holds information about resources. currently EgressNetworks and Services
resource_info: HashMap<ResourceRef, ResourceInfo>,
cluster_networks: Vec<linkerd_k8s_api::Cidr>,
fallback_polcy_tx: watch::Sender<()>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth some comments on the semantics of this Sender and what it's purpose is.

@@ -435,7 +435,7 @@ where
drop(_tracing);
}

if std::env::var("POLICY_TEST_NO_CLEANUP").is_err() {
if std::env::var("POLICY_TEST_NO_CLEANUP").is_ok() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right? it seems backwards. POLICY_TEST_NO_CLEANUP being ok means we should NOT delete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover

@@ -26,6 +26,7 @@ async fn service_does_not_exist() {
let mut policy_api = grpc::OutboundPolicyClient::port_forwarded(&client).await;
let rsp = policy_api.watch(&ns, &svc, 4191).await;

println!("{:?}", rsp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray println

Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
/// specific metadata that is used when putting together the final
/// policy response.
#[derive(Clone, Debug, PartialEq)]
pub enum ResourceOutboundPolicy {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this type is unused and can be removed.

Comment on lines 285 to 286
// This encoder sets deprecated timeouts for older proxies.
#![allow(deprecated)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed if we use ..Default::default() to fill in default values for request_timeout instead of setting to None explicitly.

Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
@zaharidichev zaharidichev merged commit b0f087c into main Oct 27, 2024
42 checks passed
@zaharidichev zaharidichev deleted the zd/egress-network-policy-3 branch October 27, 2024 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants