Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhanced gateway api topology #751

Merged
merged 5 commits into from
Jul 22, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions controllers/authpolicy_status.go
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,8 @@ func (r *AuthPolicyReconciler) generateTopology(ctx context.Context) (*kuadrantg
})

return kuadrantgatewayapi.NewTopology(
kuadrantgatewayapi.WithAcceptedRoutesLinkedOnly(),
kuadrantgatewayapi.WithProgrammedGatewaysOnly(true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what problem this is solving, other than skipping applying the getaway overrides on top of route policies when the getaway is not in a programmed state. The getaway policy will still be reconciled as valid, the gateway AuthConfig will still be created (along and possibly clashing with the route one if it exists), and all the policy statuses will be reported as fine.

I'm afraid we may be here hiding the actual problem which in this case is that the gateway is not ready, therefore we shouldn't configure policies for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the described problem is not introduced by this PR, but neither it's fixed by it though. In fact, I'm afraid the change may even aggravate the problem.

The more fundamental problem is of course that, in this case, the topology is only marginal to the decisions we make. Most of the business logic for reconciling AuthPolicies still won't leverage the topology graph built by this function, while some already do.

By trying to fix only the topology graph, we're effectively providing two views of world, thus splitting the business logic into the portion that creates configuration based on one reality and another based on a different reality.

I fear that, by trying to do the right thing here, we're actually bringing the system to an inconsistent state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

I am confused. I will kindly ask you to elaborate more on this because I do not get it.

I will try to clarify some of your statements:

BTW, the described problem is not introduced by this PR, but neither it's fixed by it though.

This PR does not pretend to change a bit the behavior of any of the controllers. All the tests pass. There is no problem being described and thus, no problem being fixed.

This PR is adding few more options to the DAG topology, yet keeping the controllers behavior as they are. The intention is to add more tools/capabilities on the topology lib to enable following up PR's introducing the Limitador limits controller.

The authcontroller in current main branch is building a topology with only programmed gateways and linking only accepted routes. So is doing the code with the new changes in this PR. Nothing changes. And if something changes, let me know and I will fix it.

In fact, I'm afraid the change may even aggravate the problem.

First of all, what problem are you talking about?

Configuration for "programmed gateways only" option was introduced by #715. This PR is adding (among other public methods) yet another configuration option which is opt out from "link accepted routes only". As you know, currently the topology has hard-coded linking only accepted routes. This may not be always desirable. This feature is not needed now, but will be needed by the limits controller. The reasoning being that the controller will need a topology in its mappers which routes are linked to the gateway no matter they are accepted or not. But this is something I can explain further when I open PR for the limits controller.

I do not see any problem being aggravated. Again, this is not changing any controller behavior. And if it does, I will fix it.

By trying to fix only the topology graph, we're effectively providing two views of world, thus splitting the business logic into the portion that creates configuration based on one reality and another based on a different reality.

This is by no means fixing the topology. AFAIK, there is no issue with the current topology for the needs of the current controllers. This is about adding more configurability to the topology. Some business logic requires a type of topology with not programmed gateways as well as linked routes that have not been accepted. Specifically I am thinking on the mappers. Mappers need to walk the topology. And having disconnections because routes are not accepted prunes events to reconcile kuadrant policies.

Copy link
Contributor

@guicassolato guicassolato Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuration for "programmed gateways only" option was introduced by #715. […] As you know, currently the topology has hard-coded linking only accepted routes.

This indeed confirms the PR does not aggravate the problem, and it's enough for me to approve it. Thanks for explaining it and for collecting the evidences in the code to support it, @eguzki .

To elaborate on my main criticism, which I still stand for...

Disclaimer for all readers: please try to read in between the lines and focus on the generalized form of the problem I'm trying to describe, rather than the individual examples I use.

I missed #715. I think this is a very good example of how a well-intended change, that we needed indeed and that, looked individually, made our code better than before, but, at the same time, if observed holistically, emerges nonetheless as a change that creates a problem (or aggravates one) we didn't have before (or that simply had a smaller surface before.)

We spotted we were treating non-programmed gateways as if they were ready and thus we were signaling the policies configured on those gateways as enforced. This was clearly a bug and had to be fixed. We have tons of code that implement business logic based on a "view of the world" provided by this topology struct. We made that topology struct to only account for programmed gateways. Done. Problem solved. We fixed all those business logic with one move.

However, unfortunately there are controllers (possibly all of them AFAIK!) that mix business logic based on the topology struct with business logic that does not use the topology struct, where sometimes the status of the Gateways and HTTPRoutes are properly taken into account, sometimes not. (I mentioned the reconciliation of AuthConfigs before, but look at ComputeGatewayDiff for plenty of more examples.)

With the topology struct now doing what seems to be the right thing, we created two "views of the world" – one that we believe is right and another one that may or may not be right. These two views (as well as the decisions based on them) are often intertwined together, leading to a situation where we can no longer say "Kuadrant overlooks the state of the gateways. As a user, it's up to you to make sure to bring all your gateways to a programmed state. Until then, do not trust status of Kuadrant policies." Unfortunately, neither we can say today, after #715, "No worries! Kuadrant makes sure to always configure and report status of your policies right, according to the state of the gateways." Between the two statements, obviously we want to issue the latter, but if that's not true, then we're probably better off with the former IMO.

I used "programmed gateways" as example, but we can run over again replacing it with "accepted routes", or "listener status", and possibly others that do not occur to me right now. The important part is, while we are not consistent about the views of the world that the code bases its decisions upon, it becomes very hard to debug behaviors such as:

  • why a policy with status enforced doesn't seem to be enforced at all
  • why an override that seems not have been accepted appears to be taking over my route-level rules (or the other way around)
  • etc

To sum up, let's be careful with changes that seem to be improving the code but until they are rolled out completely they may actually put us (as well as our users) in a difficult position. This is not the case of this PR as I'm now convinced, and neither it is a negative criticism to the DAG/topology approach in general. Rather, I'm concerned with how we roll out these changes.

kuadrantgatewayapi.WithGateways(utils.Map(gwList.Items, ptr.To[gatewayapiv1.Gateway])),
kuadrantgatewayapi.WithRoutes(utils.Map(routeList.Items, ptr.To[gatewayapiv1.HTTPRoute])),
kuadrantgatewayapi.WithPolicies(policies),
Expand Down
1 change: 1 addition & 0 deletions controllers/ratelimitpolicy_enforced_status_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,7 @@ func (r *RateLimitPolicyEnforcedStatusReconciler) buildTopology(ctx context.Cont
}

return kuadrantgatewayapi.NewTopology(
kuadrantgatewayapi.WithAcceptedRoutesLinkedOnly(),
kuadrantgatewayapi.WithGateways(utils.Map(gatewayList.Items, ptr.To[gatewayapiv1.Gateway])),
kuadrantgatewayapi.WithRoutes(utils.Map(routeList.Items, ptr.To[gatewayapiv1.HTTPRoute])),
kuadrantgatewayapi.WithPolicies(utils.Map(policyList.Items, func(p kuadrantv1beta2.RateLimitPolicy) kuadrantgatewayapi.Policy { return &p })),
Expand Down
2 changes: 2 additions & 0 deletions controllers/ratelimitpolicy_limits.go
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,8 @@ func (r *RateLimitPolicyReconciler) buildTopology(ctx context.Context, policies
}

return kuadrantgatewayapi.NewTopology(
kuadrantgatewayapi.WithAcceptedRoutesLinkedOnly(),
kuadrantgatewayapi.WithProgrammedGatewaysOnly(true),
kuadrantgatewayapi.WithGateways(utils.Map(gwList.Items, ptr.To[gatewayapiv1.Gateway])),
kuadrantgatewayapi.WithRoutes(utils.Map(routeList.Items, ptr.To[gatewayapiv1.HTTPRoute])),
kuadrantgatewayapi.WithPolicies(policies),
Expand Down
2 changes: 2 additions & 0 deletions controllers/target_status_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,8 @@ func (r *TargetStatusReconciler) buildTopology(ctx context.Context, gw *gatewaya
}

t, err := kuadrantgatewayapi.NewTopology(
kuadrantgatewayapi.WithAcceptedRoutesLinkedOnly(),
kuadrantgatewayapi.WithProgrammedGatewaysOnly(true),
kuadrantgatewayapi.WithGateways([]*gatewayapiv1.Gateway{gw}),
kuadrantgatewayapi.WithRoutes(utils.Map(routeList.Items, ptr.To[gatewayapiv1.HTTPRoute])),
kuadrantgatewayapi.WithPolicies(policies),
Expand Down
19 changes: 19 additions & 0 deletions pkg/library/gatewayapi/helper_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -122,3 +122,22 @@ func testBasicRoutePolicy(name, namespace string, route *gatewayapiv1.HTTPRoute)
},
}
}

func testStandalonePolicy(name, namespace string) Policy {
return &TestPolicy{
TypeMeta: metav1.TypeMeta{
APIVersion: "example.com/v1",
Kind: "TestPolicy",
},
ObjectMeta: metav1.ObjectMeta{
Namespace: namespace,
Name: name,
},
TargetRef: gatewayapiv1alpha2.PolicyTargetReference{
Group: gatewayapiv1.Group(gatewayapiv1.GroupName),
Kind: gatewayapiv1.Kind("Gateway"),
Namespace: ptr.To(gatewayapiv1.Namespace("unknown")),
Name: gatewayapiv1.ObjectName("unknown"),
},
}
}
Loading
Loading