Add support for multiple gRPC client management #93

Merged
merged 10 commits into kubernetes-sigs:main from client-endpoint-management on Oct 27, 2023

Conversation

@levikobi (Member) commented Jul 19, 2023

Fixes #46.

@CLAassistant commented Jul 19, 2023

CLA assistant check: all committers have signed the CLA.

@shaneutt (Member)

Awesome! Thanks for putting in the effort on this one, let's run CI and see how we do.

Also, please check out #42 (comment): we should be very close to completing our donation to the CNCF, and we're trying to document CNCF CLA completion for all contributors as a requirement.

@shaneutt (Member) left a review comment

Thanks for jumping in to do this one @levikobi 🥳

One thing I noticed while reviewing is that we've added a bit of machinery of our own to reconcile things (e.g. the observer interface and the SetupReconcilation method) rather than leaning into the controller-runtime machinery as we had been doing previously. Could you tell me more about the decisions made here? Were there difficulties with trying to rely more on the traditional reconciliation approach?

@levikobi (Member, Author)

Thank you for letting me get involved @shaneutt.

Regarding your question, this is my first interaction with the controller-runtime SDK, so I tried doing my best while also keeping in mind that this change might not stay for too long.

I do understand from your comment, though, that I could have made better use of controller-runtime. 😅

If you can give me some tips about improving things, I will happily apply them. Anyway, I will try to go over the documentation again and see what I've missed.

Thanks :)

@shaneutt (Member)

We really appreciate you jumping in here!

We use operator-sdk (which uses kubebuilder under the hood, which in turn uses controller-runtime), and you can have it generate controllers for you that provide a standard Reconcile() callback and load it into your controller manager.
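
For illustration, a minimal sketch of that standard pattern is shown below; ExampleReconciler is a hypothetical name, not a controller from this repository:

package controllers

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// ExampleReconciler is a hypothetical reconciler following the standard
// pattern that operator-sdk/kubebuilder scaffolding generates.
type ExampleReconciler struct {
    client.Client
}

// Reconcile is the standard callback the controller manager invokes whenever
// a watched object changes.
func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var ds appsv1.DaemonSet
    if err := r.Get(ctx, req.NamespacedName, &ds); err != nil {
        // The object may have been deleted; ignore not-found errors.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // ... reconcile against the current state of the DaemonSet ...
    return ctrl.Result{}, nil
}

// SetupWithManager loads the reconciler into the controller manager.
func (r *ExampleReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&appsv1.DaemonSet{}).
        Complete(r)
}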

I wouldn't say we're strictly opposed to the solution you've put forth here, but I do think it changes how our controllers work in a way that handles more of the reconciliation machinery manually than I would have expected. If we continued using our previous pattern, I think we could just add one new controller that watches the DaemonSet for the containers to change and updates the dataplane client with any new changes. Overall:

  1. make the dataplane client support multiple backends
  2. provide threadsafe update functionality for the dataplane client (a rough sketch of 1 and 2 follows this list)
  3. add a new controller which responds to changes in Pods owned by the DaemonSet and updates the dataplane client accordingly
  4. ensure updates to the dataplane client's backends result in triggering a new configuration push
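
As referenced above, here is a rough sketch of what items 1 and 2 could look like; BackendsClientManager is a hypothetical name rather than the actual dataplane client in this repository, and the gRPC call details are elided:

package client

import (
    "sync"

    "google.golang.org/grpc"
)

// BackendsClientManager is a hypothetical, threadsafe holder for one gRPC
// connection per dataplane backend (e.g. one per DaemonSet pod).
type BackendsClientManager struct {
    mu       sync.RWMutex
    backends map[string]*grpc.ClientConn // keyed by pod name or IP
}

func NewBackendsClientManager() *BackendsClientManager {
    return &BackendsClientManager{backends: make(map[string]*grpc.ClientConn)}
}

// SetBackend adds or replaces the connection for a backend.
func (m *BackendsClientManager) SetBackend(name string, conn *grpc.ClientConn) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if old, ok := m.backends[name]; ok {
        old.Close()
    }
    m.backends[name] = conn
}

// RemoveBackend drops a backend, e.g. when its pod is deleted.
func (m *BackendsClientManager) RemoveBackend(name string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if conn, ok := m.backends[name]; ok {
        conn.Close()
        delete(m.backends, name)
    }
}

// Backends returns a snapshot of the current connections so callers can push
// configuration to every dataplane instance.
func (m *BackendsClientManager) Backends() []*grpc.ClientConn {
    m.mu.RLock()
    defer m.mu.RUnlock()
    conns := make([]*grpc.ClientConn, 0, len(m.backends))
    for _, c := range m.backends {
        conns = append(conns, c)
    }
    return conns
}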

I think ideally we would try to do all of this within the confines of the higher-level mechanisms (reconciler callback types and watches), though I admit number 4 is the thing that makes this hard to do. I think I would have to experiment a bit with the various watch sources provided by controller-runtime: using a channel source seems like it could be one reasonably simple way to do it, with all dataplane client updates resulting in a channel message that triggers re-reconciliation of TCPRoutes/UDPRoutes, though that has some efficiency issues and goes a bit against the grain of what the channel source was meant for.
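
A sketch of how such a channel-based watch could be wired up, assuming controller-runtime v0.15-era APIs (WatchesRawSource and the source.Channel struct; newer releases spell this slightly differently) and a hypothetical setupRouteController helper that is not part of this PR:

package controllers

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
    "sigs.k8s.io/controller-runtime/pkg/source"
    gatewayv1alpha2 "sigs.k8s.io/gateway-api/apis/v1alpha2"
)

// setupRouteController is a hypothetical helper: in addition to watching
// TCPRoutes, it wires in a raw channel source so that whenever the dataplane
// client's backends change, a GenericEvent pushed onto backendEvents
// re-enqueues the TCPRoute named in that event.
func setupRouteController(mgr ctrl.Manager, r reconcile.Reconciler, backendEvents chan event.GenericEvent) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&gatewayv1alpha2.TCPRoute{}).
        // controller-runtime v0.15.x API; other versions expose the channel
        // source under a slightly different shape.
        WatchesRawSource(
            &source.Channel{Source: backendEvents},
            &handler.EnqueueRequestForObject{},
        ).
        Complete(r)
}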

How about this: do some thinking and looking around, let me know if anything stands out to you, and I'll do the same, plus another round of review here to see if I can come up with anything. Ultimately I don't want you to feel like "you stepped in it" with this one, like you're getting pulled into more than you bargained for, so just know that I'm prioritizing your time as a contributor as well as working through some maintenance concerns.

@levikobi (Member, Author)

Thanks for the honest feedback @shaneutt! I understand.
I will finish my prior commitments at Gateway-API and ingress2gateway before the release date and come back to this one afterward to make things right.

@levikobi (Member, Author)

Hi @shaneutt, I hope this version better fits your intentions.
Waiting for your feedback 🙏🏻

@levikobi force-pushed the client-endpoint-management branch from 4c74635 to 8d65029 on August 17, 2023 08:21
@shaneutt (Member) left a review comment

Did a first pass of review; mostly I think we're looking pretty solid. I haven't dug deep into the route controllers yet. Most of my comments so far are minor. I do think the channel approach should be a workable solution. Thank you for putting the time and energy into this, we really appreciate it! I'll try to get to my next pass soon.

@levikobi force-pushed the client-endpoint-management branch from 4b3003b to 8cc1915 on August 19, 2023 05:46
@shaneutt (Member) left a review comment

So, keeping in mind that this functionality is not meant to exist long term, I think this is close to being ready and we can move forward with it. I added a comment about a TODO we should resolve or follow up on. @astoycos should take a look too, since this is a fairly large change (and there are multiple ways we could approach it) and two heads are better than one.

At a higher level though, it seems we're missing an integration/e2e style test that validates these changes. One way we might do this is make the default kind cluster that's created for the tests include 3 nodes instead of 1, but by default cordon off 2/3 of them.

Our testing framework provides a WithConfig() option for the kind cluster builder:

https://github.com/Kong/kubernetes-testing-framework/blob/main/pkg/clusters/types/kind/builder.go#L54

I think it might be reasonable for us to provide a 3-node configuration to that in an e2e test, cordon off the nodes, then remove the cordon to make the DaemonSet delete and re-create its pods. We could then verify that the configuration (for TCPRoute/UDPRoute) gets applied to the Pods as they come up, and test the traffic to make sure things are working as expected?
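
For the cordon/uncordon step, one way it might look with client-go, assuming the test already has a clientset for the kind cluster (the helper name here is hypothetical):

package integration

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// setNodeCordon is a hypothetical test helper that cordons (cordoned=true) or
// uncordons (cordoned=false) a node by toggling spec.unschedulable, mimicking
// `kubectl cordon` / `kubectl uncordon`.
func setNodeCordon(ctx context.Context, cs kubernetes.Interface, nodeName string, cordoned bool) error {
    node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    node.Spec.Unschedulable = cordoned
    _, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}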

Let me know your thoughts on this @levikobi, and if that sounds like a lot or if it sounds pretty straightforward? Curious if you have any other thoughts on testing as well. 🤔

@shaneutt (Member)

There are also some lint issues that need to be addressed, according to CI:

  Error: : # github.com/kong/blixt
  Error: ./main.go:95:54: undefined: Tee (typecheck)

@astoycos (Member) left a review comment

Awesome work @levikobi, thanks so much for tackling this :)

I generally have a bunch of nits and a few questions; otherwise I think this is a pretty cool design 👍. Let's just make sure we try to keep as clear a separation as possible between the operator-sdk-enabled functions/vars/uses and the custom channel logic you've added here, with comments and smart variable naming. I just push for this so as not to confuse folks in the future.

Additionally, we are adding a little complexity, so I'd like to make sure we have at least some unit/integration testing added to verify that all of the channel functionality is working as expected. Feel free to make an issue for this and tackle it in a follow-up, or ask someone else to :)

@@ -35,7 +36,9 @@ type TCPRouteReconciler struct {
    client.Client
Reviewer comment (Member):

We no longer need RBAC for DaemonSets here, right? (on the //+kubebuilder lines above)
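
For context, kubebuilder RBAC markers of the kind being discussed look like the lines below; the exact groups/resources/verbs used in this PR may differ:

// Markers like this, placed in the controller's Go file, are what
// `make manifests` turns into Role/ClusterRole rules; removing the marker
// removes the generated rule.
//+kubebuilder:rbac:groups=apps,resources=daemonsets,verbs=get;list;watch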

Reviewer comment (Member):

Can we make sure that all controller-specific RBAC tags are aligned with each dedicated controller file, and that any RBAC tags used by all of the controllers maybe go in the dataplane controller?

Sorry, I know it wasn't really a problem you introduced :/


return ctrl.NewControllerManagedBy(mgr).
    For(&appsv1.DaemonSet{},
        builder.WithPredicates(predicate.NewPredicateFuncs(r.daemonsetHasMatchingAnnotations)),
Reviewer comment (Member):

Can you possibly filter this even more? I.e., only care about events where the DaemonSet's status changes, since that's where we get the ready pods from anyway?

status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 1
  observedGeneration: 1
  updatedNumberScheduled: 1

@levikobi (Member, Author) commented on Aug 22, 2023:

I'm still relatively new to the tooling; did you mean something like this?

return ctrl.NewControllerManagedBy(mgr).
	For(&appsv1.DaemonSet{},
		builder.WithPredicates(predicate.NewPredicateFuncs(r.daemonsetHasMatchingAnnotations)),
	).
	WithEventFilter(predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			return true
		},
	}).
	Complete(r)

Reviewer reply (Member):

Yeah, close to that; see https://sdk.operatorframework.io/docs/building-operators/golang/references/event-filtering/

In bpfd we do something like:

// Only reconcile if a bpfprogram has been created for the controller's program type.
func BpfProgramTypePredicate(kind string) predicate.Funcs {
	return predicate.Funcs{
		GenericFunc: func(e event.GenericEvent) bool {
			return e.Object.(*bpfdiov1alpha1.BpfProgram).Spec.Type == kind
		},
		CreateFunc: func(e event.CreateEvent) bool {
			return e.Object.(*bpfdiov1alpha1.BpfProgram).Spec.Type == kind
		},
		UpdateFunc: func(e event.UpdateEvent) bool {
			return e.ObjectNew.(*bpfdiov1alpha1.BpfProgram).Spec.Type == kind
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			return e.Object.(*bpfdiov1alpha1.BpfProgram).Spec.Type == kind
		},
	}
}

Reviewer reply (Member):

(which is a bit verbose but has worked for us in the past)
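
Adapted to the DaemonSet watch in this PR, a predicate along those lines might look like the sketch below; the function name is hypothetical, and it only lets update events through when the DaemonSet's status actually changed:

package controllers

import (
    "reflect"

    appsv1 "k8s.io/api/apps/v1"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// daemonSetStatusChanged is a hypothetical predicate that filters update
// events down to those where the DaemonSet's status changed; for the other
// event types, predicate.Funcs defaults to allowing the event through.
func daemonSetStatusChanged() predicate.Funcs {
    return predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            oldDS, okOld := e.ObjectOld.(*appsv1.DaemonSet)
            newDS, okNew := e.ObjectNew.(*appsv1.DaemonSet)
            if !okOld || !okNew {
                return false
            }
            return !reflect.DeepEqual(oldDS.Status, newDS.Status)
        },
    }
}

It could then be passed to builder.WithPredicates alongside the existing annotation predicate, since WithPredicates accepts multiple predicates.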

@levikobi (Member, Author)

Sounds like a good idea; testing is definitely a must for this change. I'll try to get it done and pushed, probably during next week.

@levikobi (Member, Author)

Thanks for the review @astoycos!

@shaneutt (Member)

Ok, reach out to me if you need any help with the testing; there's some room for interpretation here, and it's not terribly straightforward.

@shaneutt shaneutt marked this pull request as draft October 10, 2023 20:12
@shaneutt (Member)

Hey! 👋

Just checking in. Need any help to move this one forward?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2023
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@levikobi force-pushed the client-endpoint-management branch from d711cda to c5b10cf on October 26, 2023 18:13
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 26, 2023
@levikobi force-pushed the client-endpoint-management branch 2 times, most recently from 439805b to 2b2324f on October 26, 2023 18:38
@levikobi force-pushed the client-endpoint-management branch from 2b2324f to 28f2ddb on October 27, 2023 05:25
@levikobi levikobi marked this pull request as ready for review October 27, 2023 05:52
@shaneutt (Member) left a review comment

Talked with @astoycos, and rather than hold this PR up and make it bigger (since it's already a pretty good size), we're aligned on taking what we have here and following up (thanks for creating follow-ups like #127).

So we'll follow up with some tweaks and tests, but so far integration is passing, thank you for all your effort on this @levikobi! 🖖

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: levikobi, shaneutt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2023
@shaneutt shaneutt added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8bb3a4e into kubernetes-sigs:main Oct 27, 2023
9 checks passed
@levikobi levikobi deleted the client-endpoint-management branch October 27, 2023 14:22
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/controlplane
area/dataplane
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
feature - New feature or request
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
tide/merge-method-squash - Denotes a PR that should be squashed by tide when it merges.

Successfully merging this pull request may close these issues.

Multiple grpc client endpoint management
5 participants