Clean up Endpoints object #16
base: main
Conversation
/assign
Can you also fix `make verify`?
```go
	// Also, try to finish before a potential 15 seconds termination grace timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 14*time.Second)
	defer cancel()
	seedClient := ha.manager.GetClient()
```
Nit: I wouldn't call it a seedClient. In all other places we call it `client`. Tomorrow, we might need to support the runtime cluster to scale gardener-apiserver or virtual-kube-apiserver.
Suggested change:
```diff
-seedClient := ha.manager.GetClient()
+client := ha.manager.GetClient()
```
Agreed, but since this program is talking to multiple clusters, I don't want to use "client". I agree with your point that the name will need to change in the future, but at that time I'll also have the context which will allow me to come up with the right generalisation, without resorting to the excessively general (IMO) "client".
pkg/ha/ha_service.go (outdated)
```go
	return err
}

// cleanUp is executed upon ending leadership. Its purpose is to remove the Endpoints object created upon acquiring
```
AFAIS, the func is also called on start-up, so it is not completely correct to state that it is executed only upon ending leadership.
In all places where it is executed, a leader position is being held and the process is about to terminate (or I have a bug I'm missing).
```go
	err = seedClient.Delete(ctx, &endpoints, deletionPrecondition)
	if client.IgnoreNotFound(err) == nil {
```
Suggested change:
```diff
-err = seedClient.Delete(ctx, &endpoints, deletionPrecondition)
-if client.IgnoreNotFound(err) == nil {
+deletionPrecondition := client.Preconditions{UID: &endpoints.UID, ResourceVersion: &endpoints.ResourceVersion}
+if err = seedClient.Delete(ctx, endpoints, deletionPrecondition); client.IgnoreNotFound(err) == nil {
```
I'm intentionally keeping "delete with precondition" on a separate line here. It's an uncommon construct, which is likely to give the reader a pause, and I don't want to force other logic on the same line.
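For context, here is a self-contained sketch of the construct being discussed, assuming controller-runtime's `client` package; the helper name and lookup key are hypothetical:

```go
package ha

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteEndpointsWithPrecondition is a hypothetical helper illustrating the
// "delete with precondition" construct: the UID and ResourceVersion
// preconditions make the delete fail with a conflict if another replica has
// recreated or modified the object since we read it.
func deleteEndpointsWithPrecondition(ctx context.Context, seedClient client.Client, key types.NamespacedName) error {
	var endpoints corev1.Endpoints
	if err := seedClient.Get(ctx, key, &endpoints); err != nil {
		return client.IgnoreNotFound(err)
	}

	// Kept on its own line deliberately; see the discussion above.
	deletionPrecondition := client.Preconditions{UID: &endpoints.UID, ResourceVersion: &endpoints.ResourceVersion}
	err := seedClient.Delete(ctx, &endpoints, deletionPrecondition)
	return client.IgnoreNotFound(err)
}
```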
```go
// leadership.
func (ha *HAService) cleanUp() error {
	// Use our own context. This function executes when the main application context is closed.
	// Also, try to finish before a potential 15 seconds termination grace timeout.
```
Where do these 15s of potential termination grace timeout come from? According to https://github.com/gardener/gardener-custom-metrics/blob/c43b2064794e5534f2a0d7a831285210620f9ed8/example/custom-metrics-deployment.yaml#L72 we should have 30s from the SIGTERM signal until the SIGKILL.
15 is a nice, round number, and also half of the default 30. I'm speculating that upon a hypothetical future shortening of the grace period, 15 will be a likely choice (the other obvious choice being 10, of course).
This is not a critical choice. I'm simply picking a value which is likely to work slightly better with potential future changes.
There is an option for the manager which allows specifying the termination grace period for all runnables: `GracefulShutdownTimeout`, and it defaults to 30s: https://github.com/kubernetes-sigs/controller-runtime/blob/76d3d0826fa9dca267c70c68c706f6de40084043/pkg/manager/internal.go#L55
Not sure it makes sense to use a (doubly) short time for this function.
Either way, if you have a strong reason not to use the default and not to make it configurable, but to keep it at 15s, can you please add the reason as a comment? Otherwise people will wonder where the magic number comes from.
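For reference, a minimal sketch of how that option could be set, assuming controller-runtime's `ctrl.Options`; the 15s value mirrors the discussion above and is an assumption, not the project's actual configuration:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// Assumed value, mirroring the 15s discussed above. When left nil,
	// controller-runtime defaults GracefulShutdownTimeout to 30s.
	gracefulShutdownTimeout := 15 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		GracefulShutdownTimeout: &gracefulShutdownTimeout,
	})
}
```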
On a first read-through I am a bit concerned about race conditions between multiple replicas if HA is used.
IIRC the motivation to add the cleanup here was so that we don't add a new RBAC rule for gardenlet to access endpoints.
However, couldn't we let the gardener-resource-manager simply deploy an empty Endpoints object together with the Service, as part of the ManagedResource that deploys GCMx (GRM has access to all resource kinds)? The ManagedResource would then delete the Endpoints object when GCMx is uninstalled.
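For illustration, a sketch of that "empty Endpoints" idea as a Go object; the name and namespace are assumptions, not taken from the actual GCMx deployment:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// emptyEndpoints sketches the Endpoints object gardener-resource-manager could
// deploy together with the Service as part of the GCMx ManagedResource.
func emptyEndpoints() *corev1.Endpoints {
	return &corev1.Endpoints{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "gardener-custom-metrics", // assumed to match the Service name
			Namespace: "garden",                  // assumed namespace
		},
		// Subsets deliberately left empty: the leader replica would patch in
		// its own address after winning the leader election.
	}
}
```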
```go
	attempt := 0
	var err error
	for {
```
I'd personally prefer to use `Poll<...>` here, unless there is a strong argument for the max number of attempts; then I would stick with the for loop.
Generally, one benefit is that `Poll<...>` is already tested. Another is that when someone tries to do a similar wait and sees the `for {...}` loop, they might decide to copy it and change it a bit, instead of simply reusing the `Poll<...>` function. As for domain-specificity: I think both GCMx and the functions in https://github.com/kubernetes/kubernetes/blob/v1.28.0/staging/src/k8s.io/apimachinery/pkg/util/wait/poll.go share the same domain.
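A minimal sketch of what the `Poll<...>` variant could look like, using `wait.PollUntilContextTimeout` from the linked package; the delete helper is a hypothetical stand-in:

```go
package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryDeleteEndpoints is a hypothetical stand-in for the actual delete call.
func tryDeleteEndpoints(ctx context.Context) error { return nil }

// cleanUpWithPoll replaces the hand-rolled attempt counter and sleep:
// retry every second, give up when the 10-second budget runs out.
func cleanUpWithPoll(ctx context.Context) error {
	return wait.PollUntilContextTimeout(ctx, 1*time.Second, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := tryDeleteEndpoints(ctx); err != nil {
				return false, nil // transient failure: retry on the next tick
			}
			return true, nil // done
		})
}
```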
```go
		if attempt >= 10 {
			break
		}
		time.Sleep(1 * time.Second)
```
What about using a `time.NewTicker`, or better yet `clock.RealClock.NewTicker`, which allows you to use a `Clock` interface that can be mocked for tests?
Tickers take into account the time that was actually spent executing the `Get`/`Update` calls. Additionally, since you will have to add a `select` statement for it, you can use the `select` to check whether the context has expired in the meantime.
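A sketch of that ticker-based loop, assuming `k8s.io/utils/clock`; the retry helper and its parameters are hypothetical:

```go
package main

import (
	"context"
	"time"

	"k8s.io/utils/clock"
)

// retryWithTicker runs attemptFn up to maxAttempts times, pacing retries with
// an injectable clock, and aborts as soon as the context is cancelled.
func retryWithTicker(ctx context.Context, clk clock.WithTicker, maxAttempts int, attemptFn func(context.Context) error) error {
	ticker := clk.NewTicker(1 * time.Second)
	defer ticker.Stop()

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = attemptFn(ctx); err == nil {
			return nil
		}
		select {
		case <-ticker.C():
			// Proceed to the next attempt; the tick interval already accounts
			// for the time spent inside attemptFn.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```

In production one would pass `clock.RealClock{}`; tests could inject the fake clock from `k8s.io/utils/clock/testing`.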
- Could you please fix the failing `make verify`?
- Could you rebase and then add a rule for deleting endpoints to the example RBAC?
Change description:
GCMx, upon exiting the leader role, now opportunistically deletes the service Endpoints object it configured for itself when it entered the leader role.
Notes:
If you want to test this function on a GCMx which was launched via skaffold debug, deleting the pod won't work, because PID=1 would be DLV, not GCMx. You'd need to either terminate leadership, or directly send SIGTERM to the GCMx process. With a pod which was not launched via skaffold/DLV, you can just delete the pod.
Before this change, HAService.Run() used to exit quickly - it only needed to arrange the Endpoints object. After the change, HAService.Run() blocks for the duration of the program execution, because it is now responsible for the cleanup upon program termination. This is reflected by a change in the unit tests.
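To make the new behavior concrete, here is a reduced sketch of the blocking `Run` shape described above; the helper methods are stand-ins, not the actual GCMx code:

```go
package main

import "context"

// HAService is reduced to what this sketch needs; the real type carries more state.
type HAService struct{}

func (ha *HAService) setUpEndpoints(ctx context.Context) error { return nil } // stand-in
func (ha *HAService) cleanUp() error                           { return nil } // stand-in

// Run now blocks for the lifetime of the program: it arranges the Endpoints
// object, waits for the application context to close, and then performs the
// opportunistic cleanup on the way out.
func (ha *HAService) Run(ctx context.Context) error {
	if err := ha.setUpEndpoints(ctx); err != nil {
		return err
	}
	<-ctx.Done()
	return ha.cleanUp()
}
```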