
Check resource generation when processing updates of some resources to skip config regeneration #1422

Merged
7 commits merged into nginxinc:main from the check-generation branch on Jan 18, 2024

Conversation

kevin85421
Contributor

@kevin85421 kevin85421 commented Dec 22, 2023

Proposed changes

Because I lack sufficient context for this repository, I decided to approach this PR step-by-step and ensure each step is correct.

Requirement 1: Move the generation check from ChangeProcessor to Controllers, for the types whose generation we already check

  • Step 1: Move the generation check from ChangeProcessor to Controllers, for the types whose generation we already check 0d20c61

    • Remove type generationChangedPredicate struct{}. (Update: Add it back in Step 2. See Step 2 for more details.)
    • In change_processor.go, there are four kinds of Kubernetes resources that use generationChangedPredicate. These include GatewayClass, Gateway, HTTPRoute, and ReferenceGrant.
    • In this commit, I only add controller.WithK8sPredicate(k8spredicate.GenerationChangedPredicate{}) for these four Kubernetes resources in manager.go (a rough sketch of this kind of registration appears after this list).
  • Step 2: Remove tests that do not involve a generation change from change_processor_test.go. (9e01812)

    • change_processor.go should not handle any cases with the same generation, as these should already be filtered out by the controller's GenerationChangedPredicate.
    • In Step 1, I removed type generationChangedPredicate struct{} and replaced the predicate for the four Kubernetes resources in change_processor.go with nil. However, predicate: nil indicates that the state change will not trigger a graph rebuild. See [Bug] Remove unused data from persistedGVKs #1358 for more details. This behavior is incompatible with the old behavior. Hence, I added type generationChangedPredicate struct{} back and made it always return true to keep the behavior compatible. We may need to remove it in the future.
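
For readers new to controller-runtime, here is a rough sketch of what the Step 1 change amounts to. This is not NGF's actual manager.go (which wires predicates through its own controller.WithK8sPredicate option); it uses the upstream builder API directly, and the function name registerGatewayController is made up for illustration.

package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// registerGatewayController registers a Gateway controller whose update events are
// filtered by GenerationChangedPredicate: updates that do not change
// metadata.generation (status- or metadata-only changes) never reach the reconciler,
// while create and delete events still pass through.
func registerGatewayController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&gatewayv1.Gateway{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}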

Requirement 2: Figure out which types miss generation check and add it.

TL;DR: There are no Kubernetes resources that are missing the generation check.

"Requirement 1" is entirely equivalent to the existing behavior, and "Requirement 2" introduces new behavior.

  • GatewayClass: The existing GatewayClassPredicate filters out all events that are not using NGF as the controller by checking Spec.ControllerName. Add k8spredicate.GenerationChangedPredicate to GatewayClass. See Check resource generation when processing updates of some resources to skip config regeneration #1422 (comment) for more details, and the sketch after this list for how the two predicates can be composed.
  • Gateway: Add k8spredicate.GenerationChangedPredicate to Gateway. See Check resource generation when processing updates of some resources to skip config regeneration #1422 (comment) for more details.
  • HTTPRoute: Add k8spredicate.GenerationChangedPredicate to HTTPRoute. The behavior remains equivalent, with or without this PR. See "Requirement 1" for more details.
  • Service:
    • The existing ServicePortsChangedPredicate filters out all events that do not have changes in Spec.Ports. If ServicePortsChangedPredicate is true, k8spredicate.GenerationChangedPredicate should always be true. Hence, we don't need to add k8spredicate.GenerationChangedPredicate.
    • The existing GatewayServicePredicate filters out all events without changes in Spec.Type and Status.LoadBalancer.Ingress. As I understand it, changes in Status do not increment the generation value. Therefore, k8spredicate.GenerationChangedPredicate might filter out some changes related to Status.LoadBalancer.Ingress. Hence, we should not add the predicate for this case.
    • It seems that the Kubernetes service does not have a metadata.generation field. I am not 100% sure, but it's highly likely.
      kubectl get svc $SVC_NAME -o yaml | grep generation # nothing should be shown.
  • Secret:
    • It seems that the Kubernetes Secret does not have a metadata.generation field. I am not 100% sure, but it's highly likely.
      kubectl get secret $SECRET_NAME -o yaml | grep generation # nothing should be shown.
  • EndpointSlice: The k8spredicate.GenerationChangedPredicate already exists.
  • Namespace:
    • It seems that the Kubernetes Namespace does not have a metadata.generation field. I am not 100% sure, but it's highly likely.
      kubectl get namespace $NS_NAME -o yaml | grep generation # nothing should be shown.
  • CRD
    • The existing AnnotationPredicate filters out all events without changes in metadata.Annotations. Changes to annotations on a custom resource will not increment the metadata.generation. Therefore, k8spredicate.GenerationChangedPredicate might filter out events that only involve changes in metadata.Annotations.
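
To make the GatewayClass point above concrete, here is a minimal sketch of composing predicates with controller-runtime's predicate.And, which is what the later "composite predicate" commit refers to. The controllerNamePredicate helper below is a made-up stand-in for NGF's real GatewayClassPredicate, shown only to illustrate the composition.

package controller

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// controllerNamePredicate is a hypothetical stand-in for GatewayClassPredicate: it only
// accepts GatewayClass objects whose spec.controllerName matches the given name.
func controllerNamePredicate(name gatewayv1.GatewayController) predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		gc, ok := obj.(*gatewayv1.GatewayClass)
		return ok && gc.Spec.ControllerName == name
	})
}

// gatewayClassEventFilter requires both checks to pass: the event must belong to the
// GatewayClass managed by this controller AND the update must change metadata.generation.
func gatewayClassEventFilter(name gatewayv1.GatewayController) predicate.Predicate {
	return predicate.And(
		controllerNamePredicate(name),
		predicate.GenerationChangedPredicate{},
	)
}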

Closes #825

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@bjee19
Contributor

bjee19 commented Dec 26, 2023

Hey @kevin85421, saw your comment here and just wanted to give you a heads up that the majority of the engineers on the team are currently on holiday until January 2nd. @sjberman may be able to take a look tomorrow, but if not, this may just sit for a little while. Hope that's alright!

@kevin85421
Contributor Author

@bjee19 Thanks for letting me know! Happy holidays!

@sjberman
Contributor

sjberman commented Jan 3, 2024

I'd just use one PR for all of these changes, and ensure the title of the PR defines what the high level fix is.

@sjberman
Contributor

sjberman commented Jan 3, 2024

At first glance I think the approach looks good.

Contributor

@kate-osborn kate-osborn left a comment


@kevin85421 thanks for working on this!

I left a few comments, but I think you are on the right track 👍

internal/mode/static/state/changed_predicate.go (review comments: outdated, resolved)
@kevin85421 kevin85421 force-pushed the check-generation branch 2 times, most recently from d14b413 to 5206313, on January 8, 2024 09:11
@kevin85421 kevin85421 changed the title from "[WIP] Move generation check from change_processor to controller" to "Move generation check from change_processor to controller" on Jan 8, 2024
@kevin85421
Contributor Author

Hi @sjberman @kate-osborn, I have already addressed the comments. In addition, there are no Kubernetes resources that miss the generation check, so there is no commit for "Requirement 2". See the PR description for more details. Could you provide a pointer on where to add tests, and are there any existing tests that I can refer to? Thanks!

@kevin85421 kevin85421 marked this pull request as ready for review January 8, 2024 09:21
@kevin85421 kevin85421 requested a review from a team as a code owner January 8, 2024 09:21
@sjberman sjberman added the enhancement New feature or request label Jan 8, 2024
@sjberman
Contributor

sjberman commented Jan 8, 2024

@kevin85421 If you could update the initial commit message to include the problem/solution structure, this will make it easier on our team when we merge it in. This way you can supply the commit message and we don't need to write one for you.

@kevin85421 kevin85421 changed the title from "Move generation check from change_processor to controller" to "Check resource generation when processing updates of some resources to skip config regeneration" on Jan 8, 2024
@github-actions github-actions bot removed the enhancement New feature or request label Jan 8, 2024
@kevin85421
Contributor Author

@kevin85421 If you could update the initial commit message to include the problem/solution structure, this will make it easier on our team when we merge it in. This way you can supply the commit message and we don't need to write one for you.

Thanks! I have updated the commit message. I originally planned to update it after adding enough tests 😅.

Contributor

@kate-osborn kate-osborn left a comment


@kevin85421 thanks for the details in the PR description.

The changes look good to me, I just left one small nitpicky comment.

As far as testing goes, we don't have any unit tests for the registerControllers function because they don't add a lot of value. We would just be checking that we are adding the correct options for each objectType.

In order to test this PR, I think we need to do the following:

  1. Run the conformance tests. I will kick off the pipeline which will run the conformance tests, so you don't need to run them locally.
  2. Manually verify that we don't reconcile events for GatewayClasses, Gateways, HTTPRoutes, and RefGrants when the generation doesn't change. This can be verified through the logs (look for the log line that starts with "Reconciling the resource"). I would also double-check that the GatewayClass And predicate works as expected.

@pleshakov does this seem like a good approach or do you think it's worth it to add a unit test for the registerControllers function?

internal/mode/static/manager.go (review comments: outdated, resolved)
@pleshakov
Contributor

@kate-osborn

In order to test this PR, I think we need to do the following:

Run the conformance tests. I will kick off the pipeline which will run the conformance tests, so you don't need to run them locally.
Manually verify that we don't reconcile events for GatewayClasses, Gateways, HTTPRoutes, and RefGrants when the generation doesn't change. This can be verified through the logs (look for the log line that starts with "Reconciling the resource"). I would also double-check that the GatewayClass And predicate works as expected.
@pleshakov does this seem like a good approach or do you think it's worth it to add a unit test for the registerControllers function?

that makes sense to me

If the controllers still reconcile when there is no gen change (because of the bug), will it trigger a rebuild of the graph and a reload of the NGINX configuration?
If that is the case, would it also make sense to create an issue to add test automation for that in the future? Note we don't have functional tests written yet, but we're planning to add them in the future. I think having such tests will help, along with the tests for the other reload-avoidance optimizations we're planning to implement -- #1112, #1123, #1124

@kevin85421
Contributor Author

kevin85421 commented Jan 12, 2024

Thanks, @kate-osborn and @pleshakov! I have just written down how I test this PR manually. Please let me know if this is not sufficient. Thanks!

  • Step 1: Install NGF, and a GatewayClass, which is a non-namespaced resource, will also be created.

  • Step 2: Create 3 * HTTPRoute, 1 * ReferenceGrant, and 1 * Gateway by running cafe-routes.yaml, gateway.yaml, and reference-grant.yaml.

  • Step 3: Run this script to list the generation of every GatewayClass, Gateway, HTTPRoute, and ReferenceGrant.

    [Non-namespaced resource]
    -----------------------------------
    GatewayClasses:
    nginx   1
    
    
    
    [Namespaced resources]
    -----------------------------------
    Namespace: certificate
    ReferenceGrants:
    access-to-cafe-secret   1
    -----------------------------------
    Namespace: default
    Gateways:
    gateway   1
    HttpRoutes:
    cafe-tls-redirect   1
    coffee              1
    tea                 1
    -----------------------------------
    Namespace: kube-node-lease
    -----------------------------------
    Namespace: kube-public
    
  • Step 4: Add an annotation to all GatewayClass, Gateway, HTTPRoute, and ReferenceGrant objects. Then, repeat Step 3; the generation should not change.

  • Step 5: Wait for a while. Check here if you are interested in my log file.

    kubectl logs -n nginx-gateway my-release-nginx-gateway-fabric-67dfb666b9-jb7tk | tee log
  • Step 6: As shown in Step 3, the generation of every related resource is 1. The number of reconciliations for each related Kubernetes resource type is equal to the count of resources of that type. Check the log file:

    • "Reconciling the resource","controller":"httproute" => 3
    • "Reconciling the resource","controller":"referencegrant" => 1
    • "Reconciling the resource","controller":"gateway" => 1
    • "Reconciling the resource","controller":"gatewayclass" => 1
    • The result matches my expectations.
  • Step 7: Check Prometheus metrics. See my metrics for more details.

    controller_runtime_reconcile_total{controller="httproute",result="error"} 0
    controller_runtime_reconcile_total{controller="httproute",result="requeue"} 0
    controller_runtime_reconcile_total{controller="httproute",result="requeue_after"} 0
    controller_runtime_reconcile_total{controller="httproute",result="success"} 3
    ...
    controller_runtime_reconcile_total{controller="gateway",result="error"} 0
    controller_runtime_reconcile_total{controller="gateway",result="requeue"} 0
    controller_runtime_reconcile_total{controller="gateway",result="requeue_after"} 0
    controller_runtime_reconcile_total{controller="gateway",result="success"} 1
    ...
    controller_runtime_reconcile_total{controller="gatewayclass",result="error"} 0
    controller_runtime_reconcile_total{controller="gatewayclass",result="requeue"} 0
    controller_runtime_reconcile_total{controller="gatewayclass",result="requeue_after"} 0
    controller_runtime_reconcile_total{controller="gatewayclass",result="success"} 1
    ...
    controller_runtime_reconcile_total{controller="referencegrant",result="error"} 0
    controller_runtime_reconcile_total{controller="referencegrant",result="requeue"} 0
    controller_runtime_reconcile_total{controller="referencegrant",result="requeue_after"} 0
    controller_runtime_reconcile_total{controller="referencegrant",result="success"} 1
    

@pleshakov
Contributor

Hi @kevin85421

Step 4: Wait for a while. In my experiment, I wait for 40 minutes. Check here if you are interested in my log file.

I wonder why we wait for 40 mins. Can we wait for less time?

I think we can avoid waiting at all if we change a GatewayClass/Gateway/HTTPRoute/ReferenceGrant in the API without changing its generation. We can do that by adding or updating an annotation on the resource - such an operation doesn't change the resource's generation.

For example, if I add an annotation to an HTTPRoute resource:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    hello: hello
 ...

However, note that metadata.resourceVersion will change after such an update.
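
A tiny sketch (not part of the PR) of why such an annotation-only update is filtered at the controller level: GenerationChangedPredicate compares metadata.generation on the old and new objects, and annotation changes don't bump the generation, so the update event is dropped. The function name below is made up for illustration.

package controller

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func annotationOnlyUpdateIsFiltered() {
	oldRoute := &gatewayv1.HTTPRoute{
		ObjectMeta: metav1.ObjectMeta{Name: "coffee", Namespace: "default", Generation: 1},
	}
	newRoute := oldRoute.DeepCopy()
	newRoute.Annotations = map[string]string{"hello": "hello"} // generation stays at 1

	p := predicate.GenerationChangedPredicate{}
	process := p.Update(event.UpdateEvent{ObjectOld: oldRoute, ObjectNew: newRoute})
	fmt.Println(process) // false: the update is filtered out and never reaches the reconciler
}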

Currently, I see the following in the logs:

{"level":"info","ts":"2024-01-12T17:05:31Z","msg":"Reconciling the resource","controller":"httproute","controllerGroup":"gateway.networking.k8s.io","controllerKind":"HTTPRoute","HTTPRoute":{"name":"coffee","namespace":"default"},"namespace":"default","name":"coffee","reconcileID":"35e96c96-5dff-42c6-b05c-90fdec385e70"}
{"level":"info","ts":"2024-01-12T17:05:31Z","msg":"Upserted the resource","controller":"httproute","controllerGroup":"gateway.networking.k8s.io","controllerKind":"HTTPRoute","HTTPRoute":{"name":"coffee","namespace":"default"},"namespace":"default","name":"coffee","reconcileID":"35e96c96-5dff-42c6-b05c-90fdec385e70"}
{"level":"info","ts":"2024-01-12T17:05:31Z","logger":"eventLoop","msg":"added an event to the next batch","type":"*events.UpsertEvent","total":1}
{"level":"info","ts":"2024-01-12T17:05:31Z","logger":"eventLoop.eventHandler","msg":"Handling events from the batch","batchID":27,"total":1}
{"level":"info","ts":"2024-01-12T17:05:31Z","logger":"eventLoop.eventHandler","msg":"Handling events didn't result into NGINX configuration changes","batchID":27}
{"level":"info","ts":"2024-01-12T17:05:31Z","logger":"eventLoop.eventHandler","msg":"Finished handling the batch","batchID":27}

Which should not happen as a result of your PR.

We can also look at Prometheus metrics (https://docs.nginx.com/nginx-gateway-fabric/how-to/monitoring/monitoring/#controller-runtime-metrics). They include controller-runtime metrics, so we can check the resource reconciliation counts. For example:

controller_runtime_reconcile_total{controller="httproute",result="error"} 0
controller_runtime_reconcile_total{controller="httproute",result="requeue"} 0
controller_runtime_reconcile_total{controller="httproute",result="requeue_after"} 0
controller_runtime_reconcile_total{controller="httproute",result="success"} 9

I expect the following to not change if our controllers successfully filter out updates that don't change the generation:

controller_runtime_reconcile_total{controller="httproute",result="success"} 9

@kate-osborn
Contributor

kate-osborn commented Jan 12, 2024

If the controllers still reconcile when there is no gen change (because of the bug), will it trigger a rebuild of the graph and a reload of the NGINX configuration? If that is the case, would it also make sense to create an issue to add test automation for that in the future?

@pleshakov Yes, it would trigger a rebuild and reload. I think an automated test is a good idea. I added an issue: #1463

@kevin85421
Contributor Author

@pleshakov Thank you for the reply!

I wonder why we wait for 40 mins. Can we wait for less time?

No. I just want to ensure the NGF controller doesn't requeue k8s resources unconditionally.

I have already updated #1422 (comment). I added annotations to all related Kubernetes resources and checked the Prometheus metrics.

@kevin85421
Contributor Author

Gentle ping - Do the tests in #1422 (comment) make sense to you? cc @kate-osborn @pleshakov Thanks!

Contributor

@kate-osborn kate-osborn left a comment


🚀

@kate-osborn
Contributor

Gentle ping - Do the tests in #1422 (comment) make sense to you? cc @kate-osborn @pleshakov Thanks!

Sorry for the delay, @kevin85421. The test results look good! Just approved ✅ Do you mind rebasing?

Problem: When processing updates to cluster resources, for some
resources we check their generation so that we don't trigger a state
change (graph rebuild) if the generation didn't change. This is a
performance optimization: if we don't rebuild the graph, we also don't
regenerate the NGINX config or reload it. However, this check is not
done in a K8s-native way and adds complexity to the codebase.

Solution: Use `GenerationChangedPredicate` in controller-runtime to
filter out the resource events at the controller level.

Use the `And` controller-runtime function to create a composite predicate.
Rename generationChangedPredicate to alwaysTruePredicate
Fix coding-style issue
@kevin85421
Contributor Author

Do you mind rebasing?

Thank you for the prompt reply! I have already rebased the branch onto the main branch.

@kate-osborn kate-osborn requested a review from pleshakov January 17, 2024 22:03
@pleshakov pleshakov merged commit e738ab0 into nginxinc:main Jan 18, 2024
27 checks passed
@pleshakov
Contributor

@kevin85421 thanks!!

@lucacome lucacome added the enhancement New feature or request label Mar 13, 2024
@pleshakov pleshakov mentioned this pull request Mar 21, 2024