OCPBUGS-33750: Fix DNSNameResolver object status update issue #8

arkadeepsen · 2024-06-05T05:59:41Z

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was re-sending DNS requests at a 1 ms interval for any DNS name whose IP address had expired. Because the update event for the DNSNameResolver object does not reach the controller within 1 ms, the controller sent the request again, producing a burst of DNS requests within a very short period. The interval is now twice the default minimum TTL (5 seconds), i.e. 10 seconds. This covers both the time needed for the controller to receive the update event and the grace period (5 seconds) after which an IP address with an expired TTL is removed, so no extra DNS requests are sent after the first request that follows TTL expiration. Additionally, if the latest queries for a DNS name return no new address, and the associated IP addresses are therefore removed after TTL expiration, the next lookup time for that DNS name is set to the default maximum TTL (30 minutes). This prevents DNS requests from being sent for that name at an interval of twice the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block used the lister to get the DNSNameResolver object and then the client to update its status. However, when a conflict occurs the lister does not see the updated object immediately, so the update fails again with another conflict. To reduce conflict errors on update, the client is now used to get the latest object instead of the lister whenever the resourceVersion of the DNSNameResolver object is the same as in the previous GET call (inspired by openshift/library-go#1668).
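The lister-versus-client fallback can be sketched with an in-memory stand-in for the API server and its cache. Everything below (`fakeAPI`, `resolverObj`, `addAddress`, the way the previous resourceVersion is tracked) is hypothetical and only approximates the idea described above; it does not use client-go and is not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// resolverObj is a hypothetical stand-in for the DNSNameResolver object.
type resolverObj struct {
	ResourceVersion string
	Addresses       []string
}

var errConflict = errors.New("conflict: object was modified")

// fakeAPI is a hypothetical API server with a lagging cache in front of it.
type fakeAPI struct {
	rev      int
	live     resolverObj // what the API server holds
	cached   resolverObj // what the lister returns
	liveGets int         // number of live reads through the client
}

func newFakeAPI() *fakeAPI {
	// The cache still holds version 1 while the server is already at version 2.
	return &fakeAPI{
		rev:    2,
		live:   resolverObj{ResourceVersion: "2"},
		cached: resolverObj{ResourceVersion: "1"},
	}
}

func (a *fakeAPI) listerGet() resolverObj { return a.cached }

func (a *fakeAPI) clientGet() resolverObj {
	a.liveGets++
	return a.live
}

// updateStatus rejects writes based on an outdated resourceVersion.
func (a *fakeAPI) updateStatus(obj resolverObj) error {
	if obj.ResourceVersion != a.live.ResourceVersion {
		return errConflict
	}
	a.rev++
	obj.ResourceVersion = strconv.Itoa(a.rev)
	a.live = obj
	return nil
}

// addAddress retries on conflict. It prefers the lister, but reads live
// through the client when the lister returns the same resourceVersion as on
// the previous attempt, since that version already led to a conflict.
func addAddress(api *fakeAPI, addr string, maxAttempts int) error {
	prevListerRV := ""
	for i := 0; i < maxAttempts; i++ {
		obj := api.listerGet()
		listerRV := obj.ResourceVersion
		if listerRV == prevListerRV {
			obj = api.clientGet() // the cache is stale, so bypass it
		}
		prevListerRV = listerRV

		obj.Addresses = append(append([]string(nil), obj.Addresses...), addr)
		err := api.updateStatus(obj)
		if !errors.Is(err, errConflict) {
			return err // nil on success, or a non-conflict error
		}
	}
	return errConflict
}

func main() {
	api := newFakeAPI()
	err := addAddress(api, "192.0.2.10", 5)
	fmt.Println(err, api.live.ResourceVersion, api.liveGets)
}
```

The design choice is to keep the cheap cached read as the default and pay for a live GET only after the cache has demonstrably failed once.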

openshift-ci-robot · 2024-06-05T05:59:47Z

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is invalid:

expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was re-sending DNS requests at a 1 ms interval for any DNS name whose IP address had expired. Because the update event for the DNSNameResolver object does not reach the controller within 1 ms, the controller sent the request again, producing a burst of DNS requests within a very short period. The interval is now twice the default minimum TTL (5 seconds), i.e. 10 seconds. This covers both the time needed for the controller to receive the update event and the grace period (5 seconds) after which an IP address with an expired TTL is removed, so no extra DNS requests are sent after the first request that follows TTL expiration. Additionally, if the latest queries for a DNS name return no new address, and the associated IP addresses are therefore removed after TTL expiration, the next lookup time for that DNS name is set to the default maximum TTL (30 minutes). This prevents DNS requests from being sent for that name at an interval of twice the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block used the cache to get the DNSNameResolver object and then the client to update its status. However, when a conflict occurs the cache does not see the updated object immediately, so the update fails again with another conflict. To reduce conflict errors on update, the client will be used to get the latest object instead of the cache.


arkadeepsen · 2024-06-05T06:00:38Z

/jira refresh

openshift-ci-robot · 2024-06-05T06:00:44Z

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826


arkadeepsen · 2024-06-05T07:31:16Z

openshift/release#52809 updates the golang version to 1.22 for the build_root in the CI config. Adding a hold until it merges.
/hold

candita · 2024-06-05T16:40:01Z

/assign @alebedev87
/assign

openshift-ci-robot · 2024-06-07T11:34:06Z

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826


arkadeepsen · 2024-07-04T07:52:12Z

/hold cancel
openshift/release#52809 is merged

alebedev87 · 2024-07-04T08:59:03Z

handler.go

+						return err
+					}
+					resourceVersion = resolverObj.GetResourceVersion()
+					log.Warningf("lister was stale at resourceVersion=%v, live get showed resourceVersion=%v", listerResourceVersion, resourceVersion)


Do we want the user to do something about it? And can we do something about it? If not, it should just be an Infof.

arkadeepsen: I'll change it to Infof.

arkadeepsen: Changed to Infof.

alebedev87 · 2024-07-04T11:41:33Z

operator/controller/dnsnameresolver/resolver.go

+				// A DNS lookup request has been sent upon TTL expiration of the DNS name. Reset the timer to wait until twice of default
+				// minimum TTL to perform the next lookup.
+				timeTillNextLookup = 2 * defaultMinTTL


I'm not sure I'm getting this one. Do we kind of let the expired records be removed instead of trying to save them by resending the DNS request?

Also, it would be nice to group the logic of the next lookup computation into a dedicated function which would select on the channels and give the next lookup time. This would allow us to unit test it and clearly see the use cases.

arkadeepsen:

> I'm not sure I'm getting this one. Do we kinda let the expired records to be removed instead of trying to save them resending the dns request?

When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88. The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request has already been sent and we should wait for the update event for the corresponding DNSNameResolver object.

When we receive the update event, we also remove the expired IP addresses after a grace period of defaultMinTTL during reconciliation. The following code does this work:

coredns-ocp-dnsnameresolver/operator/controller/dnsnameresolver/controller.go

Lines 139 to 153 in a3f67ee

	// Check if the grace period is over for some of the IP addresses after the expiration of their respective TTLs. If so,
	// remove those IP addresses and update the status of the resource.
	if removalOfIPsRequired(&dnsNameResolver.Status) {
		if err := r.client.Status().Update(ctx, dnsNameResolver); err != nil {
			return reconcile.Result{}, err
		}
		return reconcile.Result{}, nil
	}

	// Check if the TTLs of some of the IP addresses have expired. If so, requeue
	// the reconcile request after the minimum remaining time until the grace
	// period gets over among that of the IP addresses with expired TTLs.
	if ttlExpired, remainingTime := reconcileRequired(&dnsNameResolver.Status); ttlExpired {
		return reconcile.Result{Requeue: true, RequeueAfter: remainingTime}, nil
	}

So, this block of code will not receive the new IPs/TTLs until the expired IPs are removed. If it doesn't wait until 2 * defaultMinTTL, unnecessary lookup requests will again be sent to the CoreDNS pods.

arkadeepsen: Separated the code for computing the time till next lookup into a function and added a unit test for it.

alebedev87:

> When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88 . The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request is already sent and we should wait for the update event for the corresponding DNSNameResolver object.

But we get into this condition only when the remaining duration is 0. The case you added confirms this:

	{
		name:                       "DNS exists and remaining duration is not greater than 0",
		dnsExists:                  true,
		remainingDuration:          0,
		expectedTimeTillNextLookup: 2 * defaultMinTTL,
	},

So something doesn't stick here. The remaining duration says that we need to send a new lookup ASAP (like it was before this PR), but we delay it to catch up with the updates.

alebedev87 · 2024-07-04T12:06:32Z

handler.go

+				var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
+				var resourceVersion string
+				var err error


Suggested change

-				var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
-				var resourceVersion string
-				var err error
+				var (
+					resolverObj     *ocpnetworkapiv1alpha1.DNSNameResolver
+					resourceVersion string
+					err             error
+				)

alebedev87 · 2024-07-04T12:23:49Z

operator/controller/dnsnameresolver/resolver.go

+		// If there are no IP addresses associated with the DNS name and the next lookup
+		// time of the DNS name is already past the current time, then reset the next
+		// lookup time to the default maximum TTL.
+		if resolvedName.numIPs == 0 && !time.Now().Before(resolvedName.minNextLookupTime) {
+			resolvedName.minNextLookupTime = time.Now().Add(defaultMaxTTL)
+		}


I suppose that this can be covered in TestResolver. Can we do a test case for this one?

arkadeepsen: I think so. Let me try to add that test case.

arkadeepsen: Added a test case for this in TestResolver.


openshift-ci · 2024-07-05T16:31:26Z

@arkadeepsen: all tests passed!


alebedev87 · 2024-07-12T09:37:28Z

We will need to address the increasing complexity of the resolver code as the reviews become challenging. Letting the bug resolution move on.

/lgtm
/approve

openshift-ci · 2024-07-12T09:37:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87


Needs approval from an approver in each of these files:

~~OWNERS~~ [alebedev87]


openshift-ci-robot · 2024-07-12T09:41:37Z

@arkadeepsen: Jira Issue OCPBUGS-33750: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

openshift/cluster-dns-operator#415 is open

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-33750 has not been moved to the MODIFIED state.


arkadeepsen · 2024-07-16T06:26:39Z

/cherrypick release-4.16

openshift-cherrypick-robot · 2024-07-16T06:27:22Z

@arkadeepsen: new pull request created: #13


openshift-ci bot requested review from knobunc and Miciah June 5, 2024 06:00

openshift-ci-robot added jira/valid-bug and removed jira/invalid-bug labels Jun 5, 2024

openshift-ci bot requested a review from huiran0826 June 5, 2024 06:00

openshift-ci bot added the do-not-merge/hold label Jun 5, 2024

arkadeepsen force-pushed the check-not-found-error branch from 0833873 to a7e22da Compare June 5, 2024 09:45

This was referenced Jun 5, 2024

OCPBUGS-33750: UPSTREAM: <carry>: openshift: Bump the version of ocp_dnsnameresolver external plugin openshift/coredns#122

Merged

OCPBUGS-33750: Bump version of DNSNameResolver controller openshift/cluster-dns-operator#415

Merged

openshift-ci bot assigned alebedev87 and candita Jun 5, 2024

arkadeepsen force-pushed the check-not-found-error branch 2 times, most recently from d29acd8 to 133ce95 Compare June 7, 2024 11:32

arkadeepsen force-pushed the check-not-found-error branch 2 times, most recently from c030e52 to 5c054f4 Compare June 13, 2024 06:36

openshift-ci bot removed the do-not-merge/hold label Jul 4, 2024

alebedev87 reviewed Jul 4, 2024


arkadeepsen added 2 commits July 5, 2024 21:29

Prefer lister over client when the resourceVersion of the DNSNameResolver object is different from the previous one in CoreDNS plugin (cac3367)

Update duration of interval after TTL has expired in DNSNameResolver controller (cabf2cb)

arkadeepsen force-pushed the check-not-found-error branch from 5c054f4 to cabf2cb Compare July 5, 2024 16:26

openshift-ci bot added the lgtm label Jul 12, 2024

openshift-ci bot added the approved label Jul 12, 2024

openshift-merge-bot bot merged commit af651ce into openshift:main Jul 12, 2024
7 checks passed

openshift-cherrypick-robot mentioned this pull request Jul 16, 2024

[release-4.16] OCPBUGS-37078: Fix DNSNameResolver object status update issue #13

Merged
