Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-33750: Fix DNSNameResolver object status update issue #8

Merged
merged 2 commits into from
Jul 12, 2024

Conversation

arkadeepsen
Copy link
Member

@arkadeepsen arkadeepsen commented Jun 5, 2024

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was sending DNS requests for a DNS name whose IP has expired at an interval 1ms. As the update event of the DNSNameResolver object is not received by the DNSNameResolver controller within the next 1ms, the controller again sent the DNS request. This resulted in the creation of numerous DNS requests within a very short period of time. The interval is changed to 2 times of the default minimum TTL (5 seconds). This interval will account for the time required for getting the update event by the controller as well as the grace period (5 seconds) to remove any IP address whose TTL has expired. This will avoid sending any extra DNS requests after the first DNS request post the TTL expiration. Additionally, if for any DNS name the latest queries did not return any new address which results in removal of the associated IP addresses after TTL expiration, then the next lookup time for such DNS names is set to default maximum TTL (30 minutes). This will avoid creation of DNS requests for the DNS name at an interval of 2 times of the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block was using the lister to get the DNSNameResolver object and then the client to update the status. However, if a conflict occurs the lister does not get the updated object immediately and the update again fails due to conflict. To minimize the conflict error on update, the client will be used to get the latest object instead of the lister if the resourceVersion of the DNSNameResolver object is same as the previous GET call (inspired by openshift/library-go#1668).

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 5, 2024
@openshift-ci-robot
Copy link

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was sending DNS requests for a DNS name whose IP has expired at an interval 1ms. As the update event of the DNSNameResolver object is not received by the DNSNameResolver controller within the next 1ms, the controller again sent the DNS request. This resulted in the creation of numerous DNS requests within a very short period of time. The interval is changed to 2 times of the default minimum TTL (5 seconds). This interval will account for the time required for getting the update event by the controller as well as the grace period (5 seconds) to remove any IP address whose TTL has expired. This will avoid sending any extra DNS requests after the first DNS request post the TTL expiration. Additionally, if for any DNS name the latest queries did not return any new address which results in removal of the associated IP addresses after TTL expiration, then the next lookup time for such DNS names is set to default maximum TTL (30 minutes). This will avoid creation of DNS requests for the DNS name at an interval of 2 times of the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block was using the cache to get the DNSNameResolver object and then the client to update the status. However, if a conflict occurs the cache does not get the updated object immediately and the update again fails due to conflict. To minimize the conflict error on update, the client will be used to get the latest object instead of the cache.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from knobunc and Miciah June 5, 2024 06:00
@arkadeepsen
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 5, 2024
@openshift-ci-robot
Copy link

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from huiran0826 June 5, 2024 06:00
@arkadeepsen
Copy link
Member Author

openshift/release#52809 updates the golang version to 1.22 for the build_root in the CI config. Adding a hold until it merges.
/hold

@candita
Copy link

candita commented Jun 5, 2024

/assign @alebedev87
/assign

@arkadeepsen arkadeepsen force-pushed the check-not-found-error branch 2 times, most recently from d29acd8 to 133ce95 Compare June 7, 2024 11:32
@openshift-ci-robot
Copy link

@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

In response to this:

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was sending DNS requests for a DNS name whose IP has expired at an interval 1ms. As the update event of the DNSNameResolver object is not received by the DNSNameResolver controller within the next 1ms, the controller again sent the DNS request. This resulted in the creation of numerous DNS requests within a very short period of time. The interval is changed to 2 times of the default minimum TTL (5 seconds). This interval will account for the time required for getting the update event by the controller as well as the grace period (5 seconds) to remove any IP address whose TTL has expired. This will avoid sending any extra DNS requests after the first DNS request post the TTL expiration. Additionally, if for any DNS name the latest queries did not return any new address which results in removal of the associated IP addresses after TTL expiration, then the next lookup time for such DNS names is set to default maximum TTL (30 minutes). This will avoid creation of DNS requests for the DNS name at an interval of 2 times of the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block was using the lister to get the DNSNameResolver object and then the client to update the status. However, if a conflict occurs the lister does not get the updated object immediately and the update again fails due to conflict. To minimize the conflict error on update, the client will be used to get the latest object instead of the lister if the resourceVersion of the DNSNameResolver object is same as the previous GET call.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@arkadeepsen arkadeepsen force-pushed the check-not-found-error branch 2 times, most recently from c030e52 to 5c054f4 Compare June 13, 2024 06:36
@arkadeepsen
Copy link
Member Author

/hold cancel
openshift/release#52809 is merged

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 4, 2024
handler.go Outdated
return err
}
resourceVersion = resolverObj.GetResourceVersion()
log.Warningf("lister was stale at resourceVersion=%v, live get showed resourceVersion=%v", listerResourceVersion, resourceVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the user to do something about it? And can we do something about it? If no, it should be just an Infof.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to Infof.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Infof.

Comment on lines 117 to 119
// A DNS lookup request has been sent upon TTL expiration of the DNS name. Reset the timer to wait until twice of default
// minimum TTL to perform the next lookup.
timeTillNextLookup = 2 * defaultMinTTL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm getting this one. Do we kinda let the expired records to be removed instead of trying to save them resending the dns request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be nice to group the logic of the next lookup computation into a dedicated function which would select on the channels and would give the next lookup time. This would allow us to unit test it and clearly see the use cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm getting this one. Do we kinda let the expired records to be removed instead of trying to save them resending the dns request?

When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88 . The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request is already sent and we should wait for the update event for the corresponding DNSNameResolver object.

When we receive the update event, we also remove the expired IP addresses after a grace period of defaultMinTTL during reconciliation. The following code does this work:

// Check if the grace period is over for some of the IP addresses after the expiration of their respective TTLs. If so,
// remove those IP addresses and update the status of the resource.
if removalOfIPsRequired(&dnsNameResolver.Status) {
if err := r.client.Status().Update(ctx, dnsNameResolver); err != nil {
return reconcile.Result{}, err
}
return reconcile.Result{}, nil
}
// Check if the TTLs of some of the IP addresses have expired. If so, requeue
// the reconcile request after the minimum remaining time until the grace
// period gets over among that of the IP addresses with expired TTLs.
if ttlExpired, remainingTime := reconcileRequired(&dnsNameResolver.Status); ttlExpired {
return reconcile.Result{Requeue: true, RequeueAfter: remainingTime}, nil
}

So, this block of code will not receive the new IPs/TTLs until the expired IPs are removed. If it doesn't wait until 2 * defaultMinTTL, unnecessary lookup requests will again be sent to the CoreDNS pods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separated the code for computing time till next lookup into a function and added unit test for the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88 . The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request is already sent and we should wait for the update event for the corresponding DNSNameResolver object.

But we get into this condition only when the remaining duration is 0. The case you added confirms this:

		{
			name:                       "DNS exists and remaining duration is not greater than 0",
			dnsExists:                  true,
			remainingDuration:          0,
			expectedTimeTillNextLookup: 2 * defaultMinTTL,
		},

So something doesn't stick here. The remaining duration says that we need to send a new lookup asap (like it was before this PR) but we delay it to catchup with the updates.

handler.go Outdated
Comment on lines 585 to 587
var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
var resourceVersion string
var err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
var resourceVersion string
var err error
var (
resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
resourceVersion string
err error
)

Comment on lines +308 to +318
// If there are no IP addresses associated with the DNS name and the next lookup
// time of the DNS name is already past the current time, then reset the next
// lookup time to the default maximum TTL.
if resolvedName.numIPs == 0 && !time.Now().Before(resolvedName.minNextLookupTime) {
resolvedName.minNextLookupTime = time.Now().Add(defaultMaxTTL)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that this can be covered in TestResolver. Can we do a test case for this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Let me try to add that test case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test case for the same in TestResolver.

@arkadeepsen arkadeepsen force-pushed the check-not-found-error branch from 5c054f4 to cabf2cb Compare July 5, 2024 16:26
Copy link

openshift-ci bot commented Jul 5, 2024

@arkadeepsen: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@alebedev87
Copy link
Contributor

We will need to address the increasing complexity of the resolver code as the reviews become challenging. Letting the bug resolution move on.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2024
Copy link

openshift-ci bot commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit af651ce into openshift:main Jul 12, 2024
7 checks passed
@openshift-ci-robot
Copy link

@arkadeepsen: Jira Issue OCPBUGS-33750: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-33750 has not been moved to the MODIFIED state.

In response to this:

This PR fixes the issue with DNSNameResolver object status update.

The DNSNameResolver controller was sending DNS requests for a DNS name whose IP has expired at an interval 1ms. As the update event of the DNSNameResolver object is not received by the DNSNameResolver controller within the next 1ms, the controller again sent the DNS request. This resulted in the creation of numerous DNS requests within a very short period of time. The interval is changed to 2 times of the default minimum TTL (5 seconds). This interval will account for the time required for getting the update event by the controller as well as the grace period (5 seconds) to remove any IP address whose TTL has expired. This will avoid sending any extra DNS requests after the first DNS request post the TTL expiration. Additionally, if for any DNS name the latest queries did not return any new address which results in removal of the associated IP addresses after TTL expiration, then the next lookup time for such DNS names is set to default maximum TTL (30 minutes). This will avoid creation of DNS requests for the DNS name at an interval of 2 times of the default minimum TTL.

On the CoreDNS plugin side, the RetryOnConflict block was using the lister to get the DNSNameResolver object and then the client to update the status. However, if a conflict occurs the lister does not get the updated object immediately and the update again fails due to conflict. To minimize the conflict error on update, the client will be used to get the latest object instead of the lister if the resourceVersion of the DNSNameResolver object is same as the previous GET call (inspired by openshift/library-go#1668).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@arkadeepsen
Copy link
Member Author

/cherrypick release-4.16

@openshift-cherrypick-robot

@arkadeepsen: new pull request created: #13

In response to this:

/cherrypick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants