OCPBUGS-33750: Fix DNSNameResolver object status update issue #8
@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
openshift/release#52809 updates the golang version to 1.22 for the build_root in the CI config. Adding a hold until it merges.
0833873
to
a7e22da
Compare
/assign @alebedev87
d29acd8
to
133ce95
Compare
@arkadeepsen: This pull request references Jira Issue OCPBUGS-33750, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
c030e52
to
5c054f4
Compare
/hold cancel
handler.go
Outdated
		return err
	}
	resourceVersion = resolverObj.GetResourceVersion()
	log.Warningf("lister was stale at resourceVersion=%v, live get showed resourceVersion=%v", listerResourceVersion, resourceVersion)
Do we want the user to do something about it? And can we do something about it? If not, it should just be an Infof.
I'll change it to Infof.
Changed to Infof.
	// A DNS lookup request has been sent upon TTL expiration of the DNS name. Reset the timer to wait until twice of default
	// minimum TTL to perform the next lookup.
	timeTillNextLookup = 2 * defaultMinTTL
I'm not sure I'm getting this one. Do we kinda let the expired records be removed instead of trying to save them by resending the DNS request?
Also, it would be nice to group the logic of the next lookup computation into a dedicated function which would select on the channels and would give the next lookup time. This would allow us to unit test it and clearly see the use cases.
> I'm not sure I'm getting this one. Do we kinda let the expired records be removed instead of trying to save them by resending the DNS request?
When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88. The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request has already been sent and we should wait for the update event for the corresponding DNSNameResolver object.
When we receive the update event, we also remove the expired IP addresses after a grace period of defaultMinTTL during reconciliation. The following code does this work:
coredns-ocp-dnsnameresolver/operator/controller/dnsnameresolver/controller.go
Lines 139 to 153 in a3f67ee
	// Check if the grace period is over for some of the IP addresses after the expiration of their respective TTLs. If so,
	// remove those IP addresses and update the status of the resource.
	if removalOfIPsRequired(&dnsNameResolver.Status) {
		if err := r.client.Status().Update(ctx, dnsNameResolver); err != nil {
			return reconcile.Result{}, err
		}
		return reconcile.Result{}, nil
	}
	// Check if the TTLs of some of the IP addresses have expired. If so, requeue
	// the reconcile request after the minimum remaining time until the grace
	// period gets over among that of the IP addresses with expired TTLs.
	if ttlExpired, remainingTime := reconcileRequired(&dnsNameResolver.Status); ttlExpired {
		return reconcile.Result{Requeue: true, RequeueAfter: remainingTime}, nil
	}
So this block of code will not receive the new IPs/TTLs until the expired IPs are removed. If it doesn't wait for 2 * defaultMinTTL, unnecessary lookup requests will again be sent to the CoreDNS pods.
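To make the grace-period reasoning concrete, here is a minimal, self-contained sketch of the idea, not the controller's actual code. The `resolvedAddress` type, the `pruneExpired` and `demo` functions, and the `gracePeriod` constant are hypothetical names introduced for illustration; the only values taken from this PR are the 5-second grace period (assumed equal to defaultMinTTL) and the behavior of removing an IP once its TTL plus the grace period has elapsed.

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriod is assumed to equal defaultMinTTL (5 seconds per the PR description).
const gracePeriod = 5 * time.Second

// resolvedAddress is a hypothetical stand-in for one IP entry in the object status.
type resolvedAddress struct {
	IP         string
	TTL        time.Duration
	LastLookup time.Time
}

// pruneExpired drops addresses whose TTL plus grace period has elapsed. For
// addresses that are expired but still inside the grace period, it reports the
// shortest remaining time, which a reconciler could use as its requeue delay.
// A zero requeueAfter means no address is currently waiting out its grace period.
func pruneExpired(addrs []resolvedAddress, now time.Time) (kept []resolvedAddress, requeueAfter time.Duration) {
	for _, a := range addrs {
		expiry := a.LastLookup.Add(a.TTL)
		removeAt := expiry.Add(gracePeriod)
		if !now.Before(removeAt) {
			continue // grace period is over: remove this address
		}
		kept = append(kept, a)
		if !now.Before(expiry) {
			// TTL expired but still within the grace period.
			if remaining := removeAt.Sub(now); requeueAfter == 0 || remaining < requeueAfter {
				requeueAfter = remaining
			}
		}
	}
	return kept, requeueAfter
}

// demo builds three illustrative addresses around a fixed point in time.
func demo() ([]resolvedAddress, time.Duration) {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	addrs := []resolvedAddress{
		{IP: "10.0.0.1", TTL: 10 * time.Second, LastLookup: now.Add(-20 * time.Second)}, // past grace period
		{IP: "10.0.0.2", TTL: 10 * time.Second, LastLookup: now.Add(-12 * time.Second)}, // expired, in grace period
		{IP: "10.0.0.3", TTL: 30 * time.Second, LastLookup: now},                        // still valid
	}
	return pruneExpired(addrs, now)
}

func main() {
	kept, requeueAfter := demo()
	fmt.Println(len(kept), requeueAfter)
}
```

In this sketch the lookup loop would gain nothing from retrying before the requeue delay has passed, which is the motivation given above for waiting 2 * defaultMinTTL.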
Separated the code for computing time till next lookup into a function and added unit test for the same.
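For readers following the thread, a rough sketch of what such a helper and its table-driven cases might look like is given below. This is not the function added in the PR: the name `timeTillNextLookup`, its signature, and in particular the behavior when the DNS name does not exist (falling back to the default maximum TTL) are assumptions made for illustration. The 5-second and 30-minute defaults come from the PR description, and the first case mirrors the one quoted later in this thread.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultMinTTL = 5 * time.Second  // from the PR description
	defaultMaxTTL = 30 * time.Minute // from the PR description
)

// timeTillNextLookup is a hypothetical helper, not the PR's implementation.
func timeTillNextLookup(dnsExists bool, remaining time.Duration) time.Duration {
	switch {
	case !dnsExists:
		// Assumption: with nothing to look up, wait for the default maximum TTL.
		return defaultMaxTTL
	case remaining > 0:
		// The TTL has not expired yet: wait for the remaining duration.
		return remaining
	default:
		// A lookup was just sent on TTL expiration: wait for the update event
		// and the grace period before the next lookup.
		return 2 * defaultMinTTL
	}
}

func main() {
	cases := []struct {
		name              string
		dnsExists         bool
		remainingDuration time.Duration
	}{
		{name: "DNS exists and remaining duration is not greater than 0", dnsExists: true, remainingDuration: 0},
		{name: "DNS exists and remaining duration is greater than 0", dnsExists: true, remainingDuration: 3 * time.Second},
		{name: "DNS does not exist (assumed behavior)", dnsExists: false, remainingDuration: 0},
	}
	for _, tc := range cases {
		fmt.Printf("%s: %v\n", tc.name, timeTillNextLookup(tc.dnsExists, tc.remainingDuration))
	}
}
```

Keeping the computation in a pure function like this is what makes it testable without timers or channels.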
> When the next lookup time is reached, we get a signal on the timer.C channel in line no. 88. The lookup for the DNS name happens in line no. 92. So when we reach here, the lookup request has already been sent and we should wait for the update event for the corresponding DNSNameResolver object.
But we get into this condition only when the remaining duration is 0. The case you added confirms this:
	{
		name:                       "DNS exists and remaining duration is not greater than 0",
		dnsExists:                  true,
		remainingDuration:          0,
		expectedTimeTillNextLookup: 2 * defaultMinTTL,
	},
So something doesn't add up here. The remaining duration says that we need to send a new lookup ASAP (as it was before this PR), but we delay it to catch up with the updates.
handler.go
Outdated
	var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
	var resourceVersion string
	var err error
-	var resolverObj *ocpnetworkapiv1alpha1.DNSNameResolver
-	var resourceVersion string
-	var err error
+	var (
+		resolverObj     *ocpnetworkapiv1alpha1.DNSNameResolver
+		resourceVersion string
+		err             error
+	)
	// If there are no IP addresses associated with the DNS name and the next lookup
	// time of the DNS name is already past the current time, then reset the next
	// lookup time to the default maximum TTL.
	if resolvedName.numIPs == 0 && !time.Now().Before(resolvedName.minNextLookupTime) {
		resolvedName.minNextLookupTime = time.Now().Add(defaultMaxTTL)
	}
I suppose that this can be covered in TestResolver. Can we do a test case for this one?
I think so. Let me try to add that test case.
Added a test case for this in TestResolver.
…lver object is different from the previous one in CoreDNS plugin
5c054f4
to
cabf2cb
Compare
@arkadeepsen: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
We will need to address the increasing complexity of the resolver code as the reviews become challenging. Letting the bug resolution move on. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alebedev87 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@arkadeepsen: Jira Issue OCPBUGS-33750: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-33750 has not been moved to the MODIFIED state. In response to this:
/cherrypick release-4.16
@arkadeepsen: new pull request created: #13 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This PR fixes the issue with DNSNameResolver object status update.
The DNSNameResolver controller was sending DNS requests for a DNS name whose IP had expired at an interval of 1ms. Because the update event for the DNSNameResolver object does not reach the controller within the next 1ms, the controller sent the DNS request again. This produced numerous DNS requests within a very short period of time.
The interval is changed to twice the default minimum TTL (5 seconds). This interval accounts for the time the controller needs to receive the update event as well as the grace period (5 seconds) before any IP address whose TTL has expired is removed. This avoids sending any extra DNS requests after the first DNS request following the TTL expiration.
Additionally, if the latest queries for a DNS name did not return any new address, so that the associated IP addresses are removed after TTL expiration, the next lookup time for that DNS name is set to the default maximum TTL (30 minutes). This avoids sending DNS requests for that DNS name at an interval of twice the default minimum TTL.
On the CoreDNS plugin side, the RetryOnConflict block was using the lister to get the DNSNameResolver object and then the client to update the status. However, if a conflict occurs, the lister does not get the updated object immediately and the update fails again with a conflict. To minimize conflict errors on update, the client is used instead of the lister to get the latest object if the resourceVersion of the DNSNameResolver object is the same as in the previous GET call (inspired by openshift/library-go#1668).
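The stale-lister fallback can be illustrated with a small simulation that uses only the standard library. It is a sketch of the pattern described above and does not use client-go: the `server`, `cache`, and `object` types, the `errConflict` error, and the `updateStatus` function are all hypothetical stand-ins for the API server, the lister, the DNSNameResolver object, a conflict error, and the retry block respectively.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an API conflict error (hypothetical).
var errConflict = errors.New("conflict")

// object is a hypothetical stand-in for a DNSNameResolver object.
type object struct {
	ResourceVersion string
	Status          string
}

// server simulates the API server: updates succeed only against the latest version.
type server struct {
	obj     object
	version int
}

func (s *server) get() object { return s.obj }

func (s *server) update(o object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	s.version++
	s.obj = object{ResourceVersion: fmt.Sprint(s.version), Status: o.Status}
	return nil
}

// cache simulates a lister whose copy may lag behind the server.
type cache struct{ obj object }

func (c *cache) get() object { return c.obj }

// updateStatus retries on conflict. It reads from the cache first, and switches
// to a live read when the cache returns the same resourceVersion as the
// previous attempt, since that version has already been shown to conflict.
func updateStatus(c *cache, s *server, status string, maxAttempts int) (int, error) {
	var err error
	prevResourceVersion := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		obj := c.get()
		if obj.ResourceVersion == prevResourceVersion {
			obj = s.get() // the cache is stale: fall back to a live GET
		}
		prevResourceVersion = obj.ResourceVersion
		obj.Status = status
		err = s.update(obj)
		if !errors.Is(err, errConflict) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	s := &server{obj: object{ResourceVersion: "2"}, version: 2}
	c := &cache{obj: object{ResourceVersion: "1"}} // lags behind the server
	attempts, err := updateStatus(c, s, "resolved", 5)
	fmt.Println(attempts, err, s.obj.Status)
}
```

With a cache that never catches up, reading only from the cache would exhaust every attempt; the live read lets the second attempt succeed in this simulation.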